PROPOSED MODEL
The proposed model adopts the formal knowledge discovery in databases (KDD) process to perform the data mining task, that is, to extract interesting patterns or knowledge from the dataset. The steps used in the methodology are as follows:
2. Creating a target data set from raw data
3. Data cleaning and preprocessing
4. Data reduction and transformation
5. Choosing data mining approach: Classification
6. Choosing the mining algorithm(s): Genetic Algorithm
§ Initialization
§ Selection
§ Crossover
§ Mutation
§ Termination
7. Data mining: patterns of interest found by GA
8. Evaluation of patterns and knowledge presentation: visualization, removing redundant patterns
9. Use of discovered knowledge
Description of the Proposed Model
A large amount of raw data was selected in advance from the medical domain. Data cleaning, preprocessing, reduction and transformation were performed with the Bioinformatics Toolbox in MATLAB, and the target dataset was then created from the raw data. The data mining approach used is classification of patients into normal and cancer patients.
The initial population consists of integer values that correspond to randomly selected rows of the mass spectrometry data. Each row of the population matrix is a random sample of row indices of the mass spec data.
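The initialization step can be sketched in a few lines of Python. This is only an illustration, not the MATLAB code used in the paper: the function name `create_population`, the population size and the number of features per individual are hypothetical, and the sketch assumes that the row indices within one individual are distinct.

```python
import random

def create_population(pop_size, n_features, n_rows, seed=None):
    """Build a population in which each individual is a random sample of row indices.

    Assumption: indices within one individual are drawn without replacement.
    """
    rng = random.Random(seed)
    return [rng.sample(range(n_rows), n_features) for _ in range(pop_size)]

# Hypothetical sizes; only the 15000 rows come from the data set described in the text.
population = create_population(pop_size=10, n_features=5, n_rows=15000, seed=0)
```

Each inner list plays the role of one row of the population matrix.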
Stochastic Universal Sampling (SUS) is used to select the parents for recombination (crossover). The selection function chooses the parents using a roulette wheel and uniform sampling, based on the expectation of each individual and the number of parents required. The roulette wheel has one slot per individual, and the size of each slot is equal to that individual's expectation. We then step through the wheel in equal-size steps, so that the entire wheel is covered in as many steps as there are parents. At each step, we create a parent from the slot we have landed in. This mechanism is fast and accurate.
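The wheel-stepping procedure described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the toolbox routine itself: the function name and the `rng` argument are hypothetical, and the expectations are assumed to be non-negative with a positive sum.

```python
import random

def stochastic_universal_sampling(expectation, n_parents, rng=random):
    """Return the indices of the selected parents.

    The wheel has one slot per individual, sized by its expectation.
    A single random offset is drawn, then the wheel is walked in
    n_parents equal steps.
    """
    total = sum(expectation)
    step = total / n_parents
    position = rng.random() * step      # random start inside the first step
    slot = 0
    cumulative = expectation[0]         # right edge of the current slot
    parents = []
    for _ in range(n_parents):
        # Advance to the slot that contains the current pointer position.
        while position >= cumulative and slot < len(expectation) - 1:
            slot += 1
            cumulative += expectation[slot]
        parents.append(slot)
        position += step
    return parents

# Hypothetical expectations for four individuals, selecting four parents.
selected = stochastic_universal_sampling([2.0, 1.0, 1.0, 0.0], 4, random.Random(0))
```

Because only one random number is drawn, the number of times an individual is selected stays close to its expectation, and an individual with zero expectation is never selected.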
Crossover
The crossover function is position-independent crossover. It creates the crossover children of the given population from the available parents. In single- or double-point crossover, genes that are near each other tend to survive together, whereas genes that are far apart tend to be separated. The technique used here eliminates that effect: each gene has an equal chance of coming from either parent. This is sometimes called uniform or random crossover.
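A minimal Python sketch of uniform crossover, assuming two parents of equal length, is given here for illustration. The function name and the `rng` argument are hypothetical and not taken from the paper.

```python
import random

def uniform_crossover(parent_a, parent_b, rng=random):
    """Create one child; each gene comes from either parent with probability 0.5."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]

# Hypothetical parents: lists of row indices.
parent_a = [10, 20, 30, 40]
parent_b = [15, 25, 35, 45]
child = uniform_crossover(parent_a, parent_b, random.Random(0))
```

Since the choice is made independently per position, the distance between two genes has no influence on whether they are inherited together.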
Mutation
The mutation applied is Gaussian mutation. It defines how the GA makes small random changes in the individuals in the population to create mutated children. Two parameters, scale and shrink, govern the mutation.
Scale controls what fraction of the gene's range is searched. A value of 0 results in no change, and a scale of 1 results in a distribution whose standard deviation is equal to the range of this gene. Intermediate values produce ranges in between these extremes.
Shrink controls how fast the scale is reduced as generations go by. A shrink value of 0 results in no shrinkage, yielding a constant search size. A value of 1 results in scale shrinking linearly to 0 as the GA progresses to the number of generations specified by the options structure. Intermediate values of shrink produce shrinkage between these extremes. Note that shrink may lie outside the interval (0, 1), but this is not advised. Since no values for scale or shrink are specified, both are set to 1.
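The effect of scale and shrink can be illustrated with a short Python sketch. It follows the description above, with the scale reduced linearly with the generation count and then multiplied by the gene's range, but the function and parameter names are hypothetical. The sketch works on real-valued genes; rounding or clipping to valid row indices is not shown.

```python
import random

def gaussian_mutation(parent, lower, upper, generation, max_generations,
                      scale=1.0, shrink=1.0, rng=random):
    """Add zero-mean Gaussian noise to every gene of the parent.

    The standard deviation is (current scale) * (gene range), where the
    current scale decreases linearly with the generation when shrink > 0.
    """
    current_scale = scale * (1.0 - shrink * generation / max_generations)
    return [gene + rng.gauss(0.0, 1.0) * current_scale * (hi - lo)
            for gene, lo, hi in zip(parent, lower, upper)]

# Hypothetical two-gene individual with gene range [0, 10], at generation 10 of 51.
mutant = gaussian_mutation([1.0, 2.0], [0.0, 0.0], [10.0, 10.0],
                           generation=10, max_generations=51,
                           rng=random.Random(0))
```

With scale = 0 the child equals the parent, and with shrink = 1 the noise vanishes once the generation count reaches the maximum.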
Termination
The algorithm terminates after the 51st generation or when the stall generation limit of 50 is reached.
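The two stopping conditions can be sketched as a simple check in Python. This is a simplified illustration: it assumes the fitness is minimized and treats a "stall" as no improvement of the best fitness beyond a small tolerance over the last 50 generations. The tolerance `tol` and the function name are hypothetical, and the criterion used by the MATLAB toolbox may be defined differently.

```python
def should_stop(best_history, max_generations=51, stall_limit=50, tol=1e-6):
    """Return True when the generation limit or the stall limit is reached.

    best_history holds the best (lowest) fitness of each generation so far.
    """
    generation = len(best_history)
    if generation >= max_generations:
        return True
    if generation > stall_limit:
        # Improvement of the best fitness over the last stall_limit generations.
        improvement = best_history[-stall_limit - 1] - best_history[-1]
        return improvement <= tol
    return False
```

The function would be called once per generation, after the best fitness of that generation has been appended to the history.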
SIMULATION RESULTS AND FURTHER DISCUSSIONS
The proposed model is implemented in MATLAB Version 7.0 and was executed on a machine with 2 GB of RAM and a Core i3 processor. The preprocessing step took about 20 to 25 minutes to prepare the data set.
The dataset contains three variables: Y, MZ and grp. Y has 15000 rows and 216 columns. Each column of Y holds the measurements taken from one patient, so the 216 columns represent 216 patients: 121 cancer patients and 95 normal patients.
MZ stores the 15000 mass/charge values. Each row of Y holds the ion intensity levels of all patients at the corresponding mass/charge value in MZ.
The variable grp holds the index information as to which of these samples represent cancer patients and which represent normal patients.
DISCRIMINATIVE FEATURES
The plot shows the features that have been selected by the genetic algorithm from the data set; the data is plotted with peak positions marked with red vertical lines. The methodology followed in this paper is to select a reduced set of measurements, or "features", that can be used to distinguish between cancer and control patients.
The GA discovered the optimal features, or peaks, in the mass spectrometry data. On the basis of the specific features extracted from the mass spectrometry data by the genetic algorithm, we distinguished the cancer patients from the control patients.
Future work should include experiments with other data sets, as well as more elaborate experiments to optimize several parameters of the algorithm, such as the mutation rate and the limit threshold for the weight field.