DATA MINING WITH GENETIC ALGORITHM: THE PROPOSED MODEL

 PROPOSED MODEL

            In the proposed model, the formal knowledge discovery in databases (KDD) process is adopted to perform the data mining task and to extract interesting patterns or knowledge from the dataset. The steps used in the methodology are as follows:

 1. Learning the application domain to extract relevant prior knowledge and the goals of the application.

2. Creating a target data set from raw data

3. Data cleaning and preprocessing

4. Data reduction and transformation

5. Choosing the data mining approach: classification

6. Choosing the mining algorithm(s): Genetic Algorithm

- Initialization

- Selection

- Crossover

- Mutation

- Termination

7. Data mining: patterns of interest are found by the GA.

8. Evaluation of patterns and knowledge presentation: visualization and removal of redundant patterns.

9. Use of the discovered knowledge.

 DESCRIPTION OF THE PROPOSED MODEL

            The vast amount of raw data has been selected from the medical domain in advance. Data cleaning, preprocessing, reduction and transformation have been carried out with the Bioinformatics Toolbox in MATLAB, and the target dataset has then been created from the raw data. The approach used to mine the data is classification of patients into normal and cancer patients.
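            A minimal sketch of this preprocessing stage, assuming standard Bioinformatics Toolbox routines (msresample, msbackadj, msnorm), is shown below; the raw-variable names and the parameter values are illustrative assumptions rather than the exact settings used in this work.

    % Illustrative preprocessing of the raw spectra (rawMZ, rawY are assumed names).
    [MZ, Y] = msresample(rawMZ, rawY, 15000);   % resample onto a common m/z grid
    Y = msbackadj(MZ, Y);                       % subtract the baseline drift
    Y = msnorm(MZ, Y, 'Max', 100);              % normalize the ion intensities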

             The GA is used for the classification task and searches for significant patterns of masses. The parameters used by the GA are as follows:

 Initialization

            The initialized population consists of integer values that correspond to randomly selected rows of the mass spectrometry data. Each row of the population matrix is a random sample of row indices of the mass spectrometry data.
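            A custom creation function of this kind could look like the sketch below; the function name and the use of randperm are assumptions made for illustration, following the toolbox's creation-function calling convention.

    % Hypothetical creation function: each individual is a random sample of
    % row indices (m/z positions) of the mass spectrometry data Y.
    function pop = createPopulation(GenomeLength, fitnessFcn, options)  % fitnessFcn unused
        totalRows = 15000;                        % number of rows in Y
        popSize   = sum(options.PopulationSize);  % individuals per generation
        pop = zeros(popSize, GenomeLength);
        for i = 1:popSize
            p = randperm(totalRows);              % random permutation of all row indices
            pop(i, :) = p(1:GenomeLength);        % keep the first GenomeLength indices
        end
    end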

 Selection

            Stochastic Universal Sampling (SUS) is used for the selection of parents for recombination (crossover). The selection function chooses the parents by combining a roulette wheel with uniform sampling, based on the expectation values and the number of parents required. Imagine a roulette wheel with one slot per individual, the size of each slot being equal to that individual's expectation. We then step through the wheel in equal-sized steps, so that the entire wheel is covered in as many steps as there are parents. At each step, a parent is taken from the slot we have landed in. This mechanism is fast and accurate.
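            The stepping mechanism described above can be sketched as follows; this is an illustrative re-implementation rather than the toolbox's own selectionstochunif function, and it assumes the expectation values sum to the number of parents.

    % Minimal sketch of stochastic universal sampling over the expectations.
    function parents = susSelect(expectation, nParents)
        wheel   = cumsum(expectation) / nParents;  % cumulative slot boundaries (end near 1)
        parents = zeros(1, nParents);
        stepPos = rand / nParents;                 % random start, then equal-sized steps
        idx = 1;
        for k = 1:nParents
            while idx < numel(wheel) && wheel(idx) < stepPos
                idx = idx + 1;                     % walk to the slot we land in
            end
            parents(k) = idx;
            stepPos = stepPos + 1/nParents;        % one step per parent
        end
    end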

Crossover

            The crossover function is a position-independent crossover. It creates the crossover children of the given population from the available parents. In single- or double-point crossover, genes that are near each other tend to survive together, whereas genes that are far apart tend to be separated; the technique used here eliminates that effect. Each gene has an equal chance of coming from either parent. This is sometimes called uniform or random crossover.
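            In the MATLAB GA toolbox this behaviour corresponds to the scattered crossover option; a minimal sketch of the per-gene coin flip on two toy parents is given below (the parent values are made up for illustration).

    % Scattered (uniform) crossover: each gene is copied from either parent
    % with equal probability.
    parent1 = [101 2054 7300 9888];               % example genomes of row indices
    parent2 = [555 1200 8100 14321];
    mask        = rand(1, numel(parent1)) > 0.5;  % random logical mask over the genes
    child       = parent1;
    child(mask) = parent2(mask);                  % masked genes are taken from parent2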

Mutation

            The mutation applied is Gaussian mutation. It determines how the GA makes small random changes in the individuals of the population to create mutated children. The scale parameter controls what fraction of each gene's range is searched.

            A scale value of 0 results in no change, while a scale of 1 results in a distribution whose standard deviation is equal to the range of the gene; intermediate values produce distributions in between these extremes. The shrink parameter controls how fast the scale is reduced as generations go by. A shrink value of 0 results in no shrinkage, yielding a constant search size.

        A shrink value of 1 results in the scale shrinking linearly to 0 as the GA progresses to the number of generations specified in the options structure; intermediate values of shrink produce shrinkage between these extremes. Note that shrink may be set outside the interval (0, 1), but this is not advised. Since no values for scale or shrink are specified, both are left at their default value of 1.
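            In the toolbox this mutation is selected through the MutationFcn option, as in the sketch below; the assumption here is that the built-in Gaussian mutation function is used with the scale and shrink values of 1 described above.

    % Gaussian mutation with scale = 1 and shrink = 1.
    options = gaoptimset('MutationFcn', {@mutationgaussian, 1, 1});
    % Conceptually, at generation gen out of maxGen the perturbation added to a
    % gene has a standard deviation of roughly
    %   sigma = scale * (1 - shrink * gen / maxGen) * (gene range),
    % i.e. it starts at scale times the gene range and shrinks toward 0.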

Termination

The algorithm terminates after 51 generations or when the stall generation limit of 50 is reached.
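            Putting the operators and stopping criteria together, the overall GA call could look like the sketch below; the fitness function name (classificationError), the number of selected features nVars, and the older ga(fitnessfcn, nvars, options) syntax of MATLAB 7.x are assumptions made for illustration.

    nVars   = 15;   % assumed number of m/z rows (features) per individual
    options = gaoptimset( ...
        'CreationFcn',   @createPopulation, ...         % hypothetical function sketched above
        'SelectionFcn',  @selectionstochunif, ...       % stochastic universal sampling
        'CrossoverFcn',  @crossoverscattered, ...       % position-independent crossover
        'MutationFcn',   {@mutationgaussian, 1, 1}, ... % scale = 1, shrink = 1
        'Generations',   50, ...                        % generation limit from the text
        'StallGenLimit', 50);                           % stall generation limit of 50
    [bestFeatures, bestError] = ga(@classificationError, nVars, options);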

SIMULATION RESULTS AND FURTHER DISCUSSIONS

            The proposed model is implemented in MATLAB Version 7.0. After implementation, the model was executed on a machine with 2 GB of RAM and an Intel Core i3 processor. The preprocessing step took about 20 to 25 minutes to prepare the dataset.

            There are three variables in the dataset: Y, MZ and grp. Y has 15000 rows and 216 columns. Each column in Y represents the measurements taken from one patient, so the 216 columns of Y correspond to 216 patients.

          The rows correspond to mass/charge values. In total there are 121 cancer patients and 95 normal patients. Each row in Y holds the ion intensity levels of all patients at a specific mass/charge value; the mass/charge values themselves are stored in MZ, which contains 15000 entries, one for each row of Y.

         The variable grp holds the index information indicating which of these samples represent cancer patients and which represent normal patients.
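         The three variables can be used along the following lines; the .mat file name and the exact group labels ('Cancer' and 'Normal') are assumptions, since the text only states that grp indexes the two classes.

    load ovarian_dataset                  % assumed file providing Y, MZ and grp
    cancerIdx = strcmp('Cancer', grp);    % logical index of the 121 cancer patients
    normalIdx = strcmp('Normal', grp);    % logical index of the 95 normal patients
    size(Y)                               % 15000 x 216: m/z rows by patients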

DISCRIMINATIVE FEATURES

            The features selected by the genetic algorithm from the dataset are plotted against the data, with the selected peak positions marked by red vertical lines. The methodology followed in this paper is to select a reduced set of measurements, or "features", that can be used to distinguish between cancer and control patients.

            These features are ion intensity levels at specific mass/charge values. Note the interesting peak around 8100 Daltons (Da), which appears to be shifted to the right in healthy samples. The discriminative features differentiate between control patients and cancer patients.
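            A sketch of such a plot is given below; bestFeatures is the assumed vector of selected row indices returned by the GA, and cancerIdx and normalIdx are the group indices from the earlier sketch.

    % Plot the mean spectrum of each group and mark the selected m/z positions
    % with red vertical lines.
    plot(MZ, mean(Y(:, cancerIdx), 2), 'b', MZ, mean(Y(:, normalIdx), 2), 'g');
    hold on
    for f = round(bestFeatures)
        line([MZ(f) MZ(f)], ylim, 'Color', 'r');   % red vertical line at each selected peak
    end
    xlabel('Mass/Charge (M/Z)');
    ylabel('Ion Intensity');
    legend('Cancer', 'Control');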

 CONCLUSION AND FUTURE SCOPE

            The GA discovered the optimal features, or peaks, in the mass spectrometry data. On the basis of the specific features extracted from the mass spectrometry data by the genetic algorithm, we distinguished cancer patients from control patients.

             Finally, the results of the simulation were discussed and graphs of the results were plotted for a better understanding of the findings. The work presented in this paper is only the start of the exploration of clinical data and medical datasets.

             Many other issues are still to be resolved and warrant further investigation. The following are some suggestions for future directions in which the work presented in this paper could be extended:

             Other data mining techniques, such as prediction, association rule mining and clustering, will be applied to the dataset. More robust models for these data mining tasks will be designed in MATLAB.

             The comprehensibility of the discovered patterns (features) could be improved with a proper modification of the fitness function.

            Future work should include more experiments with other datasets, as well as more elaborate experiments to optimize several parameters of the algorithm, such as the mutation rate and the limit threshold for the weight field.
