DATA MINING USING NEW PERSPECTIVE GENETIC ALGORITHM

ABSTRACT

           In this paper the Genetic Algorithm (GA) is used to mine a real-world dataset in the medical domain. The growth of information has proceeded at an explosive rate in recent years.

             The volume of data now available is beyond what the human mind can analyze, and its sheer size makes it extremely difficult to draw meaningful conclusions from it. Data mining technology is introduced to overcome this problem.

             In most fields we are busy collecting data rather than uncovering the patterns hidden in it, and there are not enough qualified human analysts who can translate all of this data into knowledge and find the interesting patterns. Data Mining is a process that starts with data and ends with previously unknown patterns and knowledge. In this paper the Knowledge Discovery in Databases (KDD) process is followed, and the GA is applied at its data mining stage. This series of activities is divided into five steps, during which the raw data is selected and analyzed to reveal patterns and create new knowledge.

             The KDD process includes the steps of Selection, Pre-processing, Transformation, Data Mining and Interpretation of Patterns.
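
             As an orientation, these five KDD steps can be sketched as a simple pipeline. The Python skeleton below is only illustrative: the function bodies are placeholders, and the synthetic input array stands in for the raw spectra.

```python
# Minimal sketch of the five KDD steps as a Python pipeline.
# All function bodies are placeholders; names are illustrative only.
import numpy as np

def select(raw_data):
    """Selection: keep only the records/columns relevant to the study."""
    return raw_data

def preprocess(data):
    """Pre-processing: handle missing values and obvious measurement errors."""
    return data

def transform(data):
    """Transformation: reduce dimensionality, build a descriptive model."""
    return data

def mine(data):
    """Data Mining: here a Genetic Algorithm would search for discriminative patterns."""
    return {"patterns": []}

def interpret(patterns):
    """Interpretation: turn recognized patterns into usable knowledge."""
    return patterns

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    raw = rng.random((100, 500))     # synthetic stand-in for raw spectra
    knowledge = interpret(mine(transform(preprocess(select(raw)))))
```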

 INTRODUCTION

            Databases are valuable treasures. A database not only stores and provides data but also contains hidden, precious knowledge, which can be very important: a new law in science, a new insight for curing a disease, or a new market trend that can make millions of dollars. Data mining, or knowledge discovery in databases, is the automated process of sifting through the data to get at the gold buried in the database. Bioinformatics databases are also very large. Basically, bioinformatics is the application of computer science and information technology to the fields of biology and medicine.

          It draws on the areas of data mining, image processing, databases and information systems, information and computation theory, algorithms, web technologies, artificial intelligence and soft computing, structural biology, software engineering, modeling and simulation, signal processing, discrete mathematics and statistics.

           Mass spectrometry is actively being used to discover disease-related proteomic patterns in complex mixtures of proteins derived from tissue samples or from easily obtained biological fluids. The potential importance of these clinical applications has made the development of better methods for processing and analyzing the data an active area of research. It is, however, difficult to determine which methods are better without knowing the true biochemical composition of the samples used in the experiments.

             A mass spectrometer is an instrument that measures the masses of individual molecules that have been converted into ions, i.e., molecules that have been electrically charged. Since molecules are so small, it is not convenient to measure their masses in kilograms, grams, or pounds. A more convenient unit for the mass of individual molecules is therefore used: the dalton (unified atomic mass unit).

          Bioinformatics research raises many problems that can be cast as machine learning tasks. In classification or regression, the task is to predict the outcome associated with a particular individual given a feature vector describing that individual; in clustering, individuals are grouped together because they share certain properties; and in feature selection, the task is to select those features that are important in predicting the outcome for an individual.

          Mass spectrometry (MS) is an analytical technique used to measure the mass-to-charge ratio of charged particles. The masses determined by MS are used to find out the elemental composition of a sample or molecule and to elucidate the chemical structures of molecules. MS works as follows: it ionizes chemical compounds to generate charged molecules and measures their mass-to-charge ratios.
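
             As a small worked illustration (not taken from the paper), the m/z value reported for a positively charged ion can be computed from the neutral mass, assuming an [M + nH]n+ ion and a proton mass of roughly 1.00728 Da:

```python
# Sketch: m/z of a peptide in positive ion mode, assuming an [M + nH]^(n+) ion.
PROTON_MASS = 1.00728  # mass of a proton in daltons (Da)

def mass_to_charge(neutral_mass_da, charge):
    """m/z = (M + n * m_proton) / n for an [M + nH]^(n+) ion."""
    return (neutral_mass_da + charge * PROTON_MASS) / charge

# Example: a 1,500 Da peptide carrying two protons appears near m/z 751.
print(mass_to_charge(1500.0, 2))   # ~751.0
```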

             MS is currently being used to find disease-related proteomic patterns in proteins derived from tissue samples or from biological fluids. The potential significance of these clinical applications has made the development of better methods for processing and analyzing the data an active area of research.

             The KDD process is interactive and iterative, involving numerous steps with many decisions made by the user. A practical view of the KDD process emphasizes this interactive nature. Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

 IMPORTANCE

            The Genetic Algorithm is applied at the data mining step of the KDD process. The core purpose of the paper is to apply the GA to a mass spectrometry dataset to find interesting patterns. The GA searches for the optimal features, or peaks, in the mass spectrometry data. On the basis of a reduced set of features extracted from the data by the GA, cancer patients can be distinguished from control patients. These discriminative features are ion intensity levels at specific mass-to-charge (m/z) values.
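
            One plausible way to realize this, sketched below under assumed names, is to encode each candidate peak subset as a bit string and score it by the cross-validated accuracy of a simple classifier trained only on the selected peaks. The arrays X (spectra by peaks) and y (0 = control, 1 = cancer) and the linear SVM are assumptions for illustration, not the paper's exact setup.

```python
# Sketch: scoring a 0/1 peak mask by cross-validated classification accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def fitness(chromosome, X, y):
    """Fitness of a 0/1 peak mask: cross-validated accuracy of a linear
    classifier trained only on the selected peaks, minus a small penalty
    per peak so that smaller feature sets are preferred."""
    selected = np.flatnonzero(chromosome)
    if selected.size == 0:
        return 0.0
    acc = cross_val_score(LinearSVC(dual=False), X[:, selected], y, cv=5).mean()
    return acc - 0.001 * selected.size

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 100))             # 60 spectra x 100 peaks (synthetic)
    y = (X[:, 3] - X[:, 7] > 0).astype(int)    # toy labels: 0 = control, 1 = cancer
    mask = rng.integers(0, 2, size=100)        # one random chromosome
    print("fitness:", fitness(mask, X, y))
```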

 FORMULATION OF PROBLEM

             The Knowledge Discovery in Databases (KDD) process is used, and the GA is applied at its data mining stage. This series of activities is divided into five steps, during which the raw data is selected and analyzed to reveal patterns and create new knowledge. The approach to the problem includes the following steps.

SELECTION

             The huge amount of raw data needs to be preselected for the following steps to reduce the data-handling overhead. Data understanding and background knowledge are essential requirements for the selection phase.

PREPROCESSING

             The preprocessing step is executed after the data selection step and is a heavy, time-consuming task. The preselected data is verified to find unsuitable values and is edited where required. During verification, missing data may be discovered, resulting from measurements that were omitted or taken incorrectly, or from instrument malfunctions. Missing values can be completed, for example, by human input, averaged values or fuzzy set values.
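
             As a minimal sketch of the "averaged values" strategy, assuming the preselected spectra are held in a NumPy array with NaN marking missing intensities, mean imputation could look like this:

```python
# Sketch: completing missing intensity values by column means, assuming the
# pre-selected spectra sit in a NumPy array with NaN marking missing readings.
import numpy as np
from sklearn.impute import SimpleImputer

spectra = np.array([[1.2, np.nan, 0.8],
                    [1.0, 2.1,    np.nan],
                    [0.9, 1.9,    0.7]])

imputer = SimpleImputer(strategy="mean")   # "averaged values" from the text
clean = imputer.fit_transform(spectra)     # NaNs replaced by per-peak means
```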

 TRANSFORMATION

            The transformation step is executed after preprocessing to create a descriptive model of the data that enables computer-based processing. Dimension reduction is used to reduce the amount of data while keeping the information content as similar as possible.
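
            One common way to do this, shown here only as a sketch with synthetic stand-in spectra, is principal component analysis (PCA), which keeps a small number of components that explain most of the variance:

```python
# Sketch: dimension reduction of spectra (rows = samples, columns = m/z bins)
# with PCA, keeping enough components to explain ~95% of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
spectra = rng.random((50, 2000))           # stand-in for 50 spectra x 2000 bins

pca = PCA(n_components=0.95)               # retain 95% of the variance
reduced = pca.fit_transform(spectra)
print(reduced.shape)                       # far fewer columns than 2000
```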

 DATA MINING (DM)

            The data mining step is executed after the transformation step. Its overall purpose is the recognition and extraction of patterns from the transformed data set. A data mining technique (e.g. classification or clustering) that best fits the application requirements has to be chosen. In this step the GA is used for the recognition and extraction of patterns from the dataset.
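
            As an illustration of the classification choice, the sketch below trains a linear classifier on synthetic stand-in data and evaluates it on a hold-out set; the feature matrix, labels and classifier are assumptions, not the paper's dataset:

```python
# Sketch: choosing classification as the data-mining technique. Synthetic data
# stand in for the transformed spectra; labels 0 = control, 1 = cancer.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 40))             # 120 samples x 40 transformed features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy separable labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LinearSVC(dual=False).fit(X_tr, y_tr)
print("hold-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```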

INTERPRETATION OF PATTERNS

           The results and recognized patterns of the data mining process are interpreted to create new knowledge. The results can influence every step of the overall process.

GENETIC ALGORITHM

            The Genetic Algorithm was developed by John Holland in the 1970s. GAs are based on the genetic processes of biological organisms. Over many generations, natural populations evolve according to the principles of natural selection and “survival of the fittest”, first clearly stated by Charles Darwin in The Origin of Species. GAs are adaptive methods which may be used to solve search and optimization problems. After a number of new generations built with the mechanisms described below, one obtains a solution that cannot be improved any further; this solution is taken as the final one.

             The specific kind of GA used throughout this work is the Standard GA, one of several GA variants. More broadly, GAs form a subgroup of evolutionary algorithms. In general, a GA is a search algorithm based on natural selection and genetics.

         It uses a population of artificial individuals to explore a complex search space by means of selection, crossover and mutation operators. The purpose of using a GA is to search for and find an optimal, or good enough, solution hidden in a large search space. There is no guarantee of finding an exact solution when using a GA.
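
         A minimal Standard GA loop, assuming bit-string chromosomes, tournament selection, one-point crossover and bit-flip mutation, might look like the following. The fitness function here is a placeholder (count of ones) and would be replaced by the classification-based fitness over selected m/z peaks; all sizes and rates are arbitrary choices.

```python
# Minimal standard GA sketch: bit-string chromosomes, tournament selection,
# one-point crossover and bit-flip mutation. Fitness is a placeholder.
import numpy as np

rng = np.random.default_rng(42)
N_GENES, POP_SIZE, GENERATIONS = 30, 40, 50
CROSSOVER_RATE, MUTATION_RATE = 0.8, 0.02

def fitness(chrom):
    return chrom.sum()                     # placeholder objective: count of ones

def tournament(pop, fits, k=3):
    """Pick the best of k randomly chosen individuals (returned as a copy)."""
    idx = rng.integers(0, len(pop), size=k)
    return pop[idx[np.argmax(fits[idx])]].copy()

pop = rng.integers(0, 2, size=(POP_SIZE, N_GENES))
for _ in range(GENERATIONS):
    fits = np.array([fitness(c) for c in pop])
    children = []
    while len(children) < POP_SIZE:
        p1, p2 = tournament(pop, fits), tournament(pop, fits)
        if rng.random() < CROSSOVER_RATE:              # one-point crossover
            point = rng.integers(1, N_GENES)
            p1[point:], p2[point:] = p2[point:].copy(), p1[point:].copy()
        for child in (p1, p2):                         # bit-flip mutation
            flip = rng.random(N_GENES) < MUTATION_RATE
            child[flip] ^= 1
            children.append(child)
    pop = np.array(children[:POP_SIZE])

best = pop[np.argmax([fitness(c) for c in pop])]
print("best chromosome:", best, "fitness:", fitness(best))
```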
