ABSTRACT
In this paper, a Genetic Algorithm (GA) is used to mine a real-world dataset
from the medical domain. The volume of available information has grown
explosively in recent years and is now beyond our capacity to analyze unaided.
Because of the size of the data, it is extremely difficult to draw meaningful
conclusions from it. Data mining technology was introduced to address this
problem.
In most fields we are busy collecting and handling data and pay little
attention to the patterns hidden in it. There are also not enough qualified
human analysts who can translate all of this data into knowledge and find the
interesting patterns. Data mining is a process that starts with data and ends
with previously unknown patterns and knowledge. This paper uses the Knowledge
Discovery in Databases (KDD) process and applies the GA at its data mining
stage. The KDD process is divided into five steps, during which the raw data is
selected and analyzed to reveal patterns and create new knowledge.
The five KDD steps are Selection, Preprocessing, Transformation, Data Mining
and Interpretation of Patterns.
INTRODUCTION
Databases are valuable treasures. A database not only stores and provides data
but also contains hidden knowledge that can be very important: a new law in
science, a new insight for curing a disease, or a new market trend worth
millions of dollars. Data mining, or knowledge discovery in databases, is the
automated process of sifting through the data to find the gold buried in the
database. Bioinformatics databases are also very large. Bioinformatics is the
application of computer science and information technology to the fields of
biology and medicine.
It includes the areas of data mining, image processing, databases and
information systems, information and computation theory, algorithms, web
technologies, artificial intelligence and soft computing, structural biology,
software engineering, modeling and simulation, signal processing, discrete
mathematics and statistics.
Mass spectrometry is actively being
used to discover disease related proteomic patterns in complex mixtures of
proteins derived from tissue samples or from easily obtained biological fluids.
The potential importance of these clinical applications has made the
development of better methods for processing and analyzing the data an active
area of research. It is, however, difficult to determine which methods are
better without knowing the true biochemical composition of the samples used in
the experiments.
A mass spectrometer is an instrument that measures the masses of individual
molecules that have been converted into ions, i.e., molecules that have been
electrically charged. Since molecules are so small, it is not convenient to
measure their masses in kilograms, grams or pounds. A more convenient unit is
therefore needed for the mass of individual molecules, such as the dalton (Da),
also known as the unified atomic mass unit.
Bioinformatics research gives rise to many
problems that can be cast as machine learning tasks. In
classification or regression, the task is to predict the outcome associated
with a particular individual given a feature vector describing that individual;
in clustering, individuals are grouped together because they share certain
properties; and in feature selection, the task is to select those features that
are important in predicting the outcome for an individual.
Mass spectrometry (MS) is an analytical technique used to measure the
mass-to-charge ratio of charged particles. MS determines the masses of
particles, which are used to find the elemental composition of a sample or
molecule and to elucidate the chemical structures of molecules. MS works as
follows: it ionizes chemical compounds to generate charged molecules and then
measures their mass-to-charge ratios.
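As a small worked illustration of the mass-to-charge ratio, the sketch below
computes m/z for a molecule of known neutral mass. It assumes positive-mode
ionization in which each unit of charge comes from one added proton; that
assumption, the function name and the constant name are ours and are not taken
from the paper.

```python
# Approximate mass of a proton in daltons (Da).
PROTON_MASS_DA = 1.007276

def mass_to_charge(neutral_mass_da: float, charge: int) -> float:
    """Return m/z for a molecule that has gained `charge` protons.

    Assumption: every unit of charge is carried by one added proton.
    """
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass_da + charge * PROTON_MASS_DA) / charge
```

Under this assumption, the same molecule appears at a lower m/z value when it
carries more charge, which is why one compound can produce several peaks in a
spectrum.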
MS is currently being used to find disease-related patterns of proteins
(proteomic patterns) derived from tissue samples or from biological fluids. The
potential significance of these clinical applications has made the development
of better methods for processing and analyzing the data an active area of
research.
The KDD process is interactive and iterative, involving numerous steps with
many decisions made by the user. A practical view of the KDD process emphasizes
this interactive nature. Data mining is the process of discovering meaningful
new correlations, patterns and trends by sifting through large amounts of data
stored in repositories, using pattern recognition technologies as well as
statistical and mathematical techniques.
IMPORTANCE
The genetic algorithm will be applied at the data mining step of the KDD
process. The core purpose of the paper is to apply a Genetic Algorithm to a mass
spectrometry dataset to find interesting patterns. The GA will search for the
optimal features, or peaks, in the mass spectrometry data. Using the features
that the GA extracts, we will distinguish cancer patients from control
patients. Specifically, we will select a reduced set of features, or
measurements, that discriminates between the two groups. These features will be
ion intensity levels at specific mass-to-charge values.
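To make the idea of scoring a candidate feature set concrete, here is a minimal
sketch of a fitness function that a GA could maximize. The paper does not say
how a feature subset is evaluated, so everything here is an assumption: the
nearest-centroid classifier, the size penalty and all names are hypothetical.
Accuracy is computed on the training spectra for brevity; a real study would
score on held-out data.

```python
from statistics import mean

def centroid(rows, cols):
    """Mean intensity of each selected column over the given spectra."""
    return [mean(r[c] for r in rows) for c in cols]

def subset_fitness(spectra, labels, mask, penalty=0.01):
    """Score a feature subset (assumed scheme, not the paper's).

    spectra: list of intensity lists, one per patient
    labels:  1 for cancer, 0 for control
    mask:    list of 0/1 flags, 1 means the m/z feature is kept
    Returns nearest-centroid accuracy minus a penalty per kept feature.
    """
    cols = [i for i, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty subset cannot classify anything
    cancer = [s for s, y in zip(spectra, labels) if y == 1]
    control = [s for s, y in zip(spectra, labels) if y == 0]
    c1, c0 = centroid(cancer, cols), centroid(control, cols)
    correct = 0
    for s, y in zip(spectra, labels):
        # Squared Euclidean distance to each class centroid.
        d1 = sum((s[c] - m) ** 2 for c, m in zip(cols, c1))
        d0 = sum((s[c] - m) ** 2 for c, m in zip(cols, c0))
        predicted = 1 if d1 < d0 else 0
        correct += int(predicted == y)
    return correct / len(labels) - penalty * len(cols)
```

The penalty term is one simple way to favor the reduced feature sets described
above: between two subsets with equal accuracy, the smaller one scores higher.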
FORMULATION OF PROBLEM
The Knowledge Discovery in Databases (KDD) process will be used in this paper,
with the GA applied at the data mining stage. The process is divided into five
steps, during which the raw data is selected and analyzed to reveal patterns
and create new knowledge. The approach to solving the problem includes the
following steps.
SELECTION
The huge amount of raw data needs to be preselected for the following steps to
reduce the data overhead. Data understanding and background knowledge are
essential requirements for the selection phase.
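One simple form of preselection for mass spectrometry data is to keep only the
part of each spectrum that lies in a mass-to-charge range of interest. The
sketch below assumes a spectrum is stored as a list of (m/z, intensity) pairs;
that layout, the function name and the bounds are illustrative assumptions, and
choosing a sensible range is exactly where background knowledge comes in.

```python
def select_mz_window(spectrum, mz_min, mz_max):
    """Keep only the (m/z, intensity) pairs with mz_min <= m/z <= mz_max.

    Assumption: `spectrum` is a list of (m/z, intensity) tuples.
    """
    if mz_min > mz_max:
        raise ValueError("mz_min must not exceed mz_max")
    return [(mz, inten) for mz, inten in spectrum if mz_min <= mz <= mz_max]
```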
PREPROCESSING
The preprocessing step is executed after the data selection step. It is a heavy
and time-consuming task. The preselected data is verified to find unsuitable
values and edited where required. During verification, missing data may be
discovered, caused by missing or incorrect measurements or by instrument
malfunctions. Missing values can be filled in by, for example, human input,
averaged values or fuzzy set values.
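Of the options just listed, filling gaps with averaged values is the easiest to
sketch. The example below replaces each missing entry with the mean of the
observed values for the same feature. It assumes missing values are marked with
None and that rows are patients and columns are features; both conventions and
the function name are ours.

```python
from statistics import mean

def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    col_means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        # Assumption: a column with no observed values falls back to 0.0.
        col_means.append(mean(observed) if observed else 0.0)
    return [
        [col_means[c] if r[c] is None else r[c] for c in range(n_cols)]
        for r in rows
    ]
```

The input rows are left unchanged and a new table is returned, so the raw
measurements remain available for later verification.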
TRANSFORMATION
The transformation step is executed after preprocessing to create a descriptive
model of the data that enables computer-based processing. Dimension reduction
is used to reduce the amount of data while keeping its information content as
close to the original as possible.
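The paper does not name a specific dimension reduction method. As one possible
illustration, the sketch below averages consecutive groups of intensity values
(binning), which shortens a spectrum while preserving its overall shape. The
technique, the function name and the bin size are assumptions made for this
example only.

```python
def bin_intensities(intensities, bin_size):
    """Average consecutive groups of `bin_size` intensity values.

    The last bin may hold fewer values if the length is not a multiple
    of `bin_size`; it is averaged over the values it actually contains.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    binned = []
    for start in range(0, len(intensities), bin_size):
        chunk = intensities[start:start + bin_size]
        binned.append(sum(chunk) / len(chunk))
    return binned
```

For example, a spectrum of 10,000 intensity values binned with a bin size of 10
is reduced to 1,000 values.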
DATA MINING (DM)
The data mining step is executed after the transformation step. Its purpose is
the recognition and extraction of patterns from the transformed data set. A
data mining technique (e.g. classification or clustering) that best fits the
application requirements has to be chosen. In this step the GA will be used for
the recognition and extraction of patterns from the dataset.
INTERPRETATION OF PATTERNS
The results and recognized patterns
of the data mining process are interpreted to create new knowledge.
The results can influence every step of the overall process.
GENETIC ALGORITHM
Genetic Algorithms (GAs) were developed by John Holland in the 1970s. They are
based on the genetic processes of biological organisms. Over many generations,
natural populations evolve according to the principles of natural selection and
“survival of the fittest”, first clearly stated by Charles Darwin in On the
Origin of Species. GAs are adaptive methods that can be used to solve search
and optimization problems. After a number of new generations have been built
with the selection, crossover and mutation mechanisms, the search typically
reaches a solution that no longer improves. This solution is taken as the final
one.
The specific kind of GA used throughout this work is a Standard GA, which is
one of several types of GA. More broadly, GAs form a subgroup of evolutionary
algorithms. In general, a GA is a search algorithm based on natural selection
and genetics. It uses a number of artificial individuals that explore a complex
search space by means of selection, crossover and mutation. The purpose of
using a GA is to search for and find an optimal or good-enough solution hidden
in a large search space. There is no guarantee of finding an exact solution
when using a GA.
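The loop of selection, crossover and mutation can be sketched on bit strings as
follows. This is a minimal illustration and not the paper's implementation:
tournament selection, one-point crossover, bit-flip mutation, elitism and all
parameter values are assumptions chosen for brevity, and the function name
`run_ga` is hypothetical. For feature selection, each bit would mark whether
one mass-to-charge peak is kept.

```python
import random

def run_ga(fitness, n_bits, pop_size=20, generations=30,
           crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Maximize `fitness` over bit strings; return (best, history)."""
    rng = random.Random(seed)  # seeded for reproducible runs
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation

    def tournament():
        # Selection: pick two distinct individuals, keep the fitter one.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        best = max(pop, key=fitness)
        history.append(fitness(best))
        new_pop = [best[:]]  # elitism: carry the best individual over unchanged
        while len(new_pop) < pop_size:
            p1, p2 = tournament(), tournament()
            if n_bits > 1 and rng.random() < crossover_rate:
                # One-point crossover: head of p1 joined to tail of p2.
                cut = rng.randint(1, n_bits - 1)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            new_pop.append(child)
        pop = new_pop

    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history
```

A toy call such as `best, history = run_ga(sum, n_bits=12)` maximizes the
number of ones in the string. Because of elitism, the values in `history` never
decrease, but in line with the caveat above, nothing guarantees that the final
individual is the true optimum.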