Molecular docking is a subject of very general interest
both in homo and heterogeneous molecular systems. At the limit of theory, wave
function of the system may be calculated and ensemble energy minimized. At
present this quantum mechanical approach is not practical for systems of
significant size because of computer limitations. Instead, the intermolecular
energy (only)may be represented by an empirical "force field" from
which the ensemble intermolecular energy may quickly be calculated. There remains however, and additional
serious problem in finding optimum molecular docking, since the energy
hypersurface for systems beyond the simplest has many subsidiary energy minima
as well as global minimum. Genetic Algorithms solves this problem.
The intermolecular
interaction is taken as a sum of atom- atom interactions. An atom in a
given molecule is assumed to interact with an atom in a different molecule according to the equation :
Eij = B exp(-Crij) - Arij-6 + qiqj rij -1
commonly referred as exp-6-1
interaction energy of atoms i and j distance rij. The first term
represents the short-range exchange term, the second represents dispersion
energy, and last represents Coulombic interaction.
The energy of the molecular structure is then the sum of
all nonbonded pair potential energy between the atoms of different molecules.
One molecule is fixed in space to set the translational and rotational origin.
There are six degrees of freedom in a rigid molecule. In a larger cluster of N
molecules, there are 6(N-1) degrees of freedom, divided equally between
rotation and translation. The rigidity of the molecules in the cluster leads to
a complicated intermolecular energy surface which includes subsidiary minima as
well as the global minimum. The major difficulty to be overcome in predicting
the structure of these clusters through energy minimization is the challenge of
searching the large and complex energy hypersurface to find the most stable or
global energy minimum.
In case of proteins, computing the functional
conformation of a protein molecule from the amino acid sequence is difficult for two reasons: the
folded conformations are poorly understood
and the space of possible conformations is very large and complex.,
making it difficult to search for an appropriate free energy minimum. While
analyzing the first problem requires detailed models of protein structure, the
second problem can be investigated on much simplified models. Here we address
the search problem and investigate the potential usefulness of GENETIC ALGORITHM, a recently established
search technique, for finding the functional conformation of a protein.
Optimization Methods
And Genetic Algorithms
There is a large
literature describing various approaches to optimization of functions. Methods
which have been used in energy minimization can be divided into three
categories, namely, grid search,
numerical, and stochastic. Grid search algorithm look at objective function
values laboriously for every point in the space, one at a time. This is a
really human kind of search and lacks efficiency. Numerical methods in use
require first derivatives(Gradient), and
sometimes second derivatives(Hessian), of energy with respect to
molecular rotation and translation. The calculation of these derivatives is
time consuming and limits the usefulness of numerical methods especially for
larger systems. Stochastic methods include, among others, random search, simplex
algorithm and simulated annealing. In stochastic category, random sampling of
grid points is rarely used in energy minimization because of the risk of
completely missing the global minimum. Some of the most commonly used methods
are described in greater detail below.
2.1 Steepest Descents
In steepest descents(SD), a line search direction is simply
taken as the current gradient. This direction oscillates on the way to minimum
and many steps will be needed. This inefficient behavior is characteristic of
SD especially on energy surfaces with narrow valleys. Convergence is slow near
a minimum because the gradient approaches zero, but this method works well even
when the starting point is far from the minimum.
2.2 Conjugate Gradient
Conjugate gradients(CG) improve the algorithm by minimizing
only along directions that are mutually conjugate so that movement along one
direction will not counteract progress made in earlier iterations. With SD or
CG, first derivatives of the objective function are needed. In general N2 independent data points are required to
solve an equation in N variables numerically. Since the gradient is a vector of
length N, the best one can hope in a gradient-based minimizer is its
convergence in N steps.
2.3 Newton-Raphson
However, if one has second derivative information, in
principle, a minimization can converge in one step because the second
derivatives form an N ´ N matrix. This is the principle behind the
variable metric minimization
algorithms of which Newton-Raphson(NR) is perhaps the most powerful. NR
will converge in one step if the
surface is purely second order. However as elegant as NR algorithm appears,
there are several drawbacks to its use in minimization. Firstly, the terms in
the Hessian may be difficult to derive
and may be computationally costly. Secondly, when a cluster structure is
far from minimum, the minimization can become unstable. This method diverges
rapidly if the initial forces are too high( or the surface too flat). Finally
calculating, inverting and sorting an N ´ N matrix for a large system
can become unwieldy. To successfully locate the global minimum when the system
falls into any local minimum on a multi-dimensional surface, numerical methods
must be argumented by some stochastic methods such as simulated annealing.
Success in reaching a global minimum can only be achieved by exhaustive grid search or by making use of more detailed advance knowledge
about the shape of hypersurface.
2.4 Simplex Method
The simplex(SX) method involves a series of steps. At the
beginning of the search, four points on a three dimensional subspace are chosen
that form the regular starting simplex. The function value at each of the
points is found. Now a newer simplex is created by mirroring the corner with
the highest function value through the opposite face of the simplex to a lower
point. Then the procedure is restarted. The reflected point should be lower
than one of the points in a new simplex selected by a line search. If it
rotates about a minimum, the search continues with a shortened simplex
baseline. Simplex optimization imposes restriction on the parameters and gives
rise to problems like discontinuities in parameter values and other
singularities. Also SX optimization may easily get trapped at a subsidiary
minimum in the problem hypersurface.
2.5 Simulated Annealing
Simulated Annealing(SA) is a numerical simulation of a
process in which a solid is heated in a heat bath until the particles of the
solid can move freely and the solid melts. Then the melt is allowed to cool
until it solidifies in a certain arrangement. The heating and cooling are
normally repeated many times. The most commonly used algorithm for SA was
originated by Metropolis and coworkers. To make use of SA one must have the
following elements: (1) Description of possible system configurations. (2) A
generator of random changes on the configuration. (3) An objective function E
whose minimization is the goal of the procedure. (4) The control parameter T
and an annealing schedule which tells how it is lowered from high to low. The
control parameter or temperature starts
at a relatively high value and gradually cools down. A large number of Monte
Carlo steps are performed at each temperature, typically 100-1000. Each step is
accepted with better performance or decrease in energy. If the energy increases, the step is
accepted with a probability exp(-DE/kT), where DE is the change in energy, T
is the temperature, and k is the Boltzman constant.
2.6 Genetic Algorithm(GA)
GA uses random choice to guide an intelligent and highly
exploitative search of difficult-to-optimize functions. Inspired by the
Darwinian notion of evolution in which only the fittest survive, this algorithm
crossbreeds trial solutions and allows only the “fittest” solutions to survive
after several generations. By maintaining a multipoint perspective on many
regions of the space, GA has a much higher chance of locating a global minimum.
The table below compares GA with other conventional optimization methods.
FEATURES SD CG NR SX
SA GA
Derivative calculation can
be avoided ?
√ √ √
Both local and global minima
can be found? √ √
Both continuos and discrete
search applicable? √ √
Overall robust?
√ √
Learning to adapt to the
environment? √
Parallel Computing √
3.1 Natural Evolution : The Initial Inspiration
Genetic Algorithms were invented to mimic the
processes observed in natural
evolution. John Holland, the inventor incorporated some of these processes in
solution of various problems.
The mechanisms that drive the evolutionary processes are
not fully understood, but some of its features are known. Evolution takes place
on "Chromosomes"-the organic
devices for encoding the structure of living beings. A living being is created partly by a process of
"decoding" chromosomes. The specifics of chromosomes are not fully
understood but some general features of the theory are widely accepted.
·
Evolution
is a process that operates on chromosomes rather than on the living beings they
encode.
·
Natural
selection is a link between their performances and their decoded structures. Process of natural selection causes those chromosomes
that encode successful structures to
reproduce more often than those that do not.
·
Process
of reproduction is a point at which reproduction takes place. Mutation may cause the child of biological
children to be different from those of their biological parents and
combination process may create quite
different children by combining material from chromosomes of their
parents.
·
Biological
evolution has no memory. Whatever it knows about reproducing individuals that will function well in
their environment is contained in the
gene pool-set of chromosomes carried by current individuals and in
the structures of the chromosome
decoders. These features of natural
evolution intrigued John Holland in early
1970's. Holland believed that appropriately incorporated in a
computer algorithm, they may yield a
technique to solve difficult problems in a way nature has done through evolution.
When Holland began to study these algorithms they did not
have a name. As the field began to
demonstrate its potential, it was necessary to christen it. In reference to its origins in the study of
genetics, Holland named the field
"Genetic Algorithms".
3.2 A Top-Level View Of The
Genetic Algorithm
To begin with, let us consider the mechanisms that link the
genetic algorithm to the problem it is solving. There are two such mechanisms -
a way of encoding the solutions to the problem on chromosomes, and an
evaluation function that returns the worth of any chromosome in the context of
the problem. We can use GA to carry out simulated evolution on a population of
solutions. Following is the general way GA operates.
1. Initialize a population
of chromosomes.
2. Evaluate each chromosome
in the population.
3. Create new chromosome by
mating current chromosomes; apply mutation and recombination
as the parent chromosome mate.
4. Delete older chromosomes
to make room for the new.
5. Evaluate the new
chromosome and insert them into the population.
3.3 The
Anatomy Of Genetic Algorithm
GA consists of three principal parts- Evaluation Module,
Population Module, and Reproduction
Module. The Evaluation Module contains a evaluation function that measures the
worth of any chromosome to the problem on hand. The Population Module contains
the population of chromosomes and techniques for creating and manipulating that
population. The Reproduction Module contains the techniques for creating new
chromosomes during the reproduction.
3.4 Genetic Algorithm
In Action
When GA is run,
all its modules work together to effect the
evolution of a population of chromosomes. First initialization technique
generates a population of N random bit strings. This population makes up the
'first generation' of chromosome. Next each chromosome in the population is
evaluated by the evaluation function, and fitness value for all chromosome are
computed by a fitness technique.
Now GA begins a
series of cycles of replacing the current population of chromosomes by a new
population. Each of these cycles produce
a new generation of chromosomes. In each cycle there are N/2 occurrences
to pick two 'parent' chromosomes. The Reproduction Module applies one-point
crossover and mutate operator to those
two chromosomes to generate two new chromosomes called 'children'. The operator
may cause those two children to differ from their parents. After N/2
occurrences of reproduction, N new chromosomes have been produced. These N
chromosomes are evaluated, their finesses
are calculated, and they replace the current N chromosomes to form the
next generation.
The generational replacement continues until a break
condition is encountered. At this point GA
halts. If all has gone well, a randomly generated population of
chromosomes has evolved so that, later generation contains individuals that are
better than any found earlier.
The interaction between the population module and other
modules of GA during a run is very simple. When GA is running, population
module asks reproduction module for a
new population. The reproduction module asks
population module for parents to be used in reproduction events. In each
reproduction event, the reproduction module gets two parents from population module,
applies one point crossover and mutate operator to the parents, and sends the
two children created to the population module until sufficient number of children have been generated to fill a new
generation of N chromosomes.
The population module interacts with the evaluation
module during the run, in that whenever a new set of children is produced, the
population module asks the evaluation
module to evaluate each new child before the child is placed in the population.
3.5 Parent Selection, Mutation,
And Crossover
Several components of GA
require further discussion. In this
section we examine each of them in detail.
3.5.1 Roulette Wheel Parent Selection
The purpose of parent selection in GA is to give more
reproductive chances, on the whole, to those population members that are most
fit. This type of selection does exactly that.
The procedure goes like this:
1. Sum the fitnesses of all
the members in the population; call the result total fitness.
2. General n , a random
number between 0 and the total fitness.
3. Return the first
population member whose fitness, added to the fitness of the preceding population members, is greater then or equal to
n.
Now, consider the following tables:

Chromosome 1 2
3 4 5 6 7
8 9 10
![]()
Fitness
8 2 17 7 2
12 11 7
3 7
Running Total 8 10
27 34 36
48 59 66
69 76
![]()

Random Number
23 49 76 13 1
27 57
Chromosome Chosen 3 7
10 3 1
3 7
Table shows a population of ten chromosomes with set of
evaluations totaling 76. The first row of table contains index of each
chromosome, the second contains each chromosomes fitness. The table also shows
set of 7 random numbers generated from
0 to 76, together with the index of the chromosome that would be chosen by
roulette wheel parent selection for
each of these numbers. In each case, the chromosome chosen is the first one at
which the running total of the fitness is greater than or equal to the random
number.
This algorithm is called roulette wheel selection because
it can be viewed as allocating pie-shaped slices on a roulette wheel to the
population members, with each slice proportional to the member's fitness.
Selection of a population member to be a parent can then be viewed as a spin of
the wheel, with winning population member being the one in whose slice the
roulette wheel spinner ends up.
3.5.2 One-Point Crossover And Mutate
Its function is to cause chromosomes created during
reproduction to differ from those of their parents. First the operator makes a
copy of the parents and their two children.
Then it applies two functions to those children that may alter them. These
functions are explained in reverse
order below.
3.5.3 Bit Mutation
Bit mutation is one procedure carried out by one-point
crossover and mutate. When bit mutation is applied to a bit string it sweeps
down the list of bits, replacing each by randomly selected bit if the
probability test is passed. Bit mutation has a associated probability parameter
which is typically quite low(0.008). Thus when a bit mutation is applied to a
chromosome during a run of GA , each bit in a chromosome will have eight in one
thousand chance of being randomly replaced.
Now consider the following table.
|
Old Chromosome |
Random Number |
New Bit |
New Chromosome |
|
1 0
1 0 |
.801
.102 .266 .373 |
- |
1 0
1 0 |
|
1 1
0 0 |
.120
.096 .005 .840 |
0 |
1 1
0 0 |
|
0 0
1 0 |
.760
.473 .894 .001 |
1 |
0 0
1 1 |
This table contains several
examples of the operation of bit mutation. We see that for first chromosome
probability check is never passed, so in this case the output of bit mutation
is same as the input. In the second case probability test is passed for the
fourth bit. However the randomly generated bit is same as the original bit, and so there is no effective change.
For the third chromosome, the
probability test is passed for the fifth bit and the new bit differs from the
old. Thus the third chromosome in the table is the only one that has changed by
bit mutation operator.
3.5.4 One-Point Crossover
There are other processes that alter chromosomes during
reproduction, and some evolutionary biologists believe that these may be at
least as important as bit mutation. One of them is called 'crossover'. In genetic
algorithm, crossover recombines the genetic material in two parent chromosomes
to make two children. John Holland experimented with a crossover operator
called 'one-point crossover'. One point crossover when parts of two parent
chromosomes are swapped after a randomly selected point, creating two children.
One-point crossover is shown below.
Parent 1: 1 1
1 1 1 1 Child 1:
1 1 1 0 0 0
===>
Parent 2: 0 0
0 0 0 0 Child 2:
0 0 0 1 1 1
![]()
Parent 1: 1 0
1 1 0 1 Child 1:
1 0 1 1 0 0
===>
Parent 2: 0 0
1 1 0 0 Child 2:
0 0 1 1 0 1
The children are made by cutting parent chromosomes at the
point denoted by the vertical line and exchanging parental genetic material
after the cut. Crossover is an extremely important component of a GA. Many GA
practitioners believe that if we deleted crossover from GA it would no longer
be a GA. Such claim has not been made for mutation operators. With out the
crossover operator GA is just like any other optimization algorithm like
dynamic programming, hill climbing etc. Crossover acts as a critical
accelerator as the GA runs. To see this let's consider two species of organisms
occupying roughly the same ecological
niche and same chromosomal structure. One called the cross and other called the
mutes. Set of mutes use only mutation,
and set of cross use both mutation and crossover. Suppose that mutation rate is
same for both the species, and suppose further that there are two independent
and quite beneficial mutations that will immensely improve these organisms fitness
in their environmental niche. Let's call these mutations mutation A and
mutation B.
What happens when mutation A and mutation B occurs in both
cross and the mutes. For mutes, members with mutation A or mutation B will
likely come to dominate the population, since these mutations will confer added
fitness on their predecessors. But it will take long time before any individual
comes to have both the mutations, since
an individual with mutation A can acquire mutation B only through mutation, and
these mutations are quite unlikely. For the cross, the situation is different. As soon as the
individuals with mutation A and mutation B come to dominate the population, it
becomes quite likely that crossover will act to combine them. Thus cross with
both the mutations A and B will have an added edge over the mutes. Thus we see
that mutation increases the diversity in the population and then crossover
spreads the diversity through out the population.
The point here is that crossover acts to combine the
'building blocks' of good solutions from diverse chromosomes. For example
chromosome (011011) contains 32 building blocks including
'0##0#','#1###','011#1','#####','01101'. Where # indicates bit values at those positions which are not
important for the solution.
Holland calls each building block a 'schema'. His
investigation into the success of GA led him to conclude that GA manipulate
'schemata' as they run and Holland's Schema theorem asserts precisely this. Let
'r' be the average fitness of all chromosomes in the population containing
schema 's'. Let 'n' be the number of chromosomes containing 's'. Let 'a' be the
average fitness of all the chromosomes in the population. Then the expected
number of occurrences of 's' in the next generation of the population is nr/a, minus the disruptions caused by
mutation and crossover. This feature of GA has been described by Holland as the
'intrinsic parallelism', in that the algorithm is manipulating large number of
schema in parallel.
3.6 The Inversion Operator
This operator acts in a way that it inverts the order of
elements between two randomly chosen points. Thus unlike one-point crossover,
it takes one parent and returns one child. While this operator was inspired by
a biological process, it requires additional overhead and it has not in general
been found to be useful in GA practice.
3.7 Cost Function, Decision
Variables And Search
Space
In most of the practical optimization problems, the aim
is to find the optimal parameters to increase the production and/or to reduce
the expenditure/ loss , that is to get maximum profit by reorganizing the
system and its parameters. Since this will finally reflect the cost this is
called the cost function. A carefully
written and convergent computational algorithm would eventually find an optimum
solution. The parameters of the system that decide the cost are called decision
variables. The search space is Euclidean in which parameters take different
values and each point in this space is a possible solution of the problem.
3.8 A Simple Genetic Algorithm Problem
Consider the problem of maximizing the function
f(x) = x 2 -64x + 100
This function has a global
maximum value of 100 at x = 0, as can be easily computed. To use a genetic
algorithm, decision variables of the problem have to be coded in strings of
finite length. For this problem, we can encode the variables as a binary string
of length 6. We create an initial population with 4 samples by randomly
selecting them from the interval 0 to 63 and encode each sample. A binary
string of length 6 can represent any value from 0 to 63(26 - 1);
hence string length is chosen as 6 for example. Four encoded samples in the
initial population are:
5 = 0 0 0 1 0 1
60 = 1 1 1 1 0 0
33 = 1 0 0 0 0 1
8 = 0 0 1 0 0 0
These individuals are sorted
according to their fitness values.( They are arranged in the descending order
of their fitness values). In this case, fitness value is same as the cost
function value. The sorted individuals are given as :
|
No. |
x |
String |
Fitness |
|
1 |
60 |
1 1 1 1 0 0 |
-140 |
|
2 |
5 |
0 0 0 1 0 1 |
-195 |
|
3 |
8 |
0 0 1 0 0 0 |
-348 |
|
4 |
33 |
1 0 0 0 0 1 |
-923 |
In the 1st generation, the
1st and 2nd strings are corssed over at site 3 (crossover stie randomly
selected) to get two new strings:
crossover site
newstrings fitness of new
strings
1 1 1 1 0 0 1 1 1 1 0 1 -83
0 0 0 1 0 1 0 0 0 1 0 0 -140
Similarly 3rd and 4th
strings are crossed over at crossover site , to get:
crossover site new strings fitness
of new strings
0 0 1 0 0 0 0 0 0 0 0 1 37
1 0 0 0 0 1 1 0 1 0 0 0 -860
storing these new individuals we get
|
No. |
x |
String |
Fitness |
|
1 |
1 |
0 0
0 0 0 1 |
37 |
|
2 |
61 |
1 1
1 1 0 1 |
-83 |
|
3 |
4 |
0 0
0 1 0 0 |
-140 |
|
4 |
40 |
1 0
1 0 0 0 |
-860 |
We see that in one
generation fitness is improved from -140 to 37 (f(60) < f(1)). Before
proceeding to the next step, the weakest member in the ppulaion is replaced by
the fittest member of the previous population, that is, string 1 1 1 1 0 0,
whose fitness is -140.
In 2nd generation, the 1st and 2nd strings are
crossed over at site 1, to get:
crossover site new strings fitness of new strings
0 0 0 0 0 1 0
1 1 1 0 1 -195
1 1 1 1 0 1 1
0 0 0 0 1 -923
Similarly 3rd and 4th
strings are crossed over at crossover site 3 to get:
crossover site new strings fitness of new strings
0 0 0 1 0 0 0
0 0 1 0 0 -140
1 1 1 1 0 0 1
1 1 1 0 0 -140
Replacing the weekest member
by the fittest member of the previous population (string 1 0 0 0 0 1 with
fitness value of -923 is replaced by string 0 0 0 0 0 1 with fitness value of
37) and storing them according to the fitness values, we get:

No. x string fitness
![]()
1 1 0
0 0 0 0 1
37
2 4 0 0 0 1 0 0 -140
3 60 1
1 1 1 0 0 -140
4 29
0 1 1 1 0
1 -915
In the 3rd generation, the
above process of crossover (at site=4)
is repeated (not shown here). The new set of strings in the population, after
replacement of the weakest by the fittest member of the previous population is
given as:

No. x string fitness
1 0 0 0 0 0 0 0 100
2
1 0 0 0 0 0
1 37
3 61 1 1 1 1 0 1 -83
4 5 0 0 0 1
0 1 -195
For simplicity, the
probability of mutation is chosen as 0(zero). We see that as the genetic operations continue from one
generation to the next, improved solutions evolve. At x = 0, f(x) = 100, which
is the desired result. Here, the problem which could have been solved using
conventional computational optimization algorithms, is used to illustrate GA
operations for the sake of simplicity.
In general f can be any function of x,y,z..... . But in
our case it is the energy function, because we are interested in minimizing the
energy.
Performance Enhancement Of
GA
In the remainder of this section we investigate four
improvements to the performance of GA.
There are various ways to
enhance the performance of the GA, which are described as follows:
Original Evaluation 200 9 8 8 4 1
Fitness is Evaluation 200 9 8 8 4 1
Windowing with Minimum =
0 199 8 7 7 3 0
Windowing with Minimum =
10 199 10 10 10
10 10
Linear Fitness With
Decrement = 1 100 99 98 97 96
95
Linear Fitness With
Decrement = 20 100 80 60 40 20
1
We have constructed the data in above table so that two
common features of GA runs are exemplified. The first is the existence of a
chromosome that is much better than its nearest competitor. This chromosome -
'a super individual' -will probably eliminate all its competitor. In a
generation or two when fitness is equal to the evaluation, the super individual will have had very few
chances to try recombination with other population members, and the population
will probably experience rapid convergence. The second feature is that there
are several individuals with evaluations that are very close. GAs often finds
sets of close competitors, especially at the end of their runs. When there are
close competitors, we may well want to increase the amount of selection
pressure, so that the best competitors are encouraged to reproduce more often.
4.1 Elitism
We noted earlier that the best member of the population may
fail to produce offspring in the next generation. The 'elitist' strategy fixes
this potential source of loss by copying the best member of each generation
into the succeeding generations. This strategy may increase the speed of domination
of a population by a super individual, but on balance it appears to improve
genetic performance.
4.2 Steady-State Reproduction
When GA reproduces, it replaces its entire set of parents
by their children. This generational replacement technique has some potential
drawbacks. One is that even in an elitist strategy, many of the best individuals
found may not reproduce at all and their genes may be lost. It is also possible
that mutation or crossover may alter the best chromosomes genes so that
whatever was good about them is lost. Neither of these outcomes is desirable.
It works as follows.
1. Create n
children through reproduction.
2. Delete n
members of the population to make room for the children.
3. Evaluate and insert the
children into the population.
4.3 Steady-State Without
Duplicates
Steady state with duplicates is a reproduction technique
that discards the children that are duplicate of the current population rather
than inserting them into the population. When we use this technique, every
member of the population will be different. This technique allows much more use
of our allotted number of chromosomes
by guaranteeing that reproduction never creates duplication in the population.
4.4 Windowing
Find the minimum evaluation in the population. Assign each
chromosome a fitness equal to the amount that it exceeds this minimum.
Optionally, a minimum amount greater than the minimum value may be created as a
guard against the lowest chromosomes having no chance at reproduction.
4.5 Linear Normalization
The chromosomes are arranged in decreasing order of evaluation.
The fitness values are assigned a
constant initial value which decreases in a
linear fashion as GA progresses. The constant value and the rate of decrement are the parameters of
this technique.
Implementation And Results
We wish to develop an implementation of GA suitable for
protein folding. Thus, we seek to use th simplest model that still captures the
essence of the important components of protein folding. The linear sequence is
composed of equally spaced amino acids
of only two types: hydrophobic(1) and hydrophilic(0). This sequence is
folded on a two dimensional square lattice on which at each point the chain can
turn 90o left or right, or continue ahead. Figure below shows
possible conformations of the 20 amino acid molecule :
1-0-1-0-0-1-1-0-1-0-0-1-0-1-1-0-0-1-0-1. The energy function is so chosen that
it takes on a value of -1 for each direct contact(occupying neighboring
non-diagonal lattice points) of non bonded hydrophobic amino acids. Under this
energy function, low energy conformations are compact with a hydrophobic core:
since hydrophobic-hydrophobic interactions are rewarded, the hydrophobic-1
residues tend to be on the inside of a low energy structure, while the
hydrophilic-0 are forced to the surface follows. This energy function was
chosen to mimic what happens in nature. Fully folded proteins always have a
hydrophobic core buried inside and the hydrophilic amino acids are exposed to
the surroundings.
![]()
![]()
o o
![]()
![]()
![]()
![]()
n n o o o - 0
![]()
![]()
o n n n n ■ - 1
![]()
![]()
![]()
o n n n o
![]()
o n o
Energy Evaluation = -9
![]()
![]()
o n o
![]()
n n o
![]()
n o
n
![]()
o o
![]()
![]()
![]()
![]()
n o o o
n n n
n o
Energy Evaluation = -4
In implementing GA one has to choose the appropriate
method of encoding the data, the size of the population, the specific manner of
applying the genetic operators, and the population pruning scheme. Our
implementation of GA is unique in that the solutions are not encoded as binary
strings but rather are the
confirmations themselves which are treated directly in the sprit of genetic operators. The process
starts with N extended structures. In each generation each structure is
subjected to number of mutation steps. At the end of this mutation stage the
crossover operation is performed. The chance p(Si) of a structure
being selected for crossover is directly proportional to its energy value Ei,
that is :
p(Si)=Ei / (ĺ Ej ; j running from 1 to N).
Thus the lower confirmation
energy have a higher chance of being selected. For pair of selected structures
a random point is chosen along the sequence and the N-terminal portion of the
first structure is connected to C-terminal
portion of the second structure as follows.
(A)
(B) (C)
In the above example, the cut point was randomly chosen
to be after residue 14. Joining the first 14 residues of (a) with the last 6
residues of (b) and applying a randomly chosen 270o rotation at that point achieves the compact
structure in (c). In this case energy value of the hybrid (c) is -9 .,lower
than the energies -5 and -2 of its
“parents” . The hybrid was accepted if its energy is lower than averaged
energies of is parents, or non deterministically accepted according to its
energy increase.
As there are three ways to join the parts
together(connecting the chains with the angle of 0o, 90o
or 270o), these possibilities are listed in random order to find one
that is valid(that is where no residue from one structure occupies a lattice
point used by a residue from the other). If none of the three ways lead to a
self avoiding structure, then another pair is selected. Once a valid structure
Sk is created, its energy Ek is evaluated and compared to
the average energy Eij=(Ei+Ej)/2 of its 'parents'. The structure is accepted
if Ek<=Eij, or if the energy will be increased based on
the decision:
Rnd < exp[(Eij-Ek -Pij +Pk)/Ck].
Where Rnd is a random number
between 0 and 1 and Ck is a parameter so chosen that it maintains
the diversity of the population and prevents premature convergence to low
energy conformations. Since the low energy conformations have a compact
structure with rows and columns of amino acids, the evaluation function gives higher weightage to those
structures with higher schema(patterns of regularly arranged rows and columns).
Pij and Pk reflect the corresponding schema of the
parents and the child.
This crossover operation is repeated until N-1 new
accepted hybrid structures have been constructed to constitute the population
of the next generation. In addition, the lowest energy conformation in each
generation id directly replicated in the next generation. We allow higher
acceptance rate for bad moves that increases the energy for mutation steps then
for crossovers.
The simulation was done using four sets of chains
having 20, 36, 48 and 60 amino acids respectively. Separate programs were
written to take care of these four sets. The input to these program is the
population of random solutions which
were generated separately. The input data file includes the X and Y coordinates
of each amino acid in the chain and its nature, hydrophobic or hydrophilic. The
output data file gives the new coordinates of the hydrophobic and hydrophilic
units which corresponds to the minimum energy configuration.
5.1 Results
The energy minimization was
done for four different cases, varying the length of amino acid units. The four
different lengths of amino acid were 20
units,36 units,48 units, and 60 units. The results obtained are as follows for
hydrophobic(1) and hydrophilic(0)
5.1.1 20 Units
Input Chain :
10100110100101100101
Energy minimum: -9
Conformation:
5.1.2 36 Units
Input Chain:000110011000001111111001100001100100
Energy minimum: -14
Conformation:
5.1.3 48 Units
Input Chain:001001100110000001111111111
000001100110010011111
Energy minimum: -22
Conformation:
5.1.4 60 Units
Input
Chain:001110111111110001111111111010
001111111111110000111111011010
Energy minimum: -34
Conformation:
The disk attached contains the code and the data. The
source codes corresponding to the various chain lengths 20, 36, 48 and 60 are
pro20.for, pro36.for, pro48.for, pro60.for. The corresponding input files are
pro20.in, pro36.in, pro48.in and pro60.in ; the output files are en20.out,
en36.out, en48.out and en60.out. The codes are compiled using MS Fortran
Powerstation on a Windows based PC.
File Path Description
.\input\*.in
Input Files
.\code\*.for Fortran
source code
.\output\*.out Output
files
REFERENCES
1. Mitchell A. Potter (1997) - The Design and Analysis of a Computational Model for Cooperative
Coevolution. Ph.D. thesis, George
Mason University, Fairfan, VA.
2. Jerry Bala, Kenneth A. De Jong, Jeffrey Huang, Haleh Vafaie,
and
H. Wechsler (1997) - Using Learning to Facilitate the
Evolution
of Features for Recognizing Visual Concepts. In Special
Issue of Evolutionary
Computation - Evolution, Learning, and Instinct : 100
Years of the
Baldwin Effect, pages 297-311, MIT press.
3. William M. Spears (1997) - Recombination Parameters. In the
Handbook of
Evolutionary Computation. T. Baeck, D. Fogel
and Z. Michalewichz (editors), IOP Publishing and Oxford
University
Press.
4. K. Deb and William M Spears (1997) - Speciation Methods in
the
Handbook of Evolutionary Computation. T. Baeck, D. Fogel
and
Z.
Michalemicz (editors), IOP Publishing and Oxford University
Press.
5. Jayashree Sarma and Kenneth De Jong (1996) - An Analysis of
the Effect Neighborhood Size and Shape on Local Selection Algorithms.
Proceedings of
the Fourth International Conference on Parallel
Problem Solving fro`m Nature. Sept. 22-26, 1996, Berlin,
Germany.
6. Mitchell A. Potter and Kenneth A. De Jong (1995) Evolving Neural
Networks with Collaborative Species. In Proceedings of
the 1995 summer Computer Simulation Conference. July 24-26,
Ottawa, Ontario, Canada , Pages
340-345. The Society for Computer
Simulation.
7. Yongliang Xiao and Donald E. Williams( July 1993) - Genetic
Algorithm: A New Approach to the Prediction of the
Structure of
Molecular Clusters, Chemical Physics Letters, Vol 215,
Number
1.2.3, Pages 17-24.
8. Ron Unger and John Moult (July1992) - Genetic Algorithms for
Protein Folding Simulations, Journal of Molecular
Biology, Vol 321,
Pages 75-81.
9. Jitendra R Raol and Abhijit Jalisatgi (Aug 1996) - From
Genetics
to Genetic Algotirhms - Solution to Optimisation Problems
Using
Natural Systems, Resonance, Vol 1 Number 8, August 1996,
Pages 43-54.
10. Lawrence Davis(Ed. By) - Handbook of Genetic Algorithms,1991
Van Nostrand Reinhold - New York .