CHAPTER 1

Introduction

 

           Molecular docking is a subject of very general interest both in homo and heterogeneous molecular systems. At the limit of theory, wave function of the system may be calculated and ensemble energy minimized. At present this quantum mechanical approach is not practical for systems of significant size because of computer limitations. Instead, the intermolecular energy (only)may be represented by an empirical "force field" from which the ensemble intermolecular energy may quickly be calculated.  There remains however, and additional serious problem in finding optimum molecular docking, since the energy hypersurface for systems beyond the simplest has many subsidiary energy minima as well as global minimum. Genetic Algorithms solves this problem.

        The intermolecular  interaction is taken as a sum of atom- atom interactions. An atom in a given molecule is assumed to interact with an atom in a different  molecule according to the equation :

                                    Eij  = B exp(-Crij) - Arij-6   + qiqj rij  -1   

commonly referred as exp-6-1 interaction energy of atoms i and j distance rij. The first term represents the short-range exchange term, the second represents dispersion energy, and last represents Coulombic interaction.

        The energy of the molecular structure is then the sum of all nonbonded pair potential energy between the atoms of different molecules. One molecule is fixed in space to set the translational and rotational origin. There are six degrees of freedom in a rigid molecule. In a larger cluster of N molecules, there are 6(N-1) degrees of freedom, divided equally between rotation and translation. The rigidity of the molecules in the cluster leads to a complicated intermolecular energy surface which includes subsidiary minima as well as the global minimum. The major difficulty to be overcome in predicting the structure of these clusters through energy minimization is the challenge of searching the large and complex energy hypersurface to find the most stable or global energy minimum.

            In case of proteins, computing the functional conformation of a protein molecule from the amino acid  sequence is difficult for two reasons: the folded conformations are poorly understood  and the space of possible conformations is very large and complex., making it difficult to search for an appropriate free energy minimum. While analyzing the first problem requires detailed models of protein structure, the second problem can be investigated on much simplified models. Here we address the search problem and investigate the potential usefulness of GENETIC ALGORITHM, a recently established search technique, for finding the functional conformation of a protein.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

CHAPTER 2

          Optimization  Methods   And    Genetic Algorithms  

        There is a large literature describing various approaches to optimization of functions. Methods which have been used in energy minimization can be divided into three categories, namely,  grid search, numerical, and stochastic. Grid search algorithm look at objective function values laboriously for every point in the space, one at a time. This is a really human kind of search and lacks efficiency. Numerical methods in use require first derivatives(Gradient), and  sometimes second derivatives(Hessian), of energy with respect to molecular rotation and translation. The calculation of these derivatives is time consuming and limits the usefulness of numerical methods especially for larger systems. Stochastic methods include, among others, random search, simplex algorithm and simulated annealing. In stochastic category, random sampling of grid points is rarely used in energy minimization because of the risk of completely missing the global minimum. Some of the most commonly used methods are described in greater detail below.

 

2.1  Steepest  Descents

        In steepest descents(SD), a line search direction is simply taken as the current gradient. This direction oscillates on the way to minimum and many steps will be needed. This inefficient behavior is characteristic of SD especially on energy surfaces with narrow valleys. Convergence is slow near a minimum because the gradient approaches zero, but this method works well even when the starting point is far from the minimum.

2.2  Conjugate  Gradient

        Conjugate gradients(CG) improve the algorithm by minimizing only along directions that are mutually conjugate so that movement along one direction will not counteract progress made in earlier iterations. With SD or CG, first derivatives of the objective function are needed. In general N2   independent data points are required to solve an equation in N variables numerically. Since the gradient is a vector of length N, the best one can hope in a gradient-based minimizer is its convergence in N steps.

 

2.3  Newton-Raphson

        However, if one has second derivative information, in principle, a minimization can converge in one step because the second derivatives form an     N ´ N matrix. This is the principle behind the variable metric minimization   algorithms of which Newton-Raphson(NR) is perhaps the most powerful. NR will  converge in one step if the surface is purely second order. However as elegant as NR algorithm appears, there are several drawbacks to its use in minimization. Firstly, the terms in the Hessian may be difficult to derive  and may be computationally costly. Secondly, when a cluster structure is far from minimum, the minimization can become unstable. This method diverges rapidly if the initial forces are too high( or the surface too flat). Finally calculating, inverting and sorting an N ´ N matrix for a large system can become unwieldy. To successfully locate the global minimum when the system falls into any local minimum on a multi-dimensional surface, numerical methods must be argumented by some stochastic methods such as simulated annealing. Success in reaching a global minimum can only be achieved  by exhaustive  grid search or by making use of more detailed advance knowledge about the shape of hypersurface.

2.4   Simplex  Method

        The simplex(SX) method involves a series of steps. At the beginning of the search, four points on a three dimensional subspace are chosen that form the regular starting simplex. The function value at each of the points is found. Now a newer simplex is created by mirroring the corner with the highest function value through the opposite face of the simplex to a lower point. Then the procedure is restarted. The reflected point should be lower than one of the points in a new simplex selected by a line search. If it rotates about a minimum, the search continues with a shortened simplex baseline. Simplex optimization imposes restriction on the parameters and gives rise to problems like discontinuities in parameter values and other singularities. Also SX optimization may easily get trapped at a subsidiary minimum in the problem hypersurface.

 

2.5  Simulated  Annealing

        Simulated Annealing(SA) is a numerical simulation of a process in which a solid is heated in a heat bath until the particles of the solid can move freely and the solid melts. Then the melt is allowed to cool until it solidifies in a certain arrangement. The heating and cooling are normally repeated many times. The most commonly used algorithm for SA was originated by Metropolis and coworkers. To make use of SA one must have the following elements: (1) Description of possible system configurations. (2) A generator of random changes on the configuration. (3) An objective function E whose minimization is the goal of the procedure. (4) The control parameter T and an annealing schedule which tells how it is lowered from high to low. The control parameter or  temperature starts at a relatively high value and gradually cools down. A large number of Monte Carlo steps are performed at each temperature, typically 100-1000. Each step is accepted with better performance or decrease in energy.  If the energy increases, the step is accepted with a probability  exp(-DE/kT), where DE is the change in energy, T is the temperature, and k is the Boltzman constant.

2.6 Genetic Algorithm(GA)

            GA uses random choice to guide an intelligent and highly exploitative search of difficult-to-optimize functions. Inspired by the Darwinian notion of evolution in which only the fittest survive, this algorithm crossbreeds trial solutions and allows only the “fittest” solutions to survive after several generations. By maintaining a multipoint perspective on many regions of the space, GA has a much higher chance of locating a global minimum. The table below compares GA with other conventional optimization methods.

 

 

 

 

 

 

                 FEATURES                                 SD   CG   NR   SX   SA   GA 

Derivative calculation can be avoided ?                                               

Both local and global minima can be found?                                         

Both continuos and discrete search applicable?                                      

Overall robust?                                                                                    

Learning to adapt to the environment?                                                                      

Parallel Computing                                                                                                       

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

             CHAPTER 3

                   What Is A Genetic Algorithm ?

 

3.1 Natural Evolution : The Initial Inspiration

        Genetic Algorithms were invented to mimic the processes  observed in natural evolution. John Holland, the inventor incorporated some of these processes in solution of various problems.

        The mechanisms that drive the evolutionary processes are not fully understood, but some of its features are known. Evolution takes place on "Chromosomes"-the  organic devices for encoding the structure of living beings.  A living being is created partly by a process of "decoding" chromosomes. The specifics of chromosomes are not fully understood but some general features of the theory are widely accepted.

·        Evolution is a process that operates on chromosomes rather than on the living beings they encode.

·        Natural selection is a link between their performances and their decoded   structures. Process of  natural selection causes those chromosomes that encode   successful structures to reproduce more often than those that do not.

·        Process of reproduction is a point at which reproduction takes place.   Mutation may cause the child of biological children to be different  from   those of their biological parents and combination process may create quite   different children by combining material from chromosomes of their parents.

·        Biological evolution has no memory. Whatever it knows about reproducing   individuals that will function well in their environment is contained in the   gene pool-set of chromosomes carried by current individuals and in the   structures of the chromosome decoders.  These features of natural evolution intrigued John Holland in early   1970's. Holland believed that appropriately incorporated in a computer   algorithm, they may yield a technique to solve difficult problems in a way   nature has done through evolution.

        When Holland began to study these algorithms they did not have a name.   As the field began to demonstrate its potential, it was necessary to   christen it. In reference to its origins in the study of genetics, Holland   named the field "Genetic Algorithms".

3.2  A Top-Level View Of The Genetic Algorithm

        To begin with, let us consider the mechanisms that link the genetic algorithm to the problem it is solving. There are two such mechanisms - a way of encoding the solutions to the problem on chromosomes, and an evaluation function that returns the worth of any chromosome in the context of the problem. We can use GA to carry out simulated evolution on a population of solutions. Following is the general way GA operates.

1. Initialize a population of chromosomes.

2. Evaluate each chromosome in the population.

3. Create new chromosome by mating current chromosomes; apply mutation          and  recombination as the parent chromosome mate.

4. Delete older chromosomes to make room for the new.

5. Evaluate the new chromosome and insert them into the population.

3.3 The Anatomy Of Genetic Algorithm

        GA consists of three principal parts- Evaluation Module, Population  Module, and Reproduction Module. The Evaluation Module contains a evaluation function that measures the worth of any chromosome to the problem on hand. The Population Module contains the population of chromosomes and techniques for creating and manipulating that population. The Reproduction Module contains the techniques for creating new chromosomes during the reproduction.

3.4  Genetic  Algorithm  In Action

            When GA  is run, all its modules work together to effect the  evolution of a population of chromosomes. First initialization technique generates a population of N random bit strings. This population makes up the 'first generation' of chromosome. Next each chromosome in the population is evaluated by the evaluation function, and fitness value for all chromosome are computed by a fitness technique.

            Now GA  begins a series of cycles of replacing the current population of chromosomes by a new population. Each of these cycles produce  a new generation of chromosomes. In each cycle there are N/2 occurrences to pick two 'parent' chromosomes. The Reproduction Module applies one-point crossover and  mutate operator to those two chromosomes to generate two new chromosomes called 'children'. The operator may cause those two children to differ from their parents. After N/2 occurrences of reproduction, N new chromosomes have been produced. These N chromosomes are evaluated, their finesses  are calculated, and they replace the current N chromosomes to form the next generation.

            The generational replacement continues until a break condition is encountered. At this point GA  halts. If all has gone well, a randomly generated population of chromosomes has evolved so that, later generation contains individuals that are better than any found earlier. 

            The interaction between the population module and other modules of GA during a run is very simple. When GA is running, population module asks  reproduction module for a new population. The reproduction module asks  population module for parents to be used in reproduction events. In each reproduction event, the reproduction module gets two parents from population module, applies one point crossover and mutate operator to the parents, and sends the two children created to the population module until  sufficient number of children have been generated to fill a new generation of N chromosomes.

            The population module interacts with the evaluation module during the run, in that whenever a new set of children is produced, the population  module asks the evaluation module to evaluate each new child before the child is  placed in the population.

3.5  Parent Selection, Mutation, And Crossover

            Several components of GA  require further discussion. In this  section we examine each of them in detail.

3.5.1  Roulette Wheel Parent Selection

            The purpose of parent selection in GA is to give more reproductive chances, on the whole, to those population members that are most fit. This type of selection does exactly that.  The procedure goes like this:

1. Sum the fitnesses of all the members in the population; call the result    total fitness.

2. General n , a random number between 0 and the total fitness.

3. Return the first population member whose fitness, added to the fitness  of the preceding population members, is greater then or equal to n.

        Now, consider the following tables:

 


  Chromosome     1     2     3     4     5     6     7     8     9     10

 


  Fitness              8     2    17     7     2    12    11     7     3     7

  Running Total  8    10    27    34   36   48    59    66   69   76

 

 


  Random Number         23   49   76    13    1    27    57

  Chromosome Chosen   3     7    10     3     1     3      7

 

        Table shows a population of ten chromosomes with set of evaluations totaling 76. The first row of table contains index of each chromosome, the second contains each chromosomes fitness. The table also shows set of 7  random numbers generated from 0 to 76, together with the index of the chromosome that would be chosen by roulette wheel parent selection  for each of these numbers. In each case, the chromosome chosen is the first one at which the running total of the fitness is greater than or equal to the random number.

        This algorithm is called roulette wheel selection because it can be viewed as allocating pie-shaped slices on a roulette wheel to the population members, with each slice proportional to the member's fitness. Selection of a population member to be a parent can then be viewed as a spin of the wheel, with winning population member being the one in whose slice the roulette wheel spinner ends up.

 

 

3.5.2  One-Point Crossover And Mutate

        Its function is to cause chromosomes created during reproduction to differ from those of their parents. First the operator makes a copy of the parents and  their two children. Then it applies two functions to those children that may alter them. These functions are  explained in reverse order below.

 

3.5.3 Bit Mutation

        Bit mutation is one procedure carried out by one-point crossover and mutate. When bit mutation is applied to a bit string it sweeps down the list of bits, replacing each by randomly selected bit if the probability test is passed. Bit mutation has a associated probability parameter which is typically quite low(0.008). Thus when a bit mutation is applied to a chromosome during a run of GA , each bit in a chromosome will have eight in one thousand chance of being randomly replaced.

        Now consider the following table.

Old Chromosome

Random Number

New Bit

New Chromosome

1  0  1  0

.801  .102  .266  .373

-

1  0  1  0              

1  1  0  0

.120  .096  .005  .840

0

1  1  0  0

0  0  1  0

.760  .473  .894  .001

1

0  0  1  1

 

This table contains several examples of the operation of bit mutation. We see that for first chromosome probability check is never passed, so in this case the output of bit mutation is same as the input. In the second case probability test is passed for the fourth bit. However the randomly generated bit is same as the original  bit, and so there is no effective change. For the  third chromosome, the probability test is passed for the fifth bit and the new bit differs from the old. Thus the third chromosome in the table is the only one that has changed by bit mutation operator.

 

 

 

3.5.4  One-Point Crossover

        There are other processes that alter chromosomes during reproduction, and some evolutionary biologists believe that these may be at least as important as bit mutation. One of them is called 'crossover'. In genetic algorithm, crossover recombines the genetic material in two parent chromosomes to make two children. John Holland experimented with a crossover operator called 'one-point crossover'. One point crossover when parts of two parent chromosomes are swapped after a randomly selected point, creating two children. One-point crossover is  shown below.

 

Parent 1: 1  1  1  1   1  1                                    Child 1:  1  1  1  0  0  0

                                                  ===>

Parent 2: 0  0  0  0   0  0                                    Child 2:  0  0  0  1  1  1

 


Parent 1: 1  0  1  1  0  1                                     Child 1:  1  0  1  1  0  0

                                                  ===>

Parent 2: 0  0  1  1  0  0                                     Child 2:  0  0  1  1  0  1

 

        The children are made by cutting parent chromosomes at the point denoted by the vertical line and exchanging parental genetic material after the cut. Crossover is an extremely important component of a GA. Many GA practitioners believe that if we deleted crossover from GA it would no longer be a GA. Such claim has not been made for mutation operators. With out the crossover operator GA is just like any other optimization algorithm like dynamic programming, hill climbing etc. Crossover acts as a critical accelerator as the GA runs. To see this let's consider two species of organisms occupying roughly  the same ecological niche and same chromosomal structure. One called the cross and other called the mutes. Set of mutes use only  mutation, and set of cross use both mutation and crossover. Suppose that mutation rate is same for both the species, and suppose further that there are two independent and quite beneficial mutations that will immensely improve these organisms fitness in their environmental niche. Let's call these mutations mutation A and mutation B.

        What happens when mutation A and mutation B occurs in both cross and the mutes. For mutes, members with mutation A or mutation B will likely come to dominate the population, since these mutations will confer added fitness on their predecessors. But it will take long time before any individual comes  to have both the mutations, since an individual with mutation A can acquire mutation B only through mutation, and these mutations are quite unlikely. For the cross, the  situation is different. As soon as the individuals with mutation A and mutation B come to dominate the population, it becomes quite likely that crossover will act to combine them. Thus cross with both the mutations A and B will have an added edge over the mutes. Thus we see that mutation increases the diversity in the population and then crossover spreads the diversity through out the population.

        The point here is that crossover acts to combine the 'building blocks' of good solutions from diverse chromosomes. For example chromosome (011011) contains 32 building blocks including '0##0#','#1###','011#1','#####','01101'. Where # indicates  bit values at those positions which are not important for the solution.

        Holland calls each building block a 'schema'. His investigation into the success of GA led him to conclude that GA manipulate 'schemata' as they run and Holland's Schema theorem asserts precisely this. Let 'r' be the average fitness of all chromosomes in the population containing schema 's'. Let 'n' be the number of chromosomes containing 's'. Let 'a' be the average fitness of all the chromosomes in the population. Then the expected number of occurrences of 's' in the next generation of the population is   nr/a, minus the disruptions caused by mutation and crossover. This feature of GA has been described by Holland as the 'intrinsic parallelism', in that the algorithm is manipulating large number of schema in parallel.

 

 

 

3.6  The Inversion Operator

        This operator acts in a way that it inverts the order of elements between two randomly chosen points. Thus unlike one-point crossover, it takes one parent and returns one child. While this operator was inspired by a biological process, it requires additional overhead and it has not in general been found to be useful in GA practice.

 

3.7  Cost Function,  Decision  Variables  And  Search  Space

            In most of the practical optimization problems, the aim is to find the optimal parameters to increase the production and/or to reduce the expenditure/ loss , that is to get maximum profit by reorganizing the system and its parameters. Since this will finally reflect the cost this is called the cost function.  A carefully written and convergent computational algorithm would eventually find an optimum solution. The parameters of the system that decide the cost are called decision variables. The search space is Euclidean in which parameters take different values and each point in this space is a possible solution of the problem.

3.8 A  Simple  Genetic Algorithm  Problem

            Consider the problem of maximizing the function

                        f(x) = x 2 -64x + 100

This function has a global maximum value of 100 at x = 0, as can be easily computed. To use a genetic algorithm, decision variables of the problem have to be coded in strings of finite length. For this problem, we can encode the variables as a binary string of length 6. We create an initial population with 4 samples by randomly selecting them from the interval 0 to 63 and encode each sample. A binary string of length 6 can represent any value from 0 to 63(26 - 1); hence string length is chosen as 6 for example. Four encoded samples in the initial population are:      

                                    5 = 0 0 0 1 0 1

                                    60 = 1 1 1 1 0 0

                                    33 = 1 0 0 0 0 1

                                    8 = 0 0 1 0 0 0

These individuals are sorted according to their fitness values.( They are arranged in the descending order of their fitness values). In this case, fitness value is same as the cost function value. The sorted individuals are given as :

No.

x

String

Fitness

1

60

1 1 1 1 0 0

-140

2

5

0 0 0 1 0 1

-195

3

8

0 0 1 0 0 0

-348

4

33

1 0 0 0 0 1

-923

 

In the 1st generation, the 1st and 2nd strings are corssed over at site 3 (crossover stie randomly selected) to get two new strings:

 

crossover site             newstrings          fitness of new strings

        

1 1 1 1 0 0              1 1 1 1 0 1                 -83

0 0 0 1 0 1              0 0 0 1 0 0                 -140

                       

 

Similarly 3rd and 4th strings are crossed over at crossover site , to get:

 

crossover site         new strings             fitness of new strings

           

0 0   1 0 0 0           0 0 0 0 0 1                                     37

1 0   0 0 0 1           1 0 1 0 0 0                                     -860

 storing these new individuals we get

No.

x

String

Fitness

1

1

0  0  0  0  0  1

37

2

61

1  1  1  1  0  1

-83

3

4

0  0  0  1  0  0 

-140

4

40

1  0  1  0  0  0

-860

 

 

We see that in one generation fitness is improved from -140 to 37 (f(60) < f(1)). Before proceeding to the next step, the weakest member in the ppulaion is replaced by the fittest member of the previous population, that is, string 1 1 1 1 0 0, whose fitness is -140.

 

In 2nd  generation, the 1st and 2nd strings are crossed over at site 1, to get:

crossover site               new strings                   fitness of new strings

0  0 0 0 0 1                  0 1 1 1 0 1                               -195

1  1 1 1 0 1                  1 0 0 0 0 1                               -923

 

Similarly 3rd and 4th strings are crossed over at crossover site 3 to get:

crossover site               new strings                   fitness of new strings

0 0 0  1 0 0                  0 0 0 1 0 0                               -140

1 1 1  1 0 0                  1 1 1 1 0 0                               -140

           

Replacing the weekest member by the fittest member of the previous population (string 1 0 0 0 0 1 with fitness value of -923 is replaced by string 0 0 0 0 0 1 with fitness value of 37) and storing them according to the fitness values, we get:

 


            No.            x                            string                                     fitness    

 


            1               1                             0 0 0 0 0 1                               37      

            2               4                             0 0 0 1 0 0                               -140

            3               60                           1 1 1 1 0 0                               -140

            4               29                           0 1 1 1 0 1                               -915

 

In the 3rd generation, the above process  of crossover (at site=4) is repeated (not shown here). The new set of strings in the population, after replacement of the weakest by the fittest member of the previous population is given as:

 

 

 

 


            No.                    x                         string                                 fitness    

 

            1                         0                      0 0 0 0 0 0                        100                                                                                   

            2                        1                    0 0 0 0 0 1                       37 

            3            61                   1 1 1 1 0 1                         -83

            4                        5                       0 0 0 1 0 1                        -195

 

For simplicity, the probability of mutation is chosen as 0(zero). We see that as the  genetic operations continue from one generation to the next, improved solutions evolve. At x = 0, f(x) = 100, which is the desired result. Here, the problem which could have been solved using conventional computational optimization algorithms, is used to illustrate GA operations for the sake of simplicity.

            In general f can be any function of x,y,z..... . But in our case it is the energy function, because we are interested in minimizing the energy.

 

 

 

 

 

 

 

 

 

 

 

 

 

CHAPTER 4

                    Performance  Enhancement    Of   GA

 

        In the remainder of this section we investigate four improvements to the performance of GA.

There are various ways to enhance the performance of the GA, which are described as follows:

Original Evaluation                           200        9         8        8        4       1

Fitness is Evaluation                          200        9         8       8        4        1

Windowing with Minimum = 0          199        8         7        7        3       0

Windowing with Minimum = 10        199       10       10      10      10      10

Linear Fitness With Decrement = 1    100       99       98      97       96     95

Linear Fitness With Decrement = 20  100       80       60      40       20      1

        We have constructed the data in above table so that two common features of GA runs are exemplified. The first is the existence of a chromosome that is much better than its nearest competitor. This chromosome - 'a super individual' -will probably eliminate all its competitor. In a generation or two when fitness is equal to the evaluation, the  super individual will have had very few chances to try recombination with other population members, and the population will probably experience rapid convergence. The second feature is that there are several individuals with evaluations that are very close. GAs often finds sets of close competitors, especially at the end of their runs. When there are close competitors, we may well want to increase the amount of selection pressure, so that the best competitors are encouraged to reproduce more often.

4.1  Elitism

        We noted earlier that the best member of the population may fail to produce offspring in the next generation. The 'elitist' strategy fixes this potential source of loss by copying the best member of each generation into the succeeding generations. This strategy may increase the speed of domination of a population by a super individual, but on balance it appears to improve genetic performance.

4.2  Steady-State Reproduction

        When GA reproduces, it replaces its entire set of parents by their children. This generational replacement technique has some potential drawbacks. One is that even in an elitist strategy, many of the best individuals found may not reproduce at all and their genes may be lost. It is also possible that mutation or crossover may alter the best chromosomes genes so that whatever was good about them is lost. Neither of these outcomes is desirable. It works as follows.

1. Create  n  children through reproduction.

2. Delete  n  members of the population to make room for the children.

3. Evaluate and insert the children into the population.

4.3  Steady-State Without Duplicates

        Steady state with duplicates is a reproduction technique that discards the children that are duplicate of the current population rather than inserting them into the population. When we use this technique, every member of the population will be different. This technique allows much more use of  our allotted number of chromosomes by guaranteeing that reproduction never creates duplication in the population.

4.4  Windowing

        Find the minimum evaluation in the population. Assign each chromosome a fitness equal to the amount that it exceeds this minimum. Optionally, a minimum amount greater than the minimum value may be created as a guard against the lowest chromosomes having no chance at reproduction.

4.5  Linear Normalization

 The chromosomes are arranged in decreasing order of evaluation. The  fitness values are assigned a constant initial value which decreases in a  linear fashion as GA progresses. The constant value and  the rate of decrement are the parameters of this technique.

 

CHAPTER 5

                            Implementation And Results

        We wish to develop an implementation of GA suitable for protein folding. Thus, we seek to use th simplest model that still captures the essence of the important components of protein folding. The linear sequence is composed of equally spaced amino acids  of only two types: hydrophobic(1) and hydrophilic(0). This sequence is folded on a two dimensional square lattice on which at each point the chain can turn 90o left or right, or continue ahead. Figure below shows possible conformations of the 20 amino acid molecule : 1-0-1-0-0-1-1-0-1-0-0-1-0-1-1-0-0-1-0-1. The energy function is so chosen that it takes on a value of -1 for each direct contact(occupying neighboring non-diagonal lattice points) of non bonded hydrophobic amino acids. Under this energy function, low energy conformations are compact with a hydrophobic core: since hydrophobic-hydrophobic interactions are rewarded, the hydrophobic-1 residues tend to be on the inside of a low energy structure, while the hydrophilic-0 are forced to the surface follows. This energy function was chosen to mimic what happens in nature. Fully folded proteins always have a hydrophobic core buried inside and the hydrophilic amino acids are exposed to the surroundings.

           

           

                        o     o           

                        n      n     o    o    o                              -  0

                        o     n     n     n     n                               -  1

                        o     n     n     n     o

                        o     n     o

 

Energy Evaluation  =  -9

           

 

 

 

             o     n     o

             n      n    o

                      n     o                     

                              n

                              o    o

                                    n     o     o      o          

                                               n     n         n

                                                        n       o                                        

                         Energy Evaluation = -4

            In implementing GA one has to choose the appropriate method of encoding the data, the size of the population, the specific manner of applying the genetic operators, and the population pruning scheme. Our implementation of GA is unique in that the solutions are not encoded as binary strings but rather are the  confirmations themselves which are treated directly in the  sprit of genetic operators. The process starts with N extended structures. In each generation each structure is subjected to number of mutation steps. At the end of this mutation stage the crossover operation is performed. The chance p(Si) of a structure being selected for crossover is directly proportional to its energy value Ei, that is :

                p(Si)=Ei / (ĺ Ej ; j running from 1 to N).

Thus the lower confirmation energy have a higher chance of being selected. For pair of selected structures a random point is chosen along the sequence and the N-terminal portion of the first structure is connected to C-terminal  portion of the second structure as follows.

 

                       

 

 

 

 

 

 

(A)                                          (B)                                          (C)

            In the above example, the cut point was randomly chosen to be after residue 14. Joining the first 14 residues of (a) with the last 6 residues of (b) and applying a randomly chosen 270o  rotation at that point achieves the compact structure in (c). In this case energy value of the hybrid (c) is -9 .,lower than the energies -5 and -2 of  its “parents” . The hybrid was accepted if its energy is lower than averaged energies of is parents, or non deterministically accepted according to its energy increase.

            As there are three ways to join the parts together(connecting the chains with the angle of 0o, 90o or 270o), these possibilities are listed in random order to find one that is valid(that is where no residue from one structure occupies a lattice point used by a residue from the other). If none of the three ways lead to a self avoiding structure, then another pair is selected. Once a valid structure Sk is created, its energy Ek is evaluated and compared to the average energy    Eij=(Ei+Ej)/2   of its 'parents'. The structure is accepted if    Ek<=Eij,   or if the energy will be increased based on the decision:

                                    Rnd < exp[(Eij-Ek  -Pij  +Pk)/Ck].

Where Rnd is a random number between 0 and 1 and Ck is a parameter so chosen that it maintains the diversity of the population and prevents premature convergence to low energy conformations. Since the low energy conformations have a compact structure with rows and columns of amino acids,  the evaluation function gives higher weightage to those structures with higher schema(patterns of regularly arranged rows and columns). Pij  and Pk  reflect the corresponding schema of the parents and the child.

            This crossover operation is repeated until N-1 new accepted hybrid structures have been constructed to constitute the population of the next generation. In addition, the lowest energy conformation in each generation id directly replicated in the next generation. We allow higher acceptance rate for bad moves that increases the energy for mutation steps then for crossovers.

            The simulation was done using  four  sets of chains having 20, 36, 48 and 60 amino acids respectively. Separate programs were written to take care of these four sets. The input to these program is the population of  random solutions which were generated separately. The input data file includes the X and Y coordinates of each amino acid in the chain and its nature, hydrophobic or hydrophilic. The output data file gives the new coordinates of the hydrophobic and hydrophilic units which corresponds to the minimum energy configuration.

5.1  Results

            The energy minimization was done for four different cases, varying the length of amino acid units. The four different lengths  of amino acid were 20 units,36 units,48 units, and 60 units. The results obtained are as follows for hydrophobic(1) and hydrophilic(0)

5.1.1  20 Units

            Input Chain : 10100110100101100101

            Energy minimum: -9

            Conformation:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

5.1.2  36 Units

            Input Chain:000110011000001111111001100001100100

            Energy minimum: -14

            Conformation:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

5.1.3  48 Units

            Input Chain:001001100110000001111111111

000001100110010011111

            Energy minimum: -22

            Conformation:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

5.1.4  60 Units

            Input Chain:001110111111110001111111111010

001111111111110000111111011010

            Energy minimum: -34

            Conformation:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

           

 

            The disk attached contains the code and the data. The source codes corresponding to the various chain lengths 20, 36, 48 and 60 are pro20.for, pro36.for, pro48.for, pro60.for. The corresponding input files are pro20.in, pro36.in, pro48.in and pro60.in ; the output files are en20.out, en36.out, en48.out and en60.out. The codes are compiled using MS Fortran Powerstation on a Windows based PC.

           

 

            File Path                                                         Description                

            .\input\*.in                                                        Input Files

            .\code\*.for                                                      Fortran source code

            .\output\*.out                                                    Output files

           

             

      

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

REFERENCES

1.         Mitchell A. Potter (1997) - The Design and Analysis of a          Computational Model for Cooperative Coevolution. Ph.D. thesis,             George Mason University, Fairfan, VA.

2.         Jerry Bala, Kenneth A. De Jong, Jeffrey Huang, Haleh Vafaie, and

            H. Wechsler (1997) - Using Learning to Facilitate the Evolution

            of Features for Recognizing Visual Concepts. In Special Issue of           Evolutionary Computation - Evolution, Learning, and Instinct : 100

            Years of  the Baldwin Effect, pages 297-311, MIT press.

3.         William M. Spears (1997) - Recombination Parameters. In the

             Handbook of Evolutionary Computation. T. Baeck, D. Fogel

            and Z. Michalewichz (editors), IOP Publishing and Oxford University

            Press.

4.         K. Deb and William M Spears (1997) - Speciation Methods in the

            Handbook of Evolutionary Computation. T. Baeck, D. Fogel and

             Z. Michalemicz  (editors),  IOP Publishing and Oxford University

            Press.

5.         Jayashree Sarma and Kenneth De Jong (1996) - An Analysis of the      Effect Neighborhood Size and  Shape on Local Selection Algorithms.

            Proceedings of  the Fourth International Conference on Parallel

            Problem Solving fro`m Nature. Sept. 22-26, 1996, Berlin, Germany.

6.         Mitchell A. Potter and Kenneth A. De  Jong (1995) Evolving Neural

            Networks with Collaborative Species. In Proceedings of the 1995         summer  Computer Simulation Conference. July 24-26, Ottawa,      Ontario, Canada , Pages 340-345. The Society for Computer

             Simulation.

7.         Yongliang Xiao and Donald E. Williams( July 1993) - Genetic

            Algorithm: A New Approach to the Prediction of the Structure of

            Molecular Clusters, Chemical Physics Letters, Vol 215, Number

            1.2.3, Pages 17-24.

8.         Ron Unger and John Moult (July1992) - Genetic Algorithms for

            Protein Folding Simulations, Journal of Molecular Biology, Vol 321,

            Pages 75-81.

9.         Jitendra R Raol and Abhijit Jalisatgi (Aug 1996) - From Genetics

            to Genetic Algotirhms - Solution to Optimisation Problems Using

            Natural Systems, Resonance,  Vol 1 Number 8, August 1996,

            Pages 43-54.

10.       Lawrence Davis(Ed. By) - Handbook of Genetic Algorithms,1991

            Van Nostrand Reinhold - New York .