Skip to main content.

FunctionFinder

A symbolic genetic algorithm to find interpolating functions

 

Abstract

The problem of interpreting behavior of customers of the bank I work in has moved me to the need to find a behavioral model. There are different ways to extrapolate a model from rough data; cluster analysis and theory tests are some of them. I hadn't a database structured well enough to feed a cluster analysis, and had no trace of a theory of behavior.

For this reason, I developed an algorithm capable of generate a non-linear function in n-variables, interpolating supposed results from given sets of variables and known results. In other words, the genetic algorithm in question doesn't find a solution; it reveals a theory generating the data.

Thesis

Let's assume that we have a vector of k statistical couples Z, which assumes the form of:

Z=[ (X1, y1), (X2, y2), .. , (Xk, yk) ]

Where Xi is a vector of k variables Xi=[x1, x2, .., xn] and yn is an arbitrary real number.

We want to find an f(Xi) which is a theory that binds Rn to R1 (or we could say f(X): Rn ---> R1) so that f(Xi)=yi.

Note that a simple genetic algorithm is capable to find out that function f but it will not "say" what the function is. We could compare a genetic algorithm that evaluates an yk starting from an Xi as an "operator": an operator is a discrete system with no memory (no status variables), that receives in input a series of data, and learns to synthesize a "right" as value, like that: M(Xk) = yk. That is good if you want to interpolate the series, and you want to predict what will be a result when unknown input conditions are given. But is useless if you are willing to know what is the law of transformation between the output and the input.

To clarify the f(), the genetic algorithm must be symbolic.

That means that the genetic algorithm will not simply find a solution: it will find a way to find the solution, and it will tell us how to do. It will explain the rules to get a coherent result in a language familiar to us. The difference is that a simple genetic algorithm is a mechanism capable of obtaining a result, while a symbolic genetic algorithm is like a teacher, that once found a solution, will tell us how the solution have been reached.

The semantic value of this approach is three-dimensional, compared with the two-dimensional value of information given us by a simple genetic algorithm. In example, we could teach (with the same efforts) to a simple and to a symbolic genetic algorithm to play chess. The first one will learn how to beat us. The second one will tell us how to beat ourselves. That "how" makes great difference, because no information can be extrapolated from the first genetic, while in the latter case we'll have a way to increment how knowledge stock.

In that sense:

Symbolic genetic algorithms are knowledge producers

 

while simple genetic algorithms are information producers. The difference is that knowledge is (or can) be shared, while information generated by simple genetic algorithms is retained inside themselves, and can be "externalized" and transformed in knowledge only after a complex analysis (if it is ever possible).

 

The symbolic agents

First of all, while an "old styled" genetic agent behavior is based upon "genes", which are represented by integer numerical values, the "life base" of a symbolic agent is the self-meaning symbol. The locution "Self-meaning symbol" is here to say that the material used as genes (that regulates the behavior of the agent) has a symbolic value that is meaningful for the agent itself. We can hypothesizes that the symbol carries an intrinsic meaning by itself.

Every genetic algorithm has an user. If we can exclude the user form the analysis when the genetic is not symbolic, the knowledge generating symbolic genetic agent must have someone that receives the produced knowledge. Well, information becomes knowledge when it's shared across subjects, so there is no sense in saying that some knowledge is produced, if we don't put the user of that information material in the model.

So, the "intrinsic" meaning of the symbols used by the agent can be shared with the user, or can be useful only for the agent itself; in any case, there will be a way to translate the symbol in a human readable form, if the meaning carried by the symbol can't be immediately received by the end user.

The interesting thing is that a self-meaning does not exclude the ability to create a meta-meaning for a group of related symbols. The word Blue, i.e., carries several self-meaning; which of them will be used, is determined by the context, which is a meta-meaning. I could say "She has deep blue eyes", and "Today, she feels blue". The self-meaning of blue, in these two sentences is different because the meta-meaning (the context) is different.

That is to say: as the simple genetic algorithm has a complex behavior, that doesn't directly depends on the single genes, but on their joint actions, so the symbolic genetic agents can create meta-meanings that are a combined result of the single self-meanings carried by the symbols.

The Function Finder algorithm

I will now illustrate how I obtained a working Function Finder, or a working symbolic genetic algorithm. I will assume this as an example of the potentials of this method; the algorithm is a poor function finder, and could be empowered and optimized in many aspects. But it works, and it should be a good base as a starting point.

The algorithm I developed is based upon a representation of mathematical symbols. Genetics agents have a genetic code that is composed by the following symbols:

  • Variables (one for each variable of the vector X)
  • Constants (each constant in the genetic code is different)
  • Arithmetical signs ('+', '-', '*', '/')
  • Nth Power
  • Natural logarithm
  • Negation

Precedence of the symbols (parenthesis) depends on how the sequence is mounted, or, if you prefer, on the context (meta-meaning).

On startup, a standard population is created. First of all, we create some agents with codes meaning base interpolation models: linear, quadratic, geometrical, and so on. The genetic code of the firsts agents will signify:

x1*k1+x2*k2+ .. + xn*kn

k1*x1^2+ k2*x2^2+ .. + kn*xn^2 and so on.

To fill up a base population (30 to 50 agents are a good match) the remaining agents are created completely randomly.

Then evolution is started.

Each turn, every agent is "evaluated": the "meaning" of it's genetic sequence is calculated (as a mathematical expression), feeding the agent with the Xi vector values. The result is compared with the corresponding yi value. This is repeated with all the k X vectors and the k y results; the square distances between the evaluated results and the ys are summed up; then the square root of this sum is reported to the mean of the ys.

In this way, every agent obtains a valuation of it's "suitability" to inference the ys starting from the Xs, expressed as a percent mean error. The lesser is the error, the better is the agent.

The world in which agents move in, (a mathematical space), hasn't any random element; for this reason, survival of the "wrong" agents is limited to a small number of turns. In my experiments, I got the best results with the lifespan of the worst elements limited to 0. I let only a part of the population to survive: the best 33% agents will carry on to the next turns.

After the selection has been performed, here comes the reproduction. I adopted a cross over reproduction algorithm with some degree of random mutation. Two Random parts of genetic code, coming from two random picked up survived agents, are combined to form a new agent. If the population is not complete after this step (setting the survival rate to 33% this is always true), the best agent is reproduced with a random mutation, then the second best is reproduced and so on, until we fill up the population. More over, in every generation I put in some random generated agents (one to 10% of max. population), so to maintain the degree of variety high.

The process is started over, until an agent gets 0,00% of mean error or until an iteration number limit is hit. I found that 10.000 iterations are often enough to reach a good result.

The winning agent is the one that has the lower mean error.

Constants optimization

Since constants are one of the genotype used in generating the symbolic agents, particular care must be taken. The first version of the algorithm I developed created "fixed" constants; an agent that carried a linear interpolation model would have been represented as "4*x1+12*x2..."; casual mutation had the duty to optimize the functions. In other words, if an agent mutated a constant, and had a lesser error value, it would have survived.

The result has been the following: whole populations become simple variant of a given function. The population reproduced over and over the best agent found in a early stage of the process, varying the given constants.

To avoid this problem, I had to bring constants to the rank of variables. When a new agent is born, the algorithm needs to optimize it's constants. From a starting random value, constants are brought to a best-fit value, using a linear optimization algorithm. This means that an agent carries a family of functions (determined by the attribution of a concrete value to the constants), and the best fitting one will "represent" the whole family.

The problem is: in a non-linear multidimensional worlds, there is no guarantee that a local optimum of a function may be the best global optimum (this phenomenon gets the name of "actractor"). An agent could have been ill-optimized, so that it's mean error figures (let's say) of 10%, while a better optimization (a better allocation of concrete values to the constants) could result in a mean error of 1%.

I didn't deal directly with this problem, but used randomness to lesser it's consequences: each time an agent is derived, I give different random values to the constants. In that way, optimization --could-- bring to different (hopefully better) results.

A more robust algorithm should handle this problem in a more robust fashion.

Function generation

To simplify and speed up the operations, I used this method: in each agent, the symbols are internally codified as a sequence of integer values; an "interpreter" transforms this sequence in a sequence of mathematical symbols. This is not the best way to handle the process because it adds a dimension of randomness which is not fully predictable in advance.

I.e. it's difficult to say what exactly will be the result of a cross over mutation. If you cross over "a*x1+b*x2" with "x1/a - x2/b" you could expect to get out something like "x1/a + b*x2". But with the interference of the translator, you could get an odd result, like "x1/a *log x2*b".

I don't know if this added random variety is good for the final result; for sure, I am willing to write a better symbol handling algorithm, and to paragon the two.

Symbolic optimization

The algorithm implements symbolic optimization, so that "log e^xn" is transformed into "xn", and "-(-xi)" is translated to "xi". Furthermore "xa - (-xb)" is translated to "xa + xb", and "(x^a)^b" will be transformed into "x^(a*b)".

Agent filtering

The above optimization is useful because I found that, without symbolic reduction, the algorithm tends to span equivalent agents through the population. In a way that resembles the constant optimization problem, when a good enough agent were found, the algorithm generated a series of mutated agents that had the same result of the first one, but written in different algebraic forms: ie. "x3+x2", "x3 - (-x2)", "x3 - (- log e^x2)" and so on.

Since the world in which the agents are moving (a given set of data) is deterministic, two equals or equivalent genotypes are only a reduction of variety of the population, that must be avoided at all costs. Before to let a newborn genetic agent in the population, the algorithm check for the existence of an equal or equivalent agent; if there is already a "copy" of the newborn genotype in the population, the new agent is discarded.

More than this, if there is an agent that "fails" some instantiation with the Xi vector (having some impossible result for same value of Xi, i.e. sqrt(x) for x < 0), the agent is given a strong penalty that will surely bring it to the very bottom of the population error ranking, causing it's death (provided that there are enough non-failing agents).