Neural network
with learning by backward error propagation
Colin Fahey
A biological neural network
1. Software
2. Introduction
This document describes how to implement an artificial neural network that can be trained to recognize patterns. The network learns by an algorithm that uses "backward error propagation".
This document includes basic demonstrations of learning by "backward error propagation", and has a link to computer code that implements the demonstrations. The computer code can be used to create complex neural networks. However, the code is intended only for demonstration purposes; an alternative implementation could reduce memory usage and increase speed.
3. Alternative to learning by backward error propagation
This document describes a model of a neural network that learns by an algorithm named "backward error propagation". This algorithm can require a very long time to learn various lessons. Also, this algorithm can fail to learn various lessons, depending on the random initial status of the neural network before training.
Learning by "associating active inputs" is an important alternative to learning by "backward error propagation". Learning by associating active inputs simply associates inputs that are simultaneously active. Such learning can be fast and reliable. However, for many practical purposes, there is no obvious way to use a neural network that learns by association, whereas there is an obvious way to use a network that learns by backward error propagation.
Some biological neural networks are known to learn by association of active inputs. Backward error propagation has not been observed in any biological neural network.
This document describes interesting uses for a neural network that learns by backward error propagation. However, learning by association is a very important alternative algorithm for learning. Designing a neural network that learns by association to solve a particular problem might be more difficult than designing an alternative neural network that learns by backward error propagation, but biological systems learn by association, and the learning ability of biological systems is evident.
4. Biological neuron
4.1 Neuron cell
A biological neuron ("multipolar" type, ~4 um cell body)
A neuron is a type of cell that has the ability to receive and transmit nerve signals.
Neurons are the basis of nervous systems, found in animals such as mammals, birds, fish, and insects.
Brains with memory and logic, and simple reflex systems, are both based on arrangements of neurons.
Neurons are also used to convey signals over long distances in a creature's body, such as from sensors to the brain, or from the brain to muscles.
The behavior of a biological neuron is very complex, but the following simplified description captures the basic principle:
The neuron accumulates signals received from other neurons, and if the total signal accumulation exceeds a threshold, the neuron transmits its own signals to other neurons.
4.2 Neuron parts
Parts of a biological neuron
Soma

The cell body of a neuron

Dendrites

Fibers with chemical receptors (inputs) that extend from the cell body of a neuron. A neuron typically has many dendrites, and dendrites often have many branches.

Axon

A fiber with chemical emitters (outputs) at its endpoint that extends from the cell body of the neuron. A neuron has a single axon, and the axon usually has very few branches.

Synapse

A configuration such that the axon of one neuron and the dendrites of another neuron are separated by a very small gap. In such a configuration, chemicals emitted by an axon of a neuron cross the synapse and are received by the dendrites of the other neuron. This is how neurons influence other neurons.

4.3 Neuron firing
A neuron accumulates chemical signals from its dendrites, and if the total chemical accumulation exceeds a threshold within a period of time, the neuron "fires", sending its own signal through its axon.
Some neurons are capable of firing pulses on the order of 100 Hz.
The signals passing through neurons involve accumulations of Sodium (Na), Potassium (K), and Chlorine (Cl) ions, and a resulting electrochemical potential (i.e., voltage).
The resting voltage (-70 mV) and firing voltage (+30 mV) can be measured or even influenced by conventional electrical circuitry.
The following is a voltage recording of a rat neuron firing at a rate of roughly 100 Hz when a single whisker is touched and held out of its resting position:
A rat neuron firing (100 Hz) due to holding a whisker.
Although the stimulus is constant, the neuron signal is rapid pulsing.
4.4 Neural network
The human brain has approximately 10^11 (100 billion) neurons.
Each neuron in the cerebellum receives input from as many as 10^4 (10000) synapses.
Although the axon and dendrites of a neuron often extend only a few micrometers away from the cell body, some axons are on the order of a meter in length.
A brain has neurons with relatively short axons grouped in areas or clusters.
A brain also has bundles of neurons with relatively long axons to link areas separated by centimeters.
Thus a hierarchical network of processing elements is formed.
4.5 Neural network status
The status of a network of neurons is both the way the neurons are connected and the signals at all of the synapses.
It is unclear how much status information would be lost if a brain were tranquilized into total inactivity for any amount of time.
One can imagine information sustained only by signals moving through the network, and not by the network connectivity itself, as in cellular automata simulations such as Conway's "Game of Life", simple Dynamic Random Access Memory (DRAM) chips, and echoes in a chamber.
4.6 Neural network learning
Conventional learning occurs when the properties of dendrites change at a synapse to become more or less efficient at receiving chemical signals from an axon.
The reasons for such changes are complicated, but the result is that a neuron requires a different combination of synapse inputs to trigger an output signal.
5. Artificial neuron
5.1 Definition
An "artificial neuron" is an algorithm or a physical device that implements a mathematical model inspired by the basic behavior of a biological neuron.
A neuron accumulates signals received from other neurons or inputs (e.g., sensors), and if the total signal accumulation exceeds a threshold, the neuron transmits a signal to other neurons or outputs (e.g., effectors).
Any mathematical model that incorporates the idea of accumulating multiple inputs and yielding a single output (one that accentuates the intensity of the input relative to some nominal level) can be used for pattern recognition.
Such models can be the basis of an artificial neuron.
If the influence of each input can be modified, then the model can support learning.
5.2 Activation function
An "activation function" is a mathematical function that converts input values below a particular value to a relatively low output value, and converts input values above a particular value to a relatively high output value.
An "activation function" is used to convert the weighted sum of input values of a neuron to a value that represents the output of the neuron.
A "sigmoid" function is a general class of smooth functions that asymptotically approach a lower limit for input values approaching negative infinity, and asymptotically approach an upper limit for input values approaching positive infinity.
One specific sigmoid function is the "logistic sigmoid" function:
The "Logistic Sigmoid" function: 1 / ( 1 + Exp( -x ) )
The "logistic sigmoid" function can be used as an "activation function" for a mathematical model of a neuron.
The mathematical derivative of the "logistic sigmoid" has a simple closed-form expression, making it easy to compute an associated learning formula.
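The logistic sigmoid and its derivative can be sketched in a few lines of Python (the demonstration code linked from this document is C#; this sketch is only an illustration):

```python
import math

def logistic_sigmoid(x):
    # Maps any real input into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def logistic_sigmoid_derivative(x):
    # The derivative can be written in terms of the function's own output:
    # s'(x) = s(x) * (1 - s(x)), which is why the learning formulas below
    # contain the factor Output * (1 - Output).
    s = logistic_sigmoid(x)
    return s * (1.0 - s)
```

The derivative is largest (0.25) at x = 0, where the output is most "undecided", and approaches zero as the output saturates.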
5.3 Neural network input
A "neural network input" represents an input to a neural network.
Neural network input
"Input" is the numeric value of the input.
5.4 Neural network output
A "neural network output" represents an output of a neural network.
Neural network output
"Output" is the numeric value of the output.
"Error" is a numeric value that represents the difference between the output value and a "Desired" value:
Error = (Output - Desired); // Derived from: Output = Desired + Error;
The "Desired" value represents a desired value, or an ideal value, or a correct value, that the neural network should produce as an output in response to specific inputs.
The error value is computed and assigned to "Error" by a training algorithm.
The error value is fed back to the neural network.
The neural network can adapt to reduce the difference between its outputs and the desired values; i.e., the neural network can learn, and can thus reduce future errors.
5.5 Neuron body
A "neuron body" represents the body of a neuron, which accumulates input contributions, and adds a bias, and transforms the resulting value by the "activation function" to produce an output value.
Neuron body
"InputAccumulator" is a value that represents the accumulated input from neuron links whose outputs are connected to the neuron body.
"Bias" is an adjustable value that is combined with the accumulated input value.
"Output" is a numeric value representing the output value of the neuron.
The output value is computed using the following formula:
Output = ActivationFunction( Bias + InputAccumulator );
"ErrorAccumulator" is a numeric value representing accumulated error.
Given a specific output value of the neuron body, and given a specific output error value, the accumulated error value is adjusted according to the following formula:
ErrorAccumulator += Output * (1 - Output) * OutputError;
"Rate" is a value that affects how the "Bias" value changes in response to the "ErrorAccumulator" value:
Bias += (-1) * Rate * ErrorAccumulator;
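The formulas above can be collected into a minimal Python sketch of a neuron body (class and member names are illustrative, not the document's code):

```python
import math

class NeuronBody:
    # Minimal sketch of the "neuron body" element described above.
    def __init__(self, bias=0.0, rate=0.5):
        self.bias = bias
        self.rate = rate
        self.input_accumulator = 0.0
        self.error_accumulator = 0.0
        self.output = 0.0

    def activate(self):
        # Output = ActivationFunction( Bias + InputAccumulator );
        x = self.bias + self.input_accumulator
        self.output = 1.0 / (1.0 + math.exp(-x))
        return self.output

    def accumulate_error(self, output_error):
        # ErrorAccumulator += Output * (1 - Output) * OutputError;
        self.error_accumulator += self.output * (1.0 - self.output) * output_error

    def update_bias(self):
        # Bias += (-1) * Rate * ErrorAccumulator;
        self.bias += -self.rate * self.error_accumulator
        self.error_accumulator = 0.0  # ready for the next training item
```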
5.6 Neuron link
A "neuron link" represents a link between:
(1) an input of the neural network and an input of a neuron body;
or,
(2) an output of a neuron body and an input of another neuron body;
or,
(3) an output of a neuron body and an output of the neural network.
Neuron link
"Input" is a cache of the input to the link.
"Weight" is an adjustable value that affects how signal values and error values propagate through the link.
"Output" is a cache of the output of the link.
The output value is computed using the following formula:
Output = Weight * Input;
"Error" is a cache of error of the link.
"WeightedError" is a cache of the error of the link, weighted by the weight factor:
WeightedError = Weight * Error;
"Rate" is a value that affects how the "Weight" value changes in response to the "Error" value and the "Input" value.
During neural network learning, the "Weight" value is adjusted in the following manner:
Weight += (-1) * Rate * Input * Error;
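A neuron link can be sketched the same way (again a Python illustration with invented names, not the document's code):

```python
class NeuronLink:
    # Minimal sketch of the "neuron link" element described above.
    def __init__(self, weight, rate=0.5):
        self.weight = weight
        self.rate = rate
        self.input = 0.0
        self.output = 0.0
        self.error = 0.0

    def propagate_forward(self, input_value):
        # Output = Weight * Input;
        self.input = input_value
        self.output = self.weight * self.input
        return self.output

    def propagate_error_backward(self, error):
        # WeightedError = Weight * Error;  (passed back to the previous element)
        self.error = error
        return self.weight * self.error

    def update_weight(self):
        # Weight += (-1) * Rate * Input * Error;
        self.weight += -self.rate * self.input * self.error
```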
5.7 Neural network
A "neural network" contains inputs, outputs, neuron bodies, and links.
The following image depicts a simple neural network, with two inputs, and two neuron bodies in a first layer, and a single neuron in a second layer, and one output.
Example of a neural network
During a simulation of a neural network, input values propagate forward through links and neuron bodies, and eventually arrive at outputs.
Example of forward propagation in a neural network
During training, error values are provided at the outputs, and these errors propagate backwards through the neural network, resulting in the modification of weights and biases in neuron bodies and links.
Example of backward error propagation in a neural network
5.8 Neural network simulation
Definition:
"Network simulation" is the procedure used to propagate network inputs through the links and neuron bodies until reaching the network outputs.
Network simulation involves the simulation of all of its constituent links and neuron bodies.
Simulations without loops or time:
There are many possible network configurations involving loops.
There are many neuron models that depend on time.
But some of the most common applications of artificial neurons involve neither loops nor time.
The following is a mathematical model of a neuron body:
Output = ActivationFunction( Bias + InputAccumulator );
With this neuron model, and a network without "loops", we simply start from the external inputs, compute outputs of the first layer of neurons, and supply those outputs as inputs to the next layer, compute outputs for that layer, and continue through layers of neurons until the final outputs are computed.
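For a loop-free layered network, that procedure amounts to a few nested loops. The following Python sketch represents each layer as a list of (weights, bias) pairs, one pair per neuron (an illustrative structure, not the document's code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate(inputs, layers):
    # Propagate values forward, one layer at a time; the outputs of
    # each layer become the inputs of the next layer.
    values = inputs
    for layer in layers:
        values = [
            sigmoid(bias + sum(w * v for w, v in zip(weights, values)))
            for weights, bias in layer
        ]
    return values
```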
Loops:
A network can have connections in the form of loops (or "cycles").
For example, the output of a neuron can be connected directly to an input of that same neuron, causing "feedback".
Another example is the output of neuron #1 being connected to the input of neuron #2, and the output of neuron #2 being connected to the input of neuron #1.
If you can start from some point in a network and trace a path through neurons and connections, obeying the one-way flow of the signals, and eventually arrive at that same starting point, then the path is a loop.
Loops introduce the interesting possibility of signals flowing around the network for indefinite periods of time.
Some simple models assume that it takes a specific amount of time for signals to pass through individual neurons.
In such models, signals circulate through loops with few neurons faster than signals circulate through loops with many neurons.
A neuron connected to itself will have the fastest signal circulation rate.
If a neuron has an input X, a weight W, a bias B, and a non-negative output Y (e.g., in the range 0.0 to 1.0), then we can form an oscillator simply by setting W = (-8) and B = +4 and connecting Y to X;
each time we simulate the neuron, the signal will be toggled to the opposite state.
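This oscillator can be verified numerically; the following Python sketch feeds the output of a single logistic-sigmoid neuron back into its own input with W = -8 and B = +4:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

W, B = -8.0, 4.0   # feedback weight and bias from the example above
y = 0.0            # initial output
history = []
for _ in range(6):
    y = sigmoid(B + W * y)   # the output is fed back as the input
    history.append(round(y, 3))
print(history)  # toggles between a high state (~0.98) and a low state (~0.02)
```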
A network with loops can be busy with activity even when it does not accept external signals (stimuli) as inputs.
The cellular automata rules of Conway's "Game of Life" could be implemented in a neural network, which gives you a small hint of the diversity of activity that can happen in a neural network with loops.
Finite-state machines (FSMs), oscillators, and volatile memory (in contrast to learning patterns via changing weights) are made possible by looping.
If a network has loops, we cannot update any outputs until we compute all outputs; thus, we require a temporary buffer to store computed outputs until we compute all outputs, and then we can commit the new output values to the neurons in the network.
Any method that updates outputs in the actual network in a progressive way, instead of an all-at-once way, introduces an arbitrary ordering in time that leads to chaos.
Physics simulations involving coupled entities, such as planets orbiting a star with mutual gravitational forces between all bodies, require the same kind of approach: compute the net forces on all bodies before updating any velocity and position.
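A buffered update step for a network with loops can be sketched as follows (Python; here weights[i][j] is the weight from neuron j to neuron i, an illustrative structure):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def step(outputs, weights, biases):
    # Compute every new output into a temporary buffer first ...
    new_outputs = [
        sigmoid(biases[i] + sum(w * o for w, o in zip(weights[i], outputs)))
        for i in range(len(outputs))
    ]
    # ... then commit all of them at once, so no neuron sees a mixture
    # of old and new values within a single simulation step.
    return new_outputs
```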
Time-dependence:
A simple network simulation typically involves inputs causing the desired outputs after a single simulation time step.
In such a simulation, we think in terms of "number of iterations" rather than "time in seconds".
There need not be any correspondence between iterations and a time scale.
A system might be designed to do a network simulation (iteration) only when new input is available, which might occur at irregular intervals of time.
However, consider a mathematical model of a neuron that attempts to simulate the pulsing output aspect of a biological neuron.
The pulsing might be characterized in terms of time, such as pulsing at a particular frequency or having pulses whose curve extends for a particular amount of time.
We can have other time-dependent elements in a mathematical model of a neuron, such as an input accumulator whose value gets contributions from inputs but has a leak proportional to its current value.
In general, we can find an electrical circuit analogy for elements that obey certain mathematical equations, and so one can regard a neuron as a circuit with resistors, capacitors, and a nonlinear amplifier.
Just as a circuit can exhibit complex time-dependent behavior, the output of a neuron can be regarded as a function that depends on its inputs and time in a complicated way.
5.9 Backward error propagation
Definition:
"Backward error propagation" is a mathematical procedure that starts with the error at the output of a neural network and propagates this error backwards through the network to yield output error values for all neurons in the network.
Backward error propagation formulae:
The error values at the neural network outputs are computed using the following formula:
Error = (Output - Desired); // Derived from: Output = Desired + Error;
The error accumulation in a neuron body is adjusted according to the output of the neuron body and the output error (specified by links connected to the neuron body).
Each output error value contributes to the error accumulator in the following manner:
ErrorAccumulator += Output * (1 - Output) * OutputError;
In a sense, all of the output errors at the next layer leak backwards through the input weights and accumulate at the output of a neuron in a previous layer.
This accumulated value is multiplied by a value that is greatest when the current output of the neuron is most neutral (most "undecided") and is least when the output of the neuron is most extreme (very "certain").
Weight change and bias change formulae:
The basis of learning is the adjustment of weights and bias values in an attempt to reduce future output errors.
Learning "Rate" is a numerical value that essentially indicates how quickly a neuron adjusts weight and bias values according to error values.
The following formula indicates how to change the weights of a neuron with a particular set of input values and its output error value:
Weight += (-1) * Rate * Input * Error;
The following formula indicates how to change the bias of a neuron given the current output error for the neuron:
Bias += (-1) * Rate * Error;
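A single update can be worked through numerically. The following Python sketch applies the formulas above to one neuron with illustrative values (inputs (1, 0), weights (0.5, 0.5), zero bias, rate 1.0, desired output 1):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = [1.0, 0.0]            # inputs
w = [0.5, 0.5]            # weights
bias, rate, desired = 0.0, 1.0, 1.0

output = sigmoid(bias + sum(wi * xi for wi, xi in zip(w, x)))   # ~0.6225
output_error = output - desired                  # Error = (Output - Desired)
error = output * (1.0 - output) * output_error   # error scaled by sigmoid slope

w = [wi + (-1.0) * rate * xi * error for wi, xi in zip(w, x)]
bias += (-1.0) * rate * error

new_output = sigmoid(bias + sum(wi * xi for wi, xi in zip(w, x)))
# new_output is larger than output: the update moved toward the desired value.
```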
6. Training a neural network
6.1 Training procedure
One can start with a trained network and continue to reduce output error with further training, but one often starts with an untrained network.
Before training, choose random values for all weights of all neurons in the network.
I observed problems when I randomly selected values in the interval [ -1.0, +1.0 ], and I did not have problems when I selected random values from the interval [ +0.1, +1.0 ].
I mention these observations, but they might be due to my mistakes.
The purpose of random weights is to mitigate the possibility of any pathological situations in a network.
If all neurons in a network started with the same weights, the network would have no basis for increasing differentiation between neurons.
I have observed that setting all bias values to zero (0.0) is acceptable.
A training session involves going through a training set many times, perhaps hundreds or thousands of times.
For each pass through the training set, we consider each item in the training set.
A training set item has a set of inputs, and a set of desired outputs.
We simulate the network, using the set of inputs specified by the training item.
The simulation yields output values.
We propagate the errors backwards through the neural network to compute the output errors for all neurons.
We update all weights and biases.
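The per-item procedure above can be sketched in Python. To keep the example small, it trains a single neuron on the (linearly separable) OR function rather than a multi-layer network; the structure of the loop is the point (illustrative code, not the document's C#):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Training set: inputs and desired output for the OR function.
training_set = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
                ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

random.seed(1)
w = [random.uniform(0.1, 1.0), random.uniform(0.1, 1.0)]  # positive random weights
bias, rate = 0.0, 1.0

for _ in range(5000):                 # many passes through the training set
    for x, desired in training_set:   # update after EVERY training item
        output = sigmoid(bias + sum(wi * xi for wi, xi in zip(w, x)))
        error = output * (1.0 - output) * (output - desired)
        w = [wi - rate * xi * error for wi, xi in zip(w, x)]
        bias -= rate * error

for x, desired in training_set:
    output = sigmoid(bias + sum(wi * xi for wi, xi in zip(w, x)))
    print(x, round(output, 3), "desired:", desired)
```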
Caution: One academic text that discussed neural networks advocated going through the entire training set while only summing up the weight changes and bias changes.
After going through the entire training set, we have a set of sums of weight changes and bias changes, and we use these sums to update all weights and biases.
Such sums could be huge for large training sets, and the resulting jump in weight space would be unreasonably large.
So I think dividing by the number of training items, to get average weight change values and average bias change values, would be reasonable.
There is something appealing about computing a single weight change vector that somehow takes the entire training set into consideration.
I don't know if I simply made a mistake in implementing the idea, but I nearly gave up entirely on neural networks because of how poorly things were turning out.
Then, when I tried the naive alternative, namely making updates upon every training item, things worked perfectly.
Considering the entire training set before doing an update has some advantages and disadvantages:
Advantage:
A single training item with extreme error (i.e., a bad training item) will not make a big contribution to the update, because it will be overwhelmed by the influence of the "good" data;
Disadvantage:
If N is the number of items in your training set, your rate of progress to the optimal weight vector will be divided by N.
Or, for a given distance you will have only a fraction of direction hints along the way compared to the naive approach;
Perhaps this technique will work for you, but try out the naive approach before you give up on neural networks in utter frustration!
6.2 Failure to reduce error
Training may fail to reduce the overall error for the training set.
It is important to detect a failure to reduce error.
The following list describes causes of failure to reduce error, and possible solutions.
The items in the list are listed in approximate order of probability, with the first item being most probable.
(1) The weight combination has reached a local minimum of the error surface, and is "stuck";
Solution : Start a new simulation with new random weights.
(2) The network has too few neurons or layers to encode all of the patterns in your training set;
Solution : Cautiously entertain the possibility of adding layers or neurons.
(3) One or more items in your training set contradicts or is grossly inconsistent with your other training items;
Solution : Check your data set for irregularities.
Find the test items that yield the most error for your trained network.
Look into techniques that average weight changes over the entire data set to reduce the influence of any bad cases.
(4) The learning rate is too high (anything over 1.0 is probably excessive), and the updates always overshoot the goal;
Solution : Reduce learning rate.
(5) The learning rate is too low (anything below 0.01 might be too small), and the network really is converging on the ideal weight combination, but too slowly;
Solution : Increase learning rate.
Training a two-layer, three-neuron network to match the exclusive-or (xor) function can, despite the simplicity of the function, fail to converge.
This can be surprising and frustrating.
However, the solution is to simply set all neuron link weights to new random values and then attempt to train the network again.
In the case of training a network to match the exclusive-or (xor) function, random positive weights seem to lead to successful learning every time, whereas certain combinations of positive and negative weights sometimes cause the training to fail dramatically.
The need to select new random initial weights to recover from a failure to converge is an unfortunate consequence of the nature of the learning procedure.
The learning procedure essentially searches for a global minimum by steepest descent on a surface, and the search can become trapped in a "local minimum" on that surface.
6.3 Overall training error
The overall error of a network can be characterized by the square root of the average of squared errors, or "root-mean-square" (RMS) error.
The error at any specific network output is given by the following formula:
Error = (Output - Desired);
The sum of squared errors for a single training item is given by the following formula:
double squaredError = 0.0;
foreach (NeuralNetworkOutput output in ListOfOutputs)
{
squaredError += (output.Error * output.Error);
}
The sum of squared errors for the entire set of items in a training set is the sum of squared errors of the individual items. The following code shows how the squared errors for the entire set of training items can be computed:
double squaredError = 0.0;
for
(
int trainingItemIndex = 0;
trainingItemIndex < totalTrainingItems;
trainingItemIndex++
)
{
trainingSet.SetAllInputNodeValues( trainingItemIndex );
Simulate( propagationIterations );
trainingSet.SetAllOutputNodeErrorValues( trainingItemIndex );
PropagateErrors( propagationIterations );
UpdateWeightsAndBiases();
foreach (NeuralNetworkOutput output in ListOfOutputs)
{
squaredError += (output.Error * output.Error);
}
}
The overall root-mean-square (RMS) error is given by the square root of the average of the squared errors:
double rmsError = Math.Sqrt( squaredError / (double)totalTrainingItems );
This value is one way to characterize the overall error of a network considering all training cases.
7. Learning
Learning occurs when the weight and bias values of neuron links and neuron bodies are adjusted in accordance with specified network inputs and the output error values.
Consider a neural network with two inputs (x1 and x2), and two links (with weights w1 and w2), and one neuron body, and one output (y).
Neural network with two inputs, and one neuron body, and one output
We train this neuron by supplying inputs, computing the output, computing the error, computing weight and bias changes, and updating the weights and the bias, arriving at new weights ( w1', w2' ).
There is a very interesting way to visualize this process.
We can regard the set of weights as a vector in a multidimensional space. For example, for two weights we have the vector W = (w1, w2) in a two-dimensional "weight space".
When weights are adjusted, we have a new weight vector W' = (w1',w2').
We can visualize this as a point W moving to a new point W' as part of a process to minimize output error.
Normally one wouldn't compute the output error for all possible weight combinations, because the hope is that the weight adjustment process will efficiently head toward the best combination.
However, let us plot the surface that essentially shows how well a neuron satisfies all items in a training set as a function of its two weights:
Sum of squared errors for a specified training set as a function of two weights (w1, w2)
Basically, the goal of learning is to descend to the lowest level of this surface, where error is minimized.
Once we find the point W = (w1, w2) that yields the minimum value on this surface, learning is finished and then we can simply use the trained neuron.
The following graph shows the output of a trained neuron as a function of all possible inputs X = (x1, x2):
Neuron output as a function of two inputs (x1, x2) for a weight combination that minimizes squared error
Even though the weighted sum for this two-input neuron is simply (w1*x1 + w2*x2), the activation function turns a simple rotated plane into a cliff.
This surface has the correct output values for all input combinations (x1, x2) specified by our training set.
But you can imagine how input vectors X = (x1, x2) similar to training values would also lead to the proper output values; this feature of neural networks is called "generalization" and is the main value of neural networks.
As we attempt to "descend" the surface of squared error, we must "leap before we look"!
We update the weight vector and bias, and then we evaluate the "height" of the surface at our new location.
One consequence of this is that we might move to a point with a more extreme error.
Another consequence is that it might take a while to descend back to the depth of our previous location.
The possibility of "leaping" to more extreme peaks and valleys of the error surface is directly related to the "learning rate", because the learning rate determines how much influence error values have on our weight and bias changes.
The following graph shows how increasing the learning rate hastens our arrival at lower positions on the squared error surface, where error is minimized.
The graph also shows that increasing the learning rate also introduces the possibility of making bad steps:
Short-term trend of root-mean-square (RMS) error for the entire training set over several training iterations, for learning rates 0.1, 0.5, 1.0, and 2.0.
Here is a graph of the root-mean-square output error of a multilayer network with a training set of 19386 items that experienced many bad steps on the path to the best weight vectors:
Training sometimes encounters spikes in the root-mean-square (RMS) error, when error increases for some iterations before resuming a decreasing trend.
Sometimes the trend is simply smooth convergence to the desired set of weights:
Trend of root-mean-square (RMS) error for the entire training set over several training iterations, for learning rates 0.1, 0.5, 1.0, and 2.0.
8. Example: Exclusive-or (xor)
"Exclusive-or" (xor) is a function that accepts two Boolean inputs and yields a single Boolean output according to the following table:
X1   X2   Y = xor( X1, X2 )
0    0    0
0    1    1
1    0    1
1    1    0
In general, a single neuron has inputs {x1, x2, ...}, entering through links with weights {w1, w2, ...}.
The neuron computes an intermediate quantity d = bias + (w1*x1 + w2*x2 + ...), which can be regarded as identifying which plane, in an infinite set of parallel planes, contains a specified point with coordinates {x1, x2, ...}.
The neuron computes an output value, y = ActivationFunction( d ), which has the effect of splitting the infinite set of parallel planes into two sets, with one set producing low output values, and the other set producing high output values.
Thus, a single neuron splits multidimensional space into two regions, separated by the plane bias + w1*x1 + w2*x2 + ... = 0, and assigns low output values to points in the region on one side of the plane, and assigns high output values to points in the region on the opposite side of the plane.
Thus, if two sets of points in multidimensional space have distinct classifications and can be completely separated by a plane, then a single neuron can be used to correctly classify points from those sets as belonging to one set or the other.
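For example, the AND function is linearly separable (only the point (1,1) lies on the "true" side), so a single neuron with hand-picked weights can classify its points. The following Python sketch uses illustrative weights, chosen large so that the sigmoid saturates:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def and_neuron(x1, x2):
    # The line 20*x1 + 20*x2 - 30 = 0 separates (1,1) from the
    # other three corner points, so this neuron computes AND.
    return sigmoid(20.0 * x1 + 20.0 * x2 - 30.0)
```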
The exclusive-or (xor) function classifies points in two-dimensional space (with coordinates (x1, x2)) such that points in the set { (0,0), (1,1) } are classified as producing an output of "0", and points in the set { (0,1), (1,0) } are classified as producing an output of "1".
There is no single "plane" (in this case, a line) that can separate those four points into the two sets.
Therefore, a single neuron cannot be used to classify points according to the exclusive-or (xor) function.
A single neuron can only split a space of points into two regions.
The exclusive-or (xor) function classifies points in a manner that essentially divides a two-dimensional space into three regions (or, alternatively, four regions).
Two neurons can split two-dimensional space into three regions (e.g., by two distinct parallel lines), and can thus be used to classify points according to the exclusive-or (xor) function.
A third neuron can be used to combine the outputs of the other two neurons into a single output.
The following neural network, with two inputs, and two neuron bodies in a first layer, and a single neuron in a second layer, and a single output, can be used to classify points according to the exclusive-or (xor) function.
The following neural network can either be trained to compute the exclusive-or (xor) function, or the neural network can simply have its weight and bias values assigned in a manner that produces the desired behavior.
A neural network capable of classifying points according to exclusive-or (xor)
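Such a hand-assigned set of weights and biases can be sketched directly (a Python illustration; the specific values below are one possible choice, and training would generally find different ones):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def xor_network(x1, x2):
    # First layer: two neurons.
    h1 = sigmoid(20.0 * x1 + 20.0 * x2 - 10.0)    # behaves like OR
    h2 = sigmoid(-20.0 * x1 - 20.0 * x2 + 30.0)   # behaves like NAND
    # Second layer: one neuron combining the two outputs.
    return sigmoid(20.0 * h1 + 20.0 * h2 - 30.0)  # OR AND NAND = XOR
```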
The computer code associated with this document demonstrates training the neural network shown in the diagram above to match the exclusive-or (xor) function.
The neural network sometimes fails to learn the function, but the software can simply be restarted to try learning with a new set of initial weights.
If the software successfully learns the exclusive-or (xor) function, then the output resembles the following:
x1 = 0.0000 x2 = 0.0000 y = 0.0172 error = 0.0172
x1 = 1.0000 x2 = 0.0000 y = 0.9802 error = 0.0198
x1 = 0.0000 x2 = 1.0000 y = 0.9839 error = 0.0161
x1 = 1.0000 x2 = 1.0000 y = 0.0154 error = 0.0154
The output (y) is within 2% of the desired value for each of the four combinations of the variables (x1, x2).
Although the network was trained to learn output values for only four combinations of variables (with values 0.0 and 1.0, representing Boolean values), the inputs to the neural network can be set to arbitrary floating-point values.
The following image shows the output of the trained neural network for many combinations of input values:
A neural network capable of classifying points according to exclusive-or (XOR)
The surface represents the output of the neural network for all possible input combinations (x1, x2) in the range [ -2.0, +2.0 ].
The output is close to 0.0 at the lowest areas of the surface, and the output is close to 1.0 at the highest areas of the surface.
Note that the surface is low at the points { (0,0), (1,1) }, and the surface is high at the points { (0,1), (1,0) }.
The network was only trained to produce desired outputs for four specific combinations of input variables, but the neural network also produces outputs for all other possible combinations of input values.
The ability of neural networks to produce reasonable responses for general cases after being trained for specific cases can be regarded as "generalization".
Any process that fits data points to a model, such as fitting points to a line or other curve, also produces a "generalizing" effect. Thus, the fact that fitting a neural network to produce desired outputs for specific lessons results in a kind of generalization is not extraordinary, but it is nonetheless interesting to observe the ability to generalize from specific cases.
9. Example: Tic-Tac-Toe ("Naughts and Crosses")
9.1 Introduction
Tic-Tac-Toe ("Naughts and Crosses") is a simple game played on a 3 * 3 grid of cells that can be marked with "O" or "X".
Players alternately place "O" and "X" marks in unoccupied cells until one of the players completes a row, column, or diagonal.
Because there are 3 rows, and 3 columns, and 2 diagonals, there are eight winning patterns for each player.
Tic-Tac-Toe board and winning patterns
It is trivial to write a recursive function that explores all possible Tic-Tac-Toe games, because the maximum duration of the game is nine moves.
At each point in the game we simply examine the results of moving in each of the remaining unoccupied cells.
Such a function can confirm that a Tic-Tac-Toe game played with "perfect players" will end with no winner.
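Such an exhaustive search can be sketched as follows (a Python sketch; the article's own code is C#, and these function names are illustrative):

```python
# Exhaustive game-tree search for Tic-Tac-Toe.
# board: list of 9 cells (row-major); +1 = protagonist, -1 = opponent, 0 = empty.

LINES = [(0,1,2), (3,4,5), (6,7,8),   # 3 rows
         (0,3,6), (1,4,7), (2,5,8),   # 3 columns
         (0,4,8), (2,4,6)]            # 2 diagonals (8 winning patterns total)

def winner(board):
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def best_score(board, player):
    """Best achievable outcome for 'player': +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w != 0:
        return w * player
    moves = [i for i in range(9) if board[i] == 0]
    if not moves:
        return 0  # board full: draw
    scores = []
    for i in moves:
        board[i] = player
        scores.append(-best_score(board, -player))  # opponent replies optimally
        board[i] = 0
    return max(scores)

# Perfect play from an empty board ends with no winner:
print(best_score([0] * 9, +1))  # 0
```

Because the game lasts at most nine moves, the full recursion visits only a few hundred thousand positions and finishes almost instantly.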
9.2 Training a neural network to indicate the best moves
A recursive function can explore all possible games and determine the best move for each board configuration.
We add each board configuration (inputs), and the best move (desired outputs), to a list of training items.
We then train the network to produce the desired outputs for each set of inputs.
The network will have 9 inputs corresponding to each cell of the grid, and the input values will be limited to the following values:
0 : Unoccupied cell
+1 : Protagonist player
-1 : Opponent player
The network will have 9 outputs corresponding to each cell of the grid, and the output values will be limited to the following values:
0 : Do not move here
1 : Move here
Eight outputs will be set to "0", and one output will be set to "1".
Thus, after training the neural network, a board configuration can be specified as input, and the neural network will indicate the best move.
The output closest to "1" will indicate the best move, and all other outputs should be close to "0".
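The encoding just described can be sketched as follows (Python; the helper names are illustrative, not from the article's code):

```python
# Encoding a Tic-Tac-Toe board for the neural network.
# Cells: 0 = unoccupied, +1 = protagonist, -1 = opponent.

def board_to_inputs(board):
    """9 cells (row-major) -> 9 network input values."""
    return [float(cell) for cell in board]

def best_move_to_outputs(cell_index):
    """Desired training outputs: 1.0 at the best-move cell, 0.0 elsewhere."""
    return [1.0 if i == cell_index else 0.0 for i in range(9)]

def outputs_to_best_move(outputs):
    """After training: the output closest to 1 indicates the best move."""
    return max(range(9), key=lambda i: outputs[i])

# Example: the outputs for Scenario #1 below peak at cell 8 (bottom-right),
# so that cell is selected as the best move.
outputs = [0.0001, 0.0, 0.0676, 0.0001, 0.0, 0.0, 0.0, 0.0, 0.9870]
print(outputs_to_best_move(outputs))  # 8
```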
In general, any function with Boolean parameters and Boolean outputs can be represented by a neural network with two layers of neurons.
The first layer of neurons can divide the multidimensional space into regions, and the second layer combines the region classifications to produce the appropriate output values.
The Tic-Tac-Toe neural network produces Boolean outputs, and although the inputs have three states ( -1, 0, +1 ), we could, in principle, convert these few discrete input values to a set of Boolean inputs.
Therefore, two layers of neurons should be sufficient to learn Tic-Tac-Toe.
Because the network has 9 outputs, there are 9 neuron bodies in the final (second) layer.
The only remaining design decision is the number of neuron bodies to put in the first layer of the neural network.
To make this decision, computer code can generate and train a neural network with N neurons in the first layer.
The ability of the neural network to learn the complete training set for Tic-Tac-Toe can be graphed.
The following graph shows the overall training set error during training for each of 48 different neural networks, with N = 1,2,...,48 neurons in the first layer.
Overall training set error during training, for N = 1,2,...,48 neurons in the first layer (N = 1 is at the top, and N = 48 is at the bottom, and most intermediate curves are lower for higher values of N)
Another way to visualize this trend is to form a surface from the sequence of curves:
Overall training set error during training, for N = 1,2,...,48 neurons in the first layer (N = 1 is at the back, and N = 48 is at the front)
Thus, we see that as we approach N = 48 neurons in the first layer, the network seems able to learn all training cases.
Fewer than 48 neurons seems insufficient to learn the complete set of cases.
For low numbers of neurons, each additional neuron significantly reduces the overall error.
However, when the number of neurons is close to the number required to learn the entire set of lessons, each additional neuron only reduces the error by a relatively small amount.
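The kind of experiment described above can be sketched with a minimal backward-error-propagation trainer (a Python sketch written for this note rather than taken from the article's C# code; for brevity, the sweep here runs over the tiny XOR training set instead of the 4520-item Tic-Tac-Toe set):

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_two_layer(n_hidden, lessons, iterations=2000, rate=0.5, seed=0):
    """Train a 2-input / n_hidden / 1-output network by backward error
    propagation; return the final overall error for the training set."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(n_hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = rng.uniform(-1, 1)
    for _ in range(iterations):
        for (x, target) in lessons:
            h = [sigmoid(w1[j][0]*x[0] + w1[j][1]*x[1] + b1[j]) for j in range(n_hidden)]
            y = sigmoid(sum(w2[j]*h[j] for j in range(n_hidden)) + b2)
            dy = (target - y) * y * (1 - y)          # output-layer error term
            for j in range(n_hidden):
                dh = dy * w2[j] * h[j] * (1 - h[j])  # back-propagated error term
                w2[j] += rate * dy * h[j]
                w1[j][0] += rate * dh * x[0]
                w1[j][1] += rate * dh * x[1]
                b1[j] += rate * dh
            b2 += rate * dy
    error = 0.0
    for (x, target) in lessons:
        h = [sigmoid(w1[j][0]*x[0] + w1[j][1]*x[1] + b1[j]) for j in range(n_hidden)]
        y = sigmoid(sum(w2[j]*h[j] for j in range(n_hidden)) + b2)
        error += abs(target - y)
    return error

# Sweep N (first-layer neuron count) and report the final training-set error;
# final errors vary with the random seed, just as described in the text.
xor_lessons = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
for n in range(1, 5):
    print(n, round(train_two_layer(n, xor_lessons), 4))
```

The same loop structure, applied with N = 1, 2, ..., 48 and the Tic-Tac-Toe training items, produces the family of error curves graphed above.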
The following image shows a neural network with 9 inputs, and 48 neuron bodies in the first layer, and 9 neuron bodies in a second layer, and 9 outputs.
A neural network capable of learning to play Tic-Tac-Toe
The computer code associated with this document includes code to build and train the neural network shown above.
The training set has 4520 training items.
In 200 training iterations (involving 3 propagation steps, for a total of 200 * 4520 * 3 = 2,712,000 simulation steps and the same number of error propagation steps), the overall error decreased from 1.520 to 0.153.
(Those numbers can vary according to random initial conditions.)
The training required several minutes.
The following are two examples of specified inputs and produced outputs of the trained neural network:
Scenario #1:
Input:
1.0000 1.0000 0.0000
0.0000 1.0000 1.0000
1.0000 0.0000 0.0000
Best move:
0.0001 0.0000 0.0676
0.0001 0.0000 0.0000
0.0000 0.0000 0.9870
Scenario #2:
Input:
1.0000 1.0000 0.0000
1.0000 1.0000 0.0000
0.0000 0.0000 0.0000
Best move:
0.0000 0.0000 0.0859
0.0000 0.0000 0.9819
0.0000 0.0000 0.0000
The network was trained to produce the best moves for the player whose mark corresponds to "+1".
The best move for the opponent player, whose mark corresponds to "-1", can be found by multiplying all inputs by (-1) before simulating the neural network.
10. Training neural networks
The following is a quote from "Artificial Intelligence" (3rd edition; Addison Wesley; 1993), by Patrick Henry Winston, chapter 22, Learning by Training Neural Nets, p. 468.
Neural-Net Training is an Art
You now know that you face many choices after you decide to work on a problem by training a neural net using back propagation:
* How can you represent information in neural net terms?
How can you use neural net inputs to express what you know?
How can you use neural net outputs to determine what you want to know?
* How many neurons should you have in your neural net?
How many inputs?
How many outputs?
How many weights?
How many hidden layers?
* What rate parameter should you use in the back-propagation formula?
* Should you train your neural net in stages or simultaneously?
The wrong choices lead to poor performance.
A small neural net may not learn what you want it to learn.
A big net will learn slowly, may get stuck on local maxima, and may exhibit overfitting.
A small rate parameter may promote instability or provide poor predictions.
Unfortunately, the proper choices depend on the nature of the samples.
Mathematically, you can view the samples as representative glimpses of a hidden function, with one dimension for each input.
If there are many inputs, the function's multidimensional character makes the function hard to think about and impossible to visualize.
Accordingly, the best guide to your choices is trial and error, buttressed, if possible, by reference to the choices that have worked well in similar problems.
Thus, the successful deployment of neural-net technology requires time and experience.
Neural-net experts are artists; they are not mere handbook users.
11. Making input values and output values suitable for a neural network
The mathematical model of a neuron body presented in this document produces an output value using the "logistic sigmoid function" (i.e., ( 1 / (1 + Exp(-x)) )).
Therefore, the outputs are limited to the range from 0.0 to 1.0.
So, the mathematical model of a neural network presented in this document can be used directly to learn and produce outputs in the range from 0.0 to 1.0, such as for Boolean outputs, or for continuous outputs limited to part of that range.
However, the network can also be used to learn and produce outputs in any limited range if the outputs are converted to and from the range from 0.0 to 1.0.
For example, suppose the final output values are limited to the range from P to Q, and suppose Z represents an arbitrary final output value limited to the range from P to Q, and suppose Y represents a corresponding neural network output limited to the range from 0.0 to 1.0.
The following formulas can be used to convert between the two types of output values:
Y = (Z - P) / (Q - P); // Q > P; Q != P
Z = P + (Q - P) * Y; // Q > P
Converting a final output to a neural network output is necessary during training.
Converting a neural network output to a final output is necessary during normal simulation.
Inputs to neural networks are not as limited as the outputs, because the links have weight values that can attenuate or amplify input values arbitrarily.
However, input values far outside the range from -1.0 to 1.0 might cause problems, because before the weights have a chance to adjust, the extreme values will first arrive as a parameter to the "logistic sigmoid function" ( 1 / (1 + Exp(-x)) ).
Extreme values to that function will make the output of the neuron very close to 0.0 or 1.0, and when errors propagate backward through the network to that neuron, the error will be weighted by (Output * (1 - Output)), which will be extremely small.
Therefore, converting original input values to the range from 0.0 to 1.0 before providing those values to the inputs of the neural network is probably a good idea.
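This saturation effect can be checked numerically (a small Python demonstration, not part of the article's code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# The backward-propagated error at a neuron is weighted by Output * (1 - Output),
# which collapses toward zero when the sigmoid input is extreme, so learning
# at a saturated neuron becomes extremely slow.
for x in [0.0, 2.0, 10.0, 50.0]:
    out = sigmoid(x)
    print(x, round(out, 6), out * (1.0 - out))
```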
Suppose the original input values are limited to the range from A to B, and suppose W represents an arbitrary original input value limited to the range from A to B, and suppose X represents a corresponding neural network input value that we choose to limit to the range from 0.0 to 1.0.
The following formula can be used to convert the original input value to a value appropriate for supplying to the neural network.
(There is no need to convert a neural network input value to a value in the original input range.)
X = (W - A) / (B - A); // B > A; B != A
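The input and output conversions can be collected into small helper functions (a Python sketch of the formulas above; the temperature range in the example is invented):

```python
# Linear range conversions between "real-world" values and the network's
# [0.0, 1.0] range, following the formulas in the text.

def output_to_network(z, p, q):
    """Final output range [P, Q] -> network output range [0, 1] (for training)."""
    return (z - p) / (q - p)   # requires Q > P

def network_to_output(y, p, q):
    """Network output range [0, 1] -> final output range [P, Q] (for simulation)."""
    return p + (q - p) * y     # requires Q > P

def input_to_network(w, a, b):
    """Original input range [A, B] -> network input range [0, 1]."""
    return (w - a) / (b - a)   # requires B > A

# Example: temperatures in [-40.0, +60.0] degrees mapped for the network:
print(input_to_network(10.0, -40.0, 60.0))  # 0.5
print(network_to_output(0.5, -40.0, 60.0))  # 10.0
```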
12. Rough Notes
This section is a collection of miscellaneous notes. I am too busy to
organize these notes right now, but I wanted to share these notes because
I think some of these ideas might be useful to you.
Over the past few years I have received email messages asking about various
aspects of using artificial neural networks. But, every single time I received such an
inquiry I felt obligated to advise learning about
alternatives to neural networks!
If you want to learn about various machine learning algorithms, then I highly
recommend the book "Artificial Intelligence: A Modern Approach (3rd edition)":
http://aima.cs.berkeley.edu/
On Amazon:
http://www.amazon.com/Artificial-Intelligence-Modern-Approach-Edition/dp/0136042597
That book describes all of the modern algorithms in a rigorous way, but also with
important quantitative and qualitative observations about the costs and benefits
of those algorithms, and their capabilities and limitations. Knowing what
algorithms *cannot* accomplish in practice is crucial!
Wikipedia provides an excellent overview of different kinds of
supervised learning algorithms.
http://en.wikipedia.org/wiki/Supervised_learning
Here are a few kinds of supervised learning algorithms:
Artificial neural networks:
http://en.wikipedia.org/wiki/Artificial_neural_network
Decision tree learning:
http://en.wikipedia.org/wiki/Decision_tree_learning
k-nearest neighbor algorithm:
http://en.wikipedia.org/wiki/Nearest_neighbor_%28pattern_recognition%29
Support Vector Machine (SVM):
http://en.wikipedia.org/wiki/Support_vector_machine
Bayesian inference:
http://en.wikipedia.org/wiki/Bayesian_inference
Hidden Markov Model (HMM):
http://en.wikipedia.org/wiki/Hidden_Markov_model
But, some "learning" or "pattern matching" tasks might be best performed by simple
data structures!
Hash table:
http://en.wikipedia.org/wiki/Hash_table
Bloom filter:
http://en.wikipedia.org/wiki/Bloom_filter
Because of the recent popularity of Support Vector Machine (SVM) applications,
I will offer a few more interesting links:
LIBSVM:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
The "libsvm" library was recently used in a web-camera squirrel-seeking water
cannon project:
http://www.slideshare.net/kgrandis/pycon-2012-militarizing-your-backyard-computer-vision-and-the-squirrel-hordes
Also, you can see that SVM is an active topic of interest today,
just by looking at the number and variety of related posts on StackOverflow, for example:
http://stackoverflow.com/questions/7155055/help-100-accuracy-with-libsvm
(Look at the sidebar with all of the related and recent questions about practical
SVM development.)
SVM has some of the same basic attributes of the "neural network" learning algorithm -
with hyperplanes being found to separate, and hence classify, clusters of data points.
HOWEVER, as the Wikipedia article points out (in the "Issues" section):
"The SVM is only directly applicable for two-class tasks. Therefore, algorithms that
reduce the multi-class task to several binary problems have to be applied"
For the Tic-Tac-Toe example in my article, there were 9 inputs (for the current states of
the cells of the 3x3 grid) and 9 outputs (to indicate whether or not the protagonist should
put a mark in each particular cell). But it seems like doing the same thing with SVM would
require 9 independent SVM instances, one for each output, to independently classify the
current input data as implying whether or not the move to that output cell should be done.
But, I haven't had the time to really study SVMs, yet.
I have also read in various places that SVM is extremely fast at learning a set of training
examples, and can also incorporate new lessons rapidly.
Anyhow, I currently know very little about SVMs, but I am well-aware of two major flaws of
the "error back-propagation" learning model presented in my article:
(1) The learning can be very slow; (2) The learning can randomly get stuck in a local
minimum of the "error surface", and thus never reach the optimal learning state (where
the error is minimized). I think SVM learns rapidly, and does not have the local-minimum
trap problem; but this is only based on faint recollections of comments I have read about
the algorithm. I apologize for mentioning something about which I am very uncertain,
but it's such an important possibility that I feel obligated to mention it.
Some day I'd like to make the same demonstrations in my "neural network" article with
both Hebb-based neurons (in an associative learning network) and SVMs, so that I could
compare these learning methods for these particular tasks. Clearly, a learning method
should be selected based on the nature of the learning task. After all, a plain array,
or hash table, or Bloom filter, or associative array, etc, could all be considered
"learning algorithms" with instant learning ability and 100% precision (but with no
ability to automatically interpolate or extrapolate, or cope with contradictions, etc).
There are also learning algorithms for formal logic inferences (e.g., essentially
databases of precise inferences made from premises and rules of logic), and probabilistic
inferences (i.e., where assumptions, and hence conclusions, are uncertain). So, if
you're proving rigorous math theorems, or making fuzzy inferences (where you can only
give probabilities to various premises), then there are algorithms for these
situations that are very different from neural networks and SVMs.
Yet another exciting area of machine learning is "genetic algorithms":
the idea of constructing little robots to search for better answers in a particular
space, where the construction of each search robot is based on a "genetic" combination
of two or more known-successful robots. That idea is a little bit "meta", because
it involves designing the thing that will design the thing that will find the best
solution, instead of just designing a thing to find the best solution.
In general, we can see that "evolution" eventually found good solutions to complex
problems using "genetic recombination", and we can see that biological neural networks
can do a lot of computing that copes with uncertainty and can discover similarities
between things, but I'm not sure that it's best to imitate these learning methods in
computer algorithms. I think it's becoming apparent that true "intelligent design"
can surpass the ignorant and massively-wasteful random experiments of "evolution", and
avoid the various limitations of biological neural networks (e.g., inability to store
information precisely; inability to assimilate large amounts of information quickly;
risk of similar memories becoming merged; etc).
* The phenomenon you are observing in the XOR learning case is the network state getting
stuck at a point that *seems* to be where the overall learning error is minimized,
but is merely a "local minimum". Think of the entire network state (i.e., all
link weights, and all neuron body bias values) as being represented by the position of
a little marble rolling on a hilly surface, where the height of the surface represents
"error". The training algorithm essentially provides the "gravity" that guides the
marble downhill - hopefully to a final position where overall error is minimized
(for all of the training examples collectively). If there is a little depression in
that error surface, the marble can get stuck, even when there is a deep valley nearby
(where the error is much lower).
One crude solution is to just put the marble somewhere else on this hilly playground,
and hope it rolls down to a place with lower overall error - and that's why my code
randomizes all weight values and bias values prior to being trained: to give the
marble a fresh chance to find a better solution. So, if you notice the error value
seems to get stuck, then you can just try again.
Now that might seem insane - to have an algorithm that might get stuck at random -
but this is the simplest practical solution without an algorithm that somehow is able
to develop a more global perspective of the error surface (i.e., an algorithm that
would have some hint about how to head toward the actual best answer, instead of relying
on the assumption that all points can roll downhill to the best answer).
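The random-restart strategy can be sketched on a one-dimensional toy "error surface" (Python; the surface and all constants are invented for illustration):

```python
import random

def error(w):
    # Invented one-dimensional "error surface" with two dips:
    # a shallow local minimum near w = +1.35 and a deeper one near w = -1.47.
    return w**4 - 4.0 * w**2 + w

def slope(w):
    return 4.0 * w**3 - 8.0 * w + 1.0

def descend(w, rate=0.01, steps=1000):
    # The "marble" only ever feels the local slope directly beneath it.
    for _ in range(steps):
        w -= rate * slope(w)
    return w

# Random restarts: put the marble somewhere else and let it roll again,
# keeping the best resting place found so far.
rng = random.Random(1)
best = None
for _ in range(10):
    w = descend(rng.uniform(-2.0, 2.0))
    if best is None or error(w) < error(best):
        best = w
print(round(best, 2))
```

A marble started on the right side of the surface settles in the shallow dip; a restart that happens to land on the left side rolls into the deeper valley, which is exactly the effect of re-randomizing the weights and biases.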
One other strategy to cope with the "local minimum" problem is to give the "marble"
a "momentum" aspect - to imitate the properties of a physical marble rolling downhill.
If a rolling marble gains enough speed (and hence momentum), it can roll through small
dips in the landscape without getting stuck, and continue coasting with a better chance
of finding a lower dip in the landscape. (If the low dip it managed to roll past
was in fact the lowest point accessible to the marble, then the marble might roll
beyond it a bit before rolling back because of the "gravity". It might require having
some "air friction" term to make the marble eventually settle into the dip without
forever oscillating through it or orbiting it.)
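A minimal sketch of the momentum idea (Python; the 0.9 momentum factor is an assumed value):

```python
def descend_with_momentum(slope, w, rate=0.01, momentum=0.9, steps=2000):
    """Gradient descent where the marble keeps a velocity. The momentum factor
    acts like inertia, and (1 - momentum) acts like the "air friction" that
    lets the marble eventually settle instead of oscillating forever."""
    v = 0.0
    for _ in range(steps):
        v = momentum * v - rate * slope(w)  # accumulate velocity downhill
        w += v
    return w

# On a simple bowl, error(w) = (w - 3)^2, the marble settles at the bottom:
print(round(descend_with_momentum(lambda w: 2.0 * (w - 3.0), 0.0), 4))  # 3.0
```

On a bumpier surface, the accumulated velocity is what lets the marble coast through a shallow dip instead of stopping in it.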
The thing to keep in mind is that the "marble" in this case has no way to form any
sort of global picture of the error landscape. The only information the "marble" has
is regarding the slope of the error surface underneath it! A totally local piece of
information. Not a lot to go on. It's like trying to solve a maze by looking through
a little hole, where all one can see is a single wall or corner. Raindrops falling
all over the place can slide down inclines, and accumulate in puddles - and some of
those drops find really low places to go, while others get trapped in high places.
How can that "problem" be solved without some sort of mechanism that is better than
simple gravity and momentum?
* As for the Tic-Tac-Toe case, the "error (hyper)surface" in the vast space of all
possible weight-bias configurations for the particular training data set is probably
very hilly indeed! Still, it's apparent that the hyperspatial marble is able to roll
around the curves and contours to find a fairly low error configuration. But I still
think it's getting stuck at a place that is not optimal - and I think it might take
a lot of random restarts of the training session to pick a starting configuration
with a downhill (if snaking) path to a much lower point on the error surface.
Sure, if there aren't enough neurons in the network to form enough hyperplanes to
subdivide the input combinations into subsets where all input combinations are
related in some way (depending on what output is desired), then the neural network
cannot adequately represent the complexity of the problem, and cannot reduce the
error below some particular level. (This is just like not being able to represent
a parabola with a formula that is only linear, or a cubic curve with a formula that
is only quadratic. You need more terms, or more control points, to represent
higher-order curves. Likewise, you need more neurons to represent higher-order patterns.)
I think the Tic-Tac-Toe error is due to getting stuck in a local minimum - because
I think my experiment involving adding more neurons in the intermediate layer seemed
to indicate that there was no benefit to adding more neurons beyond a particular
count (which implies that the network had reached a level of complexity capable of
representing all the patterns, and implies that adding *more* neurons would be
redundant and would risk "overfitting" (overcomplicating the model, leading to false
complexity in interpolated or extrapolated cases)).
The actual point of lowest error might be extremely hard for a rolling marble to
find, especially in a space that is so vast! There are so many places to get stuck,
and even momentum might not help. But...There might be a huge realm of places in
that vast space that are "pretty good". So, I think we both found "pretty good"
places in the Tic-Tac-Toe error space, and I think lots of training runs with
randomized initial states are likely to find "pretty good" solutions. But finding
much better solutions might require trying a *lot* of random initial values for all
of the weights and biases.
It would be awesome if the marble could just tunnel through anything and move
inevitably toward the best solution. But the only information the marble has is
totally local, and that local information doesn't provide any clue about the "big picture".
If there were some way of converting the whole training data set to a representation
that gave a more global view of the data that unambiguously and reliably gave some
hint of where the true minimum existed, then this would clearly be much better than
relying on local hints!
One idea I wanted to explore, and write about, in relation to my "neural network"
article, was neural networks where propagation time was involved - so, it would
take time for input signals to travel through links, and then through neuron bodies
(and then through subsequent links, etc). In the C# code with the article, I think
I simulate the network for enough time steps to ensure that the effects of current
inputs propagate to the outputs. But, obviously, these propagation times can
actually be exploited to do temporal processing. As mentioned in the article, a
simple oscillator can be implemented by a neuron with its output connected to its
input, where the link has a weight that inverts the signal. Time delays of various
lengths can be implemented by a series of links (or a series of linkandneuron
combinations). The richness of Conway's "Game of Life", or a simple digital
circuit (or even an 8086 CPU), or the learning of temporal patterns, would be
possible by exploiting the propagation times of the links and neurons.
(Digital circuit designers need to count the number of layers of gates between
input and output to determine the amount of time for the final output to be valid.
It would be the same here, with neurons instead of gates.)
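The inverting self-link oscillator mentioned above can be simulated in a few lines (a Python sketch; the weight -20 and bias +10 are assumed values chosen to saturate the sigmoid):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One neuron whose output feeds back into its own input through a link with a
# strongly negative (inverting) weight; the bias centers the feedback so the
# neuron flips its state on every propagation time step.
weight, bias = -20.0, 10.0

output = 0.0
history = []
for _ in range(8):
    output = sigmoid(weight * output + bias)  # one propagation time step
    history.append(round(output, 3))
print(history)  # alternates between ~1.0 and ~0.0 each time step
```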
Hi, EJ!
Wow! All this time I assumed that the Tic-Tac-Toe example learned enough to choose
the best move for all 4520 board states, despite the significant overall error,
after the 200 iterations shown in my example code. But, now that I check, I see that
is not the case, just as you indicated.
Of the total of 4520 board states, the following table indicates the number of
"mismatches" (i.e., where the "best move" selected by the neural network differs from
the "best move" specified in the training item) as a function of the number of
training iterations (where a single iteration means training the network for all
4520 items a single time):
training
iterations    Error    mismatches
       0               4226
       1     1.5117    3181
       2     0.8104    3181
       4     0.8086    3181
       8     0.6721    2754
      16     0.4334     538
      16     0.4636     712
      16     0.5029     966
      32     0.2904     195
      32     0.3137     287
      32     0.3369     333
      64     0.2122     117
      64     0.2707     248
      64     0.2683     254
     128     0.1260      26
     128     0.1628      91
     128     0.2449     232
(My original demo trained to 200 iterations)
     200     0.1531      90
     200     0.1585      97
     200     0.2343     225
     256     0.1481      88
     256     0.2285     221
     256     0.2291     226
     512     0.0751      17
     512     0.1437      88
     512     0.1438      88   (amazingly similar to the 0.1437 error case!)
    1024     0.0278       0   (Yay! Totally Awesome! The Tic-Tac-Toe MASTER!!1!)
    1024     0.1411      88   (yikes! no better than some of the 512 cases)
    1024     0.1420      89   (Lame!)
    1024     0.1767     134   (Whoa, dude! This marble got trapped in a lame place!)
I repeated some of the measurements to get a sense of the variability of the error
and mismatch total for a given number of training iterations.
Anyhow, clearly my default of 200 iterations in the example code is not enough
to eliminate mismatches! The mismatches range from (90/4520)=2% to (225/4520)=5%
in the three trials shown above.
But notice the stunning case of the low error of only 26 mismatches after only
128 iterations! That's only (26/4520) = 0.6% mismatches overall.
(I wish I had continued that particular run beyond 128 iterations, but I was just
doing these trials ad hoc, and stopped after the specified number of iterations
in each case; but this does mean that each example above is unrelated to any other
example in the table above, which is more revealing than if they were parts of the
same runs.)
With 512 iterations I encountered a case with only 17 mismatches; (17/4520) = 0.38% mismatches.
I found that 1024 *can* suffice to eliminate all mismatches, but I don't know
how likely/improbable it is to eliminate all mismatches in a reasonable number
of iterations. It's tempting to modify the demo program to have the *option*
of continuing to iterate until there are no mismatches. But, I wonder how many
iterations that would *typically* require. Also, it seems like there will be
many cases where even a ridiculous number of iterations would fail to make the
network reach a point of zero mismatches.
This experience also convinces me that a graph showing hundreds of training
error curves (as a function of iteration index), where each curve is for a
different random initial network state, would be very informative. A person
could see, at a glance, the kind of variability possible for training the
exact same network topology, but with different random initial network
weight-bias combinations. Also, it would be interesting to see how often
radical jumps occur (as when the rolling marble falls off a ledge to a much
more optimal state, or is misled into hopping UP onto a nearby higher ledge!).
OK...This makes me think that yet another interesting graph would be a histogram
of the number of cases of reaching zero mismatches for a given number of training
iterations. So, this histogram would probably have few or no nonzero counts for
bins for fewer than maybe 700 iterations, and maybe very few for above 1600
iterations (just guessing wildly), and maybe there would be a bell curve in the
middle of that range. I'm just guessing. Also, it's possible that some networks
will be truly stuck, even after 5000 iterations. Perhaps those would be
accumulated in a special "epic fail" bin. I think the ratio of "fail" runs to
"total" runs would be an interesting statistic in itself.
Thank you for prompting me to look at this stuff again. I feel bad that my
example program didn't do enough iterations to *reasonably* expect zero mismatches
for all 4520 training items. I wonder if I already knew that when I made the
demo code in the first place, and just didn't want the user to wait forever for
the program to show concrete results. (After all, 200 iterations take
2 minutes 25 seconds on my 3.33 GHz Core i7 CPU. It must have taken much
longer on the computer I used when I first wrote that code!)
Yes, the minimax recursive search for optimal moves arbitrarily considers
the upper-left cell as a candidate move before any of the other cells of the
board, and thus, arbitrarily, results in that corner being selected as the
optimal move for an empty board. Of course, from symmetry, all four corners
of the board are equivalent in that case.
This brings up an interesting possibility. As things are now, the network is
trained with a *single* "best move" for any given board configuration, even
when there might be *multiple* equallybest move options. This is acceptably
sufficient for optimal game play, assuming the network has successfully learned
all of these "best moves". But perhaps there would be some benefit to training
the network with all of the "equivalent" best moves, too. So, for an empty board,
all four corners would be given an equal (positive) score. The game play would
not be improved from a win/draw/lose standpoint, but perhaps the fact that the
training items reflected the equivalence of best moves (for configurations where
there is more than one best move option) would make the neural network better
reflect the "reality" of Tic-Tac-Toe; i.e., the network topology implied by the
link weights would have a symmetry that embodied Tic-Tac-Toe more naturally.
Perhaps the overall training error would be reduced, because less of the network
would need to be devoted to supporting the *arbitrary* choice of "best move" in
some cases, whereas training the network with data with all the symmetry that
exists in reality might reduce the "effort" to encode the lessons involving
subsequent moves. (And, if it's true that the current trained network doesn't
make the best move in all possible configurations at this point, then perhaps
this modified training example set would be just the thing to reduce the training
error enough for the network to make best choices for all possible configurations.)
Your comments also almost directly suggest the aesthetic benefit of somehow giving
the neural network the ability to randomly choose from among the best choices,
to introduce variety into game play. I am not sure how this can be implemented.
Yes, the network could be trained in a way such that all four corners would be
considered "best moves" on an empty board, for example, but the initial random
configuration of network weights and biases would inevitably lead to the trained
network giving slightly different outputs for those four corners, and only one of
those four floating-point numbers would be the maximum. The 9 output values could
be sorted, and perhaps only numbers within 5% of the maximum would be considered,
but choosing anything other than the actual maximum output value introduces the risk
that the selected move is actually inferior. (Of course the trained network, and
this 5% criterion, could be tested exhaustively to ensure that the criterion always
happens to work, but this would all be a rather artificial or meaningless demonstration.)
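A sketch of that "within 5% of the maximum" selection rule (in Python, separate from the actual project code; the output values and the 5% tolerance are illustrative assumptions):

```python
import random

def pick_near_best(outputs, tolerance=0.05):
    """Randomly choose among output cells whose value is within
    `tolerance` (as a fraction of the maximum) of the maximum output,
    to introduce variety when several cells are nearly tied."""
    m = max(outputs)
    candidates = [i for i, v in enumerate(outputs) if v >= m - tolerance * abs(m)]
    return random.choice(candidates)

# Cells 0 and 2 are within 5% of the maximum (1.0), so either may be chosen.
outputs = [0.97, 0.10, 1.00,
           0.20, 0.30, 0.00,
           0.50, 0.94, 0.20]
print(pick_near_best(outputs))  # prints 0 or 2, chosen at random
```

As noted above, this carries the risk that a near-maximum output is actually an inferior move, so the tolerance would have to be validated against the full set of board configurations.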
By the way, thank you for the compliment of my TicTacToe game tree recursive
search algorithm. I normally don't write code that is so terse and cryptic, but
I wanted really condensed code for that example, so that the code wouldn't distract
from the more important neural network code that I sought to demonstrate.
After getting your note, I downloaded the code from my web site to look at it again,
and, once again, I was spared a lot of effort by reading my own comments,
reinforcing my long-held belief in the virtue of detailed comments.
However, even so, I generally avoid coding in such a compact, cryptic manner.
There is some value in reducing the size of code on a screen (or a page), such
as enabling the reader to see more or all parts of the code at once, but all the
time a person spends interpreting/decoding the meaning of the code is a big waste
(and introduces the risk of misunderstanding). I'm a big fan of the "Don't Make Me Think"
philosophy (as in *superfluous/wasted* thinking effort, not as in the thinking
that might be needed to change/learn/grow).
Your idea of essentially permuting the "priority" of cells when training
different networks (e.g., {0,1,..,8} for one network, and {1,..,8,0} for another
network, etc) might encounter a permutation that happens to make it easier for
the network to learn the entire set of training examples. I think the idea I
described earlier, where *all* optimal moves are included in the outputs
suggested for a given input board configuration, instead of one arbitrarily-selected
"best" move (something your cell-priority permuting idea would partly address),
might create a more "realistic" symmetry in the function that the network is
being trained to fit, and might reduce the overall training error by eliminating
the arbitrary exclusion of equally-acceptable moves (and the strain that this
arbitrary exclusion, and imbalance, likely puts on the curve-fitting aspect of
the network).
Just to be clear, given the board configuration for an empty board:
0, 0, 0,
0, 0, 0,
0, 0, 0
my idea is to train the network to produce the following output:
1, 0, 1,
0, 0, 0,
1, 0, 1
instead of what is currently specified for that case:
1, 0, 0,
0, 0, 0,
0, 0, 0
The opportunities for multiple "best moves" for a given board configuration rapidly
decrease as more moves are made, because the moves tend to break the
symmetry of the board.
I've attached the modified neural network code, hacked to do 1024 iterations,
and check all 4520 training examples for mismatches between the neural network
"best move" selection and the training item "best move" choice. Also, I've added a
few explicit test board configuration scenarios to the two that were included in the
original code, along with a comparison with the corresponding training item, so
that a person can visually compare the neural network output with the corresponding
training item. (I copied the text of my two email responses to your notes to
the file Notes.txt in that project so that I will remember some of the modifications
I want to make to the article.)
There's one other topic that you were considering which seems interesting to me:
The possibility that the various trial runs are getting stuck in similar traps.
Obviously there are many configurations that are essentially the same, differing
only by a rotation of the board, or by rotations of the mirror image of the board.
(A transpose along the diagonal is just a horizontal flip followed by a counterclockwise
quarter turn. And a combination of a horizontal flip and a vertical flip is just a
rotation by a half turn. So, I think the 4 rotation cases, plus the 4 rotated flip
cases, cover all the simple geometric symmetry cases.) Some of those cases will have
some degeneracy, where the geometrically transformed version happens to match one
of the other geometrically transformed cases; e.g., if there is only a mark in the
center, all 8 geometric transformations result in that same configuration; and if
there is only a mark in a corner, then there are only 4 distinct configurations.
But, for the asymmetric configurations, there can be 8 total boards sharing the
same kind of strategic nature.
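Those eight transformations (4 rotations, plus 4 rotations of the mirror image) are easy to enumerate directly. The following Python sketch (separate from the actual project code) confirms the degeneracy counts mentioned above: 1 distinct board for a center-only mark, and 4 for a corner-only mark:

```python
def rotate(b):
    """Quarter-turn clockwise rotation of a row-major 3x3 board."""
    return [b[6], b[3], b[0],
            b[7], b[4], b[1],
            b[8], b[5], b[2]]

def flip(b):
    """Horizontal (left-right mirror) flip of a row-major 3x3 board."""
    return [b[2], b[1], b[0],
            b[5], b[4], b[3],
            b[8], b[7], b[6]]

def symmetry_class(board):
    """All distinct boards reachable by the 8 symmetry transforms of the
    square: 4 rotations plus 4 rotations of the mirror image."""
    variants = set()
    b = list(board)
    for _ in range(4):
        b = rotate(b)
        variants.add(tuple(b))
        variants.add(tuple(flip(b)))
    return variants

center_only = (0, 0, 0, 0, 1, 0, 0, 0, 0)
corner_only = (1, 0, 0, 0, 0, 0, 0, 0, 0)
print(len(symmetry_class(center_only)))  # -> 1 (fully degenerate)
print(len(symmetry_class(corner_only)))  # -> 4 (the four corners)
```

An asymmetric configuration yields the full 8 distinct boards sharing the same strategic nature, as described above.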
The way I'm training the network now, where I arbitrarily choose only one of the
possibly several optimal moves, the symmetry I mention in the previous paragraph
would likely not manifest in the trained neural network. And, as I speculated
earlier in this note, I think it might actually benefit the training of the neural
network to train it such that for each given input board configuration ALL optimal
moves are presented as desired outputs. So, for the empty board, ALL four corners
should be simultaneously recommended by the neural network. I think this will
create a symmetry in the function that the network will be trained to fit, and I
think this might accelerate the training, or make the training much more likely
to successfully find a place where there will be zero mismatches in a relatively
few number of training iterations (e.g., perhaps reaching this condition would
routinely take fewer than 1024 iterations, instead of being relatively unlikely).
Anyhow, there seem to be specific error plateaus that keep coming up across
independent training runs (with random initial states). It's difficult to
visualize the hyperspatial landscape the marble is rolling on, because there
are 9 inputs linked to 48 neurons (for 9*48=432 weights), and 48 bias values
in those 48 neurons, and 48 links to each of 9 neurons (for 48*9=432 weights),
and 9 bias values for those 9 neurons, for a grand total of
(432 + 48 + 432 + 9) = 921 independent parameters defining the overall network
(assuming the specific topology). That means the marble is drifting through
a 921-dimensional space! That's a lot of coordinates!
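The parameter tally above can be checked with a few lines of arithmetic (a sketch, assuming the 9-input, 48-hidden-neuron, 9-output topology described in this section):

```python
# Tally the free parameters in the 9 -> 48 -> 9 fully-connected network.
inputs, hidden, outputs = 9, 48, 9

layer1_weights = inputs * hidden    # 9 * 48 = 432 input-to-hidden weights
layer1_biases = hidden              # 48 hidden-neuron biases
layer2_weights = hidden * outputs   # 48 * 9 = 432 hidden-to-output weights
layer2_biases = outputs             # 9 output-neuron biases

total = layer1_weights + layer1_biases + layer2_weights + layer2_biases
print(total)  # -> 921
```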
But, really, it seems like each of the four corner neurons should essentially
have the same "logic"; i.e., if the board looks a particular way, relative to
the location of that corner, then the corner output neuron should recommend
moving there, or not. Likewise, each of the four edge output neurons (i.e.,
not a corner, and not the center) should have the same exact "logic" (where
the board is viewed relative to the location of each of those output neurons).
Then there's the center output neuron, which has a lot of degeneracy when
considering the various configurations relative to the center location.
So, it seems like there are only really THREE totally independent training
situations going on: CENTER, CORNER, EDGE. These are totally independent
training cases, but they will end up relying on the same 48-neuron first layer,
which does make the training of each of those general cases affect the others,
but eventually the first layer has to harmonize its input weights and biases
to give information that supports the needs of the output-layer neurons. It's a
funny emergent pattern, but the fact that I was able to train the network to
agree with all 4520 cases means that the network has enough free parameters to
learn everything without internal conflict making it impossible to reconcile
different needs.
13. References
The following books may be useful in understanding neural networks and their applications.
[1] Artificial Intelligence: A Modern Approach (3rd ed) (Stuart Russell and Peter Norvig; Prentice Hall; 2010)
Online: http://aima.cs.berkeley.edu/
On Amazon: http://www.amazon.com/Artificial-Intelligence-Modern-Approach-Edition/dp/0136042597
If you read only one book on artificial intelligence, this should be the book!
[2] Artificial Intelligence (3rd ed) (Patrick Henry Winston; Addison-Wesley; 1993)
This book is very interesting and useful.
The author describes many fascinating topics, with examples, with enough detail to enable you to use the concepts in practical situations.
The author describes neural networks and "back propagation" learning in significant detail, including mathematical derivations.
[3] Artificial Intelligence: A New Synthesis (Nils J Nilsson; Morgan Kaufmann; 1998)
This book, like reference [1], is a fascinating and practical introduction to Artificial Intelligence.
However, this book is slightly more concise than reference [1].
I think a person would benefit from reading both [1] and [3].
I would only recommend this book if you have a strong interest in all of Artificial Intelligence, whereas I recommend reference [1] even to those who are specifically interested in learning about neural networks.
[4] Focus on AI (Prima Tech / Prima Publishing)
Part of the game development series edited by Andre LaMothe.
This book is useful to game programmers, and is filled with computer code examples that demonstrate neural networks, genetic algorithms, and fuzzy logic, all applied to game character behavior.