Morphogenic network
From Wikinfo
A morphogenic network is a self-organizing neural network that organizes with respect to an adaptive tensor field: a continuous information space (see information geometry), covariant with physical space.
(See: Morphogenic network/Inspiration for morphogenic network for an explanation of the reasoning behind the network architecture.)
Notation Legend
- e(x) is the energy of neuron x
- v(x) is the output value of neuron x
- v*(x) is the output value of neuron x after going through all inline unary operators
- w(x,y) is the weight of the output of neuron y, after going through all inline unary operators, on the output of neuron x
- e_y(x) or e(y,x) is the energy of neuron x propagated to neuron y
Unless otherwise specified.
Basic operation of a single neuron
The output of a neuron in a morphogenic network is a linear combination (or "weighted sum") of its inputs. Each input, v*(y), has an adaptive weight, w(x,y). The output of a neuron, v(x), is the sum of the inputs times their respective weights:
- <math>v(x) = {\sum_{y}} w(x,y)v^{*}(y)</math>
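The weighted sum above can be sketched in a few lines of Python. This is only an illustration: the function name `neuron_output` and the use of plain lists for w(x,y) and v*(y) are assumptions, not part of the original description.

```python
def neuron_output(weights, inputs):
    """Return v(x) = sum over y of w(x,y) * v*(y).

    weights: list of w(x,y) for each input neuron y (hypothetical layout)
    inputs:  list of v*(y), the post-operator outputs of the input neurons
    """
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return sum(w * v for w, v in zip(weights, inputs))


# Three inputs with mixed-sign weights: 0.5*1.0 - 1.0*0.5 + 2.0*0.25
v_x = neuron_output([0.5, -1.0, 2.0], [1.0, 0.5, 0.25])
```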
Post-neuron unary operators
Neurons may be separated by a number of in-line (unary) operations on the output. Some practical such operators are:
- integration (over time)
- An integrator performs temporal bit-voting (Another way to look at this is that the integration field decomposes the output field over the time continuum, into temporal eigenfunctions).
- natural log
- exponential
- sigmoid function
- This is the most common.
In any case, to be capable of computation, a neural network requires nonlinear operators, with both positive and negative information divergence. What is meant by "information divergence"? If the operator is thought of as an optical lens, the information divergence is the natural log of the magnification factor of the part of the lens that the signal is focused through. The magnification factor of the lens is the rate of change of the output signal with respect to the input signal. That is, information divergence = c * ln(d out / d in).
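As a rough numerical illustration of the formula above, the following Python sketch estimates d out / d in with a central finite difference. The function name `information_divergence`, the step size `h`, and the default c = 1 (which gives a result in nats) are assumptions made for the example. It only applies where the operator is increasing, since the logarithm needs a positive slope.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, one of the operators listed above."""
    return 1.0 / (1.0 + math.exp(-x))


def information_divergence(op, x, c=1.0, h=1e-6):
    """Estimate c * ln(d out / d in) for the unary operator `op` at input x."""
    slope = (op(x + h) - op(x - h)) / (2.0 * h)  # numerical d out / d in
    if slope <= 0.0:
        raise ValueError("the log of the magnification needs a positive slope")
    return c * math.log(slope)


# The exponential magnifies for x > 0, so its divergence there is positive.
div_exp = information_divergence(math.exp, 2.0)
# The sigmoid's slope never exceeds 1/4, so its divergence is always negative.
div_sig = information_divergence(sigmoid, 0.0)
```

Under these assumptions a network that uses only sigmoids has negative divergence everywhere, which is one way to read the requirement for operators of both signs.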
Feedback (free energy flow)
Feedback is received from an external system. The external system measures the performance of the network according to predetermined criteria and sends feedback to the output neurons in proportion to this performance.
The voltage at the feedback points represents the flux of Gibbs free energy, e (or the divergence of the force vector field in state space, with + being a source and - being a sink).
Each neuron distributes its free energy among its input neurons, in proportion to the contribution (weight*signal) of each input, w(x,y)v*(y).
Why? The rate of change in e(y) with respect to time is the energy produced by neuron y. The energy produced by neuron y is the sum of all its energy contributions to other neurons.
or, more succinctly,
- <math>\partial e(y) / \partial t = {\sum_{x}} e(y,x)</math>
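The distribution rule described in this section can be sketched in Python as follows. The text only says the shares are proportional to w(x,y)v*(y); normalizing them so that they sum to e(x) is an assumption of this sketch, as are the names `distribute_energy` and `energy_rate`.

```python
def distribute_energy(e_x, weights, inputs):
    """Split e(x) among the inputs of neuron x.

    Returns a list of e(y,x), one per input y, proportional to w(x,y) * v*(y).
    Assumption: shares are normalized by the total contribution, which is v(x).
    """
    contributions = [w * v for w, v in zip(weights, inputs)]
    total = sum(contributions)  # equals v(x)
    if total == 0.0:
        return [0.0] * len(contributions)  # nothing to apportion
    return [e_x * c / total for c in contributions]


def energy_rate(shares_received):
    """Rate of change of e(y): the sum of the shares e(y,x) over all x."""
    return sum(shares_received)


# Neuron x has energy 2.0 and two inputs contributing 1.0 and 3.0.
shares = distribute_energy(2.0, [1.0, 3.0], [1.0, 1.0])
```

Note that with mixed-sign contributions the normalized shares can be negative or exceed e(x); the source does not say how that case should be handled.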
Spatial morphogenesis
Calculating information distance
The information deviation, i, measures how much state space lies between two neurons in which free energy can vary. Suppose, for instance, that each bit of deviation corresponds to a centimeter of space containing an energy gradient. Just as knowing the exact color of a pixel on a computer monitor amounts to having 24 bits of information, knowing the slope of the energy gradient in one "centimeter" of space amounts to having one bit of information. The information deviation is, analogously, the number of pixels between the two neurons. If it is high, one neuron is a car while the other is a refrigerator, or they are both voices discussing two different subjects.
The information deviation is calculated from the output signals of the neurons. It is the standard deviation of the difference between the rates of change of the logarithms of the two signals, v(x,t) and v(y,t), with respect to time (t):
- <math>i(x,y) = \mathrm{SD}_t \left[ \partial \ln v(x,t) / \partial t - \partial \ln v(y,t) / \partial t \right]</math>
If, for example, v(x,t) and v(y,t) are the values of a variable in a pair of identical strange attractors, i measures the exponential rate at which they diverge, that is, the rate at which one attractor loses its ability to predict the values of the other attractor's variables. This loss of information occurs at a constant bit rate proportional to i. (Hence the formula.)
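A discrete-time version of this calculation might look like the sketch below. It assumes strictly positive, uniformly sampled signals, approximates the time derivative with a forward difference, and uses the population standard deviation; the function names and the sampling interval `dt` are hypothetical.

```python
import math
import statistics


def log_rate(signal, dt):
    """Forward-difference estimate of d ln v / dt for a sampled signal."""
    return [(math.log(b) - math.log(a)) / dt for a, b in zip(signal, signal[1:])]


def information_deviation(vx, vy, dt=1.0):
    """Standard deviation over time of the difference in log-rates of vx and vy."""
    diffs = [rx - ry for rx, ry in zip(log_rate(vx, dt), log_rate(vy, dt))]
    return statistics.pstdev(diffs)


# Signals that differ only by a constant factor have the same log-rate,
# so their information deviation is (numerically) zero.
same = information_deviation([1.0, 2.0, 4.0, 8.0], [3.0, 6.0, 12.0, 24.0])
# A signal that grows in bursts deviates from a constant one.
apart = information_deviation([1.0, 2.0, 2.0, 4.0], [1.0, 1.0, 1.0, 1.0])
```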
Minimizing free energy
The slope of the free energy gradient between two neurons is simply the difference in free energy between the two neurons divided by the information deviation between them. Thus the local free energy gradient, v (not to be confused with the neuron output v(x)), is
- <math>v = \partial e / \partial i = \Delta e / \Delta i</math>
The weight space is embedded in the state space, and assumed to be locally linear in that space, so the change in weight value, w, per change in state (a.k.a., local weight gradient) is
- <math>\partial w / \partial i = \Delta w / \Delta i</math>
The weights in a neuron will flow through state space to a lower free-energy state, at a bit-rate proportional to the local free energy gradient:
- <math>\partial i / \partial w \cdot \partial w / \partial t = -cv</math>
Thus, a weight will vary at a rate of
- <math>\partial w / \partial t = -cv \cdot \partial w / \partial i</math>
or, more articulately:
- <math>\partial w / \partial t = -c \cdot \Delta w / \Delta i \cdot \Delta e / \Delta i</math>
The rate at which a weight changes is the sum of all contributions.
- If a neuron does not receive input from the source in question, it may still contribute to the weight adaptation of a neuron that does receive input from the source; however, the weight value for the contributing neuron is de facto fixed at zero. Thus, the contributing neuron would either suppress input from the source (attract the weight to zero: stabilize) or enhance it (repel the weight from zero: destabilize), depending on whether it had a lower or higher energy than that neuron, respectively.
This rate represents the mean, that is, the expectation, of the change in the weight per change in time, over all possible evolutions of the weight through a priori probability space (in accordance with E. T. Jaynes' principle of maximum entropy).
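The update rule above, summed over all contributing neurons, can be sketched in Python. Several details here are assumptions rather than statements from the source: the differences are taken as "other neuron minus this neuron" (Δw = w(z,y) - w(x,y), Δe = e(z) - e(x)), Δi is the information deviation i(x,z), missing weights are read as zero, and the dictionary layout and the name `weight_rate` are hypothetical.

```python
def weight_rate(x, y, w, e, i, c=1.0):
    """Rate of change of w(x,y): sum over neurons z of
    -c * (delta_w / delta_i) * (delta_e / delta_i).

    w: dict mapping (neuron, source) -> weight; absent entries count as zero
    e: dict mapping neuron -> free energy
    i: dict of dicts, i[x][z] = information deviation between x and z
    """
    rate = 0.0
    for z in e:
        if z == x:
            continue
        dist = i[x][z]
        if dist <= 0.0:
            continue  # no gradient is defined at zero deviation
        delta_w = w.get((z, y), 0.0) - w.get((x, y), 0.0)
        delta_e = e[z] - e[x]
        rate += -c * (delta_w / dist) * (delta_e / dist)
    return rate


e = {"x": 1.0, "z": 0.0}                    # z sits at lower free energy
w = {("x", "y"): 0.2, ("z", "y"): 0.8}
i = {"x": {"z": 2.0}}
rate = weight_rate("x", "y", w, e, i)       # positive: w(x,y) moves toward w(z,y)
```

With these sign conventions, a lower-energy neuron that has no connection to the source (weight read as zero) pulls w(x,y) toward zero, matching the stabilizing case described in the list item above.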
The units of the constant c are bits per second per volt.
- <math>c = -(\partial i / \partial t) / v</math>
This learning method may be considered Hebbian, and/or a form of virtual NeuroEvolution of Augmenting Topologies.
"Minimizing free energy" refers to the classical thermodynamics perspective of physics. A more apt information-theoretic perspective would be "maximizing cross-entropy" or "minimizing negentropy" between that system whose Lyapunov function is approximated by the feedback sent to the neural network, and the system being controlled by the neural network.
Temporal morphogenesis
Optionally, a neuron may adjust its weights in favor of a linear combination of its inputs that dissipates more free energy.
The rate of change of the a posteriori weighted input is proportional to the rate of change of the a priori weighted input times the energy flux from the neuron to that input.
Modified Temporal Morphogenesis
The temporal morphogenesis formula is designed to be analogous to the spatial morphogenesis formula, where the gradients are over time, rather than space.
- <math>e(z) - e(x) \to e(y,x,t) - e(y,x,t-dt) = de(y,x)/dt</math>
- <math>w(z,y) - w(x,y) \to w(x,y,t)v^{*} (y,t) - w(x,y,t-dt)v^{*} (y,t-dt) = d[w(x,y)v^{*} (y)]/dt</math>
- <math>i(x,y) \to 1</math>
- Also, <math>dw(x,y)/dt \to d[w(x,y)v^{*} (y)]/dt</math>
A more proper analog for the i(x,y) term might be
- <math>i(x,y) \to da(x)/dt</math>
Where da(x)/dt represents the bit rate of adaptation of neuron x, i.e. its information gradient in the time domain, discussed below.
Auto Annealing
A neuron's weights may become more stable or less stable, based on the neuron's free energy. Using annealing, when the neuron is closer to "equilibrium" (low free energy), the neuron's weights change more slowly.
Where e(x) is the energy of neuron x, v(x) is the value of neuron x, and w(x,y) is the weight of the output of neuron y on the output of neuron x (that is, <math> \partial v(x) / \partial v^{*}(y)</math>),
- <math>\mathrm{anneal} = e^{e(x)}</math>
- d' w(x,y) = d w(x,y) * anneal
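In code, the annealing step is a single multiplication. The sketch below assumes the exponential form of the anneal factor given above; the function name `annealed_step` is hypothetical.

```python
import math


def annealed_step(dw, e_x):
    """Scale a proposed weight change dw by anneal = exp(e(x))."""
    anneal = math.exp(e_x)
    return dw * anneal


# Near equilibrium (low free energy) the same proposed change is damped,
# while at high free energy it is amplified.
damped = annealed_step(0.1, -2.0)
amplified = annealed_step(0.1, 2.0)
```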
Dynamic Equilibrium
Dynamic equilibrium, as referred to herein, is when the rate of adaptation of a neuron matches the rate of (informational/architectural) change of its "optimal" configuration. (Its optimal configuration is assumed to be dynamic because the system that the neural network is operating on is assumed to be astable (non-equilibrium).)
Auto-annealing matches the second moment (variance) of the expected probability distribution of evolutions (i.e. differential paths) of a neuron's configuration with the variance of the local a posteriori probability distribution of change in the phase space location of the local energy minimum (i.e. the optimal dynamics of the network + controlled system assemblage), thereby achieving dynamic equilibrium.
The variance of the expected probability distribution of evolutions of the neuron's configuration is hereafter referred to as the rate of adaptation. The variance of the a posteriori probability distribution of change in the optimal dynamics of the network + controlled system assemblage is hereafter referred to as the information production rate in the source. Dynamic equilibrium, more concisely, is the ratio of the rate of adaptation to the rate of information production in the source, that a neuron asymptotically approaches.
This ratio may not be unity (1), but there are advantages in that (a) the adaptation rate is covariant with the e.p. rate in the source, and (b) the relative adaptation rates over the space of neurons are covariant with the e.p. rates over the space of e. sources. Thus, new (produced) information is distributed among the neurons according to their relative adaptation rates. That is, new information is absorbed more by those neurons that "should" adapt than by those that "shouldn't", in approximately due proportion; the network is flexible where it needs to be, and sturdy where it needs to be.
Optionally, the annealing formula may be:
- <math>\mathrm{anneal} = e^{\mathrm{time\_avg}(d e(x))}</math>
- *-time_avg( d e(x) ) may be thought of as the time-averaged rate of information entropy production, and the formula may thus be related to the Fluctuation Theorem in nonequilibrium statistical mechanics: Where p_r is the probability of evolution on differential path r,
- <math>p_r = e^{c \cdot t \cdot \mathrm{time\_avg}(\text{thermodynamic entropy production}(r))}</math>
- Since information entropy production is the additive inverse of thermodynamic entropy production,
- <math>p_r = e^{-c \cdot t \cdot \mathrm{time\_avg}(\text{information entropy production}(r))}</math>
- (This is a truism, as I(r) = -ln[p(r)].) Since the standard deviation of a normal distribution is proportional to the multiplicative inverse of the probability density of the mean, if r is the null path (no evolution),
- <math>SD = k \cdot e^{c \cdot t \cdot \mathrm{time\_avg}(\text{information entropy production}(r))} = k \cdot e^{\mathrm{time\_avg}(d e(x))}</math>
- And the average rate of information divergence is proportional to the standard deviation of the probability distribution of evolution vectors (where the magnitude of the vector represents the bit-rate). The anneal cofactor represents the local average rate of information divergence (local time rate; dt).
- or -
- <math>\mathrm{anneal} = e^{d e(x)}</math>
All of these are essentially the same equation, with a different time average rate parameter. The first one, <math>e^{e(x)}</math>, has an infinite time average rate; the second one, <math>e^{\mathrm{time\_avg}(d e(x))}</math>, has a finite time average rate; and the third, <math>e^{d e(x)}</math>, has a null (zero) time average rate.
From the perspective of artificial genetic evolution (such as the genetic algorithm), the anneal multiplier can be thought of as the genetic drift rate.
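The three anneal variants described in this section can be compared side by side with a small Python sketch. The source does not specify how time_avg is computed; an exponential moving average with a `decay` parameter is assumed here, and the class name `AnnealTracker` and the result keys are hypothetical.

```python
import math


class AnnealTracker:
    """Tracks e(x) over time and reports the three anneal multipliers."""

    def __init__(self, decay=0.9):
        # decay in [0, 1): weight kept from the previous running average (assumed).
        self.decay = decay
        self.prev_e = None
        self.avg_de = 0.0

    def update(self, e_x):
        # d e(x): change in energy since the previous step (0 on the first step).
        de = 0.0 if self.prev_e is None else e_x - self.prev_e
        self.prev_e = e_x
        self.avg_de = self.decay * self.avg_de + (1.0 - self.decay) * de
        return {
            "level": math.exp(e_x),             # exp(e(x))
            "averaged": math.exp(self.avg_de),  # exp(time_avg(d e(x)))
            "instant": math.exp(de),            # exp(d e(x))
        }


tracker = AnnealTracker(decay=0.5)
first = tracker.update(1.0)   # no history yet, so d e(x) is taken as 0
second = tracker.update(3.0)  # d e(x) = 2, running average = 1
```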
Summary
Ideally, the neurons operate according to these equations:
- <math>v(x) = {\sum_{y}} w(x,y)v^{*}(y)</math>
Adjustment to a priori energy level probabilities
The above summary tacitly assumes that all energy levels have an equal probability. Any a priori information helps improve the performance of the network. The system is expected to asymptotically approach the energy minima. Knowledge of the distribution of the energy minima may improve the system's performance, by providing a priori information about evolution probabilities. Assuming that lower energy level states are less diverse (varied), while higher energy states are more diverse, with a log-normal statistical relationship, multiplying the anneal parameter by the exponential of the energy level yields the corresponding correction for the a priori evolution probability field:
- <math>d a'(x) / dt = d a(x) / dt \cdot e^{e(x)}</math>
This provides the additional advantage of providing a means to control the adaption rate of the network: By biasing the feedback, the network can be made more or less stable.
Topology
Generally connected
The simplest topology for a neural network is a generally connected network. In a generally connected network, every neuron is connected to every neuron. In this case, each neuron receives the maximum information to adapt its weights. Furthermore, since a connection whose weight is adapted to zero is effectively "severed", any topology can be simulated by a generally connected network.
However, a generally connected network with N neurons has N^2 connections. Any other topology has fewer connections per neuron, and thus uses less circuitry, and thus can have more neurons per unit circuitry. Thus, less generally connected networks may be better suited to a given problem.
The topology of the network determines what weights, by not being connected, are fixed at zero (i.e. are null).
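The relationship between topology and null weights can be illustrated with a boolean connection mask. This is a sketch only: representing the topology as a nested list of booleans, and the names `fully_connected`, `connection_count` and `apply_topology`, are assumptions made for the example.

```python
def fully_connected(n):
    """Mask for a generally connected network: every neuron sees every neuron."""
    return [[True] * n for _ in range(n)]


def connection_count(mask):
    """Number of physical connections present in the mask."""
    return sum(1 for row in mask for connected in row if connected)


def apply_topology(weights, mask):
    """Fix at zero every weight whose connection is absent from the topology."""
    return [
        [w if connected else 0.0 for w, connected in zip(w_row, m_row)]
        for w_row, m_row in zip(weights, mask)
    ]


general = fully_connected(4)
n_connections = connection_count(general)  # N^2 connections for N = 4

# A sparser topology: each of 3 neurons only receives from itself and the next one.
ring = [[True, True, False], [False, True, True], [True, False, True]]
masked = apply_topology([[0.5, 0.5, 0.5]] * 3, ring)
```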
Criteria for topology
The physical connectedness (topology) determines the morphospace of the logical (virtual) connectedness of the network. That is, depending on how the network is connected, it will be easier for it to simulate some networks and harder for it to simulate others; the topology determines the relative complexity or negentropy of a given configuration. That is, the topology determines the amount of work that needs to be done for the network to evolve into a given logical configuration. To put it yet another way, it determines the topology of phase space. And to put it one more way, it determines the a priori (i.e. epistemic) probability distribution of possible configurations.
Thus, an optimal physical topology would be a statistical ensemble of the networks that the neural network is expected to simulate, therefore minimizing the expected work. That is, the optimal topology is that where the expected aggregate evolutionary path length (in bits) of a configuration, from the initial state (possibly a random state), is proportional to the log-likelihood of that configuration being "ideal" (being the one that minimizes free energy).
Layering
Each layer of neurons is a tessellated grid of neurons receiving input from below and sending its output above. Neighboring neurons share inputs, and thus they share parameters. Feedback, or "free energy", flows in the opposite direction: it is received from above and sent below.
Optimal topology for general computation
Ideally, the network would be scale free.
(hierarchical / self-similar: each neuron has 2 up connections and 4 down connections) (or a small-world network?)

