A Neural Multiplier

A description of my neural network splash image and how it works.
Published August 22, 2025

I often like to include a splash image on my web pages - something that looks kinda cool to capture your attention, makes you wonder a little about what’s going on, and often has a little interactivity to suck you in just a bit more. Here’s a closer look at the Fall 2025 splash image from my main academic web page at marksmath.org.

The image is a neural network that performs a specific computation. You can hover over the nodes to find the values for each of them; you might pay particular attention to the two input nodes at the bottom and the one output node at the top. Perhaps you can guess the computation based on that, though the title of this post might give it away as well. You can also hit the redraw button, as the inputs are randomly generated with each redraw.

The purpose of this post is to explain what the neural network does, roughly how it works, and how it was built.

What’s the computation?

The neural network illustrated on this page multiplies a pair of numbers. If you hover over the two input nodes and the output node, you should find that the value at the top is, at least roughly, the product of the two values at the bottom.

Note that the computation is only approximate; the network’s output generally differs a bit from the exact product.

So the computation is imperfect, but I still think it makes for a nice elementary example of how you can build a neural network to perform (approximately) just about any computation you want.

How’s it work?

From here, the discussion gets quite a bit more technical.

How do neural networks generally work?

The drawing represents the most basic type of neural network, sometimes called a feedforward neural network. Each circle is called a node. The nodes are arranged in layers from bottom to top (though, the layers are more conventionally laid out from left to right). The bottom two nodes form the input layer and are assigned randomly chosen values \(x_1\) and \(x_2\); these are the numbers we want to multiply.

As we move from bottom to top, we see three more layers. The final layer contains one node, namely the output node. We’ll denote its value \(y\), which will be computed via the feedforward technique and is supposed to represent the result of the computation.

The middle two layers are called hidden layers and have six and four nodes respectively. Let’s denote the values of those nodes \[a_i \text{ for } i=1,\ldots,6 \text{ and } b_j \text{ for } j=1,\ldots,4.\] The main question now is, how do we compute the values of the subsequent nodes from the input nodes? The answer requires a bit more notation, the first piece of which involves the edges that connect the nodes from one layer to the next. Associated with each of these edges is a weight, which is just a numeric parameter.

To compute the values in the first hidden layer, we use the formula

\[ a_{j} = g_1\left(\sum_{i=1}^2 w_{i,j} x_i + c_{1,j}\right) \]

In this formula, each \(c_{1,j}\) is another numeric parameter associated with the nodes; it simply shifts the linear transformation given by the sum, producing an affine transformation. Each \(c_{1,j}\) is often called the bias associated with the node.

The function \(g_1\) is called an activation function. This can be just about any function we want, though there are certainly common choices. Crucially, they are generally chosen to be non-linear. They are really the only part of this whole process that lies outside the scope of linear algebra. Without a non-linear activation function, this whole process boils down to an overly complicated way to perform linear regression.

Well, that all seems complicated enough! Once we’ve got it down, though, we can move through the next two layers in the same fashion. Thus, each value \(b_j\) in the next layer can be computed as

\[ b_{j} = g_2\left(\sum_{i=1}^6 w_{i,j} a_i + c_{2,j}\right) \]

and the output can be computed as

\[ y = \sum_{i=1}^4 w_{i,j} b_i + c_{3,j}. \]

I’ve left off the activation function in the final layer because I’ve chosen it to be just the identity function anyway. I will choose the other activation functions \(g_1\) and \(g_2\) to be the hyperbolic tangent function. This is a common choice when you want the outputs to lie in a bounded interval. Of course, the product of two numbers between \(-1\) and \(1\) is again between \(-1\) and \(1\).
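
To make these formulas concrete, here’s a minimal NumPy sketch of the feedforward pass (this is my own illustration, not the code used to build the image). The weights and biases below are random placeholders, so the output is meaningless until they’ve been trained, but the structure of the computation is exactly the one described above.

import numpy as np

# Placeholder weights and biases; the real network uses trained values.
rng = np.random.default_rng(0)
W1, c1 = rng.normal(size=(2, 6)), rng.normal(size=6)  # input layer -> first hidden layer
W2, c2 = rng.normal(size=(6, 4)), rng.normal(size=4)  # first -> second hidden layer
W3, c3 = rng.normal(size=(4, 1)), rng.normal(size=1)  # second hidden layer -> output

def feedforward(x1, x2):
    x = np.array([x1, x2])
    a = np.tanh(x @ W1 + c1)  # a_j = g1(sum_i w_ij x_i + c_1j), with g1 = tanh
    b = np.tanh(a @ W2 + c2)  # b_j = g2(sum_i w_ij a_i + c_2j), with g2 = tanh
    y = b @ W3 + c3           # identity activation on the output
    return float(y[0])

feedforward(0.5, -0.3)  # not a product yet; the weights are untrained

Training amounts to adjusting W1, c1, W2, c2, W3, and c3 so that the output approximates the desired computation, which is the subject of the next section.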

Finding optimal parameters

The setup to this point includes a number of choices like the number of layers, the number of nodes per layer, and the activation functions. These are design choices that are typically guided by general principles and experimentation. Of course, my major design choice was that I wanted something that looks cool. Not the most common criterion, I’ll admit.

The weights on the edges and the biases on the nodes are different. These are unspecified numeric parameters, and we typically go through an optimization procedure to find values for them so that the computation works reasonably well. The general approach to doing this in machine learning is to feed the network a bunch of data and optimize it for that data. For the problem at hand, the data might look something like so:

  x1       x2       y
-0.392    0.806   -0.315952
 0.485    0.073    0.035405
-0.254    0.363   -0.092202
 0.821   -0.159   -0.130539
-0.227   -0.889    0.201803

Thus, the data is literally a list of samples. The idea is to choose the parameters (the weights and biases) to minimize the overall error. There are lots of ways to potentially measure error. This example is simple enough that total squared error should work well; that is, we take the error to be \[ \text{error} = \sum \left(y_{\text{predicted}} - y_{\text{actual}}\right)^2 \] Note that \(y_{\text{predicted}}\) is a function of the parameters and the sum is taken over all the data in the sample. We choose parameters that make this error as small as we can.
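
In code, that error computation is just a couple of lines; the predicted values here are made up for illustration.

import numpy as np

y_actual    = np.array([-0.315952, 0.035405, -0.092202, -0.130539, 0.201803])
y_predicted = np.array([-0.31, 0.04, -0.09, -0.13, 0.20])  # made-up predictions
error = np.sum((y_predicted - y_actual) ** 2)  # total squared error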

Code

As complicated as this might all seem, today’s high-quality machine learning libraries make it amazingly easy to do these things in practice. I built the network illustrated here using Python code that you can execute in this Colab notebook. Here are the critical steps:

Define the structure of the network

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

model = Sequential([
    Input(shape=(2,)),             # the two inputs x1 and x2
    Dense(6, activation='tanh'),   # first hidden layer: 6 nodes
    Dense(4, activation='tanh'),   # second hidden layer: 4 nodes
    Dense(1, activation='linear')  # output layer: 1 node, identity activation
])
model.compile(optimizer='adam', loss='mse')  # Adam optimizer, mean squared error loss

Note that the model is defined as a sequence of layers. There’s an input layer of shape 2 and three “Dense” layers with exactly the sizes and activation functions we’ve discussed. The last line specifies how the model should be optimized, namely with the Adam optimizer and mean squared error as the loss.
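
If you want to double-check the structure, model.summary() prints the layers along with their parameter counts, which you can compare with a hand count of the weights and biases: \(2\cdot6+6 = 18\) for the first hidden layer, \(6\cdot4+4 = 28\) for the second, and \(4\cdot1+1 = 5\) for the output, for 51 trainable parameters in all.

model.summary()  # lists the three Dense layers with 18, 28, and 5 parameters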

Build the data and fit the model

import numpy as np
# 10,000 random input pairs, each component uniform on [-1, 1]
X = np.random.uniform(-1, 1, (10000, 2))
# the target output is the product of the two inputs in each pair
y = (X[:, 0] * X[:, 1]).reshape(-1, 1)
model.fit(X, y, epochs=100, batch_size=32, verbose=2)

This bit of code builds input data with 10,000 samples and fits the model to it.

Have fun!

At this point, the model is really built and ready to use. Within Python or the Colab notebook, you could test the model using a command like so:

model.predict(np.array([[1,1], [0.1,0.5], [0.5,0.5]]))

Exporting the model to a form that’s appropriate for visualization is another issue. You can check the Colab notebook if you’re curious.
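
If you’re curious what that might look like, one simple approach (not necessarily exactly what the notebook does) is to grab the trained weights and biases with model.get_weights() and dump them to a JSON file that the JavaScript code can read.

import json

# get_weights() returns the weight matrices and bias vectors, in layer order,
# as NumPy arrays; convert them to plain lists so they can be serialized.
params = [w.tolist() for w in model.get_weights()]
with open('model.json', 'w') as f:
    json.dump(params, f)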

Interpreting and drawing the model is a further matter. That requires an implementation of the basic feedforward technique in JavaScript as well as some drawing tools. You can check my GitHub to see that, if you like.

A simpler model

Finally, it’s worth mentioning that there’s a much simpler neural network that we can use to multiply any two numbers and get exact results. The network diagram looks like so:

Note that the edges are labeled with their weights, each of which is \(\pm1\). The activation function for the hidden layer is just the square function, \(a\to a^2\), and the activation function for the result is the linear function \(y\to \frac{1}{4}y\). Finally, the bias for each node is zero.

Now, if we simply apply the feedforward technique using these definitions, we find that the network transforms the inputs \(x_1\) and \(x_2\) into \[ \frac{1}{4}\left((x_1+x_2)^2 - (x_1-x_2)^2\right). \]

Now, if you expand that out, I think you’ll find that you get \(x_1x_2\)!
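
You can also check it numerically. Here’s a quick sketch of that simpler network, following the description above (the function name is just for illustration):

import numpy as np

# The simpler exact multiplier: weights of +/-1, zero biases, a squaring
# activation on the hidden layer, and a 1/4 scaling on the output.
def exact_multiply(x1, x2):
    hidden = np.array([x1 + x2, x1 - x2]) ** 2  # (x1 + x2)^2 and (x1 - x2)^2
    return 0.25 * (hidden[0] - hidden[1])       # x1 * x2, exactly

exact_multiply(0.3, -0.7)  # -0.21 (up to floating point roundoff)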


So, why’d I go to the trouble to develop the larger, less accurate version?

Because it looks cool!

I guess it’s also nice that we’ve produced a relatively simple example of a feedforward neural network and illustrated how to train it with manufactured data to multiply numbers.

Comments

Anyone with a Bluesky account can leave comments on this site. Just hit the Reply on Bluesky button and post to that Bluesky discussion.