"Whenever you can, share. You never know who all will be able to see far away standing upon your shoulders!"
I write mainly on topics related to science and technology.
Sometimes, I create tools and animation.
Dec. 27, 2020
Author - manisar
Machine learning is about prediction.
In a sense, all of computer programming is that - in pretty much all the code we write, we want to get to a value (or a set of values) by feeding input to our code. But since we use deterministic formulae to calculate exact results, we generally don't call it prediction. In such cases, there exists a perfect, known relationship between the input and the output, and the formulae for getting the output are analytically derived (i.e. formulated using pure algebra) with all their coefficients known in advance.
The term prediction is used when there doesn't exist a perfect or known relationship between the input and the output, and we do interesting stuff to get the results. Generally, we employ one of two strategies: either we assume a specific, simple function and use the available examples only to fine-tune its coefficients (traditional regression), or we assume a flexible, general-purpose function built out of many simple pieces and let the examples reshape it as well as tune its coefficients (this, as we shall see, is machine learning).
For example, a perfect formula exists for getting temperature values in Celsius units by feeding in Fahrenheit data, and vice versa. In such cases we do not need any examples or what is known as the training data. But we are not that lucky in most cases. Consider the following.
Let's say we have a set of values of an independent variable $x$ - the carpet area of houses, and a dependent variable $y$ - the price of the house. What this means is that we have a few examples at hand, and we also know that the price of a house depends upon its area in some way. But other than that, there does not exist any perfect formula for this relationship. Nevertheless, it'll be of great value if we can somehow use this set to determine the price of any house based on its carpet area. Here, the set of the given independent variables along with the corresponding dependent variables is the training set. By any house, we mean houses with areas both within and outside the range of the areas included in the training set.
And, it doesn't matter if there is actually no such perfect formula in the universe or if we are just not there yet. The fact of the matter is that at present we do not have any perfect formula at hand that underlies this relationship, yet we need to be able to predict the price of any house given its carpet area.
In such cases, in order to be able to interpolate and extrapolate, the analyst starts by assuming a function that connects the dependent and independent variables. The idea is to guess this function as closely as possible. Closely here means that for the independent variables in the training set, the values predicted by our assumed function should be close to the actual values of the dependent variables in the training set. It's important to note that the actual relationship between the dependent and the independent variables can be extremely complex - even in the case of only one dependent and one independent variable.
So, basically there are two things to guess - the shape of the function and the coefficients. Finding the shape is like finding the underlying basic (say, the universal) relationship between the dependent and the independent variables, e.g. finding that gravity and distance are related by an inverse-square relationship. And, finding the coefficients is finding how this relationship is tuned to the data at our disposal, i.e. finding the gravitational constant ($G$) in the previous example.
Coming back to guessing the shape (and the coefficients), brute force is obviously useless here.
As we shall see, in both traditional (regression based) and machine learning based prediction, we start by assuming the shape of the function (the reason is explained below), and then we just find the coefficients that do most justice in explaining the training data using our function. The difference between the two techniques lies in how flexible the assumed shape is allowed to be, and in how the coefficients are then found - analytically in the former, by gradient descent in the latter, as the rest of this chapter explains.
If we start to look at ways of doing better than brute force (for finding both the shape and the coefficients of our sought-after function), the first thing that comes to mind is the cost function $\boldsymbol{C}$ - an aggregate indicator of how distant the given dependent variable(s) are from the values predicted by our sought-after function, i.e. of the quantities $y_i - \hat{y}_i \;\; \forall i$ in the training set (let's call each of these distances the error). For example, $C$ can be (at our discretion) the sum of squares of all errors (or their mean), or the mean absolute error (MAE), etc. How to choose one, and then use it?
For choosing $C$, since we are interested in just the overall distance of the predicted results from the actual values, there is no point in bringing in a complex function such as $\cos(y_i - \hat{y}_i) + e^{y_i - \hat{y}_i} + \cdots$ etc. Either the sum of squares of errors or the MAE is a good candidate. Between these two, since the former tends to penalize the bigger errors more than the latter (which is often not a bad thing), and is smoothly differentiable, we can happily go with that.
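Just to make this concrete, here is a minimal sketch (the numbers and variable names are mine, made up for illustration) of computing these candidate cost functions for a small training set:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # actual dependent variables from the training set
y_pred = np.array([2.8, 5.5, 7.0, 11.0])   # values predicted by our assumed function

errors = y_true - y_pred                   # y_i - y_hat_i for each training example

sse = np.sum(errors ** 2)                  # sum of squares of errors
mse = np.mean(errors ** 2)                 # ... or their mean (MSE)
mae = np.mean(np.abs(errors))              # mean absolute error (MAE)

print(sse, mse, mae)
```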
Now, how to use this $C$ in our analysis? Well, both the shape and the values of coefficients in the expression for $y$ determine the value of the cost function. That is, there is a relationship between $y$ and $C$, and we want to have $C$ as small as possible. Can we use this constraint to determine the shape and the coefficients of $y$?
The first thing to note is that the relationship between $y$ and $C$ is, theoretically, analytically derivable. But the euphoria ends here. The relationship is generally very complex, and becomes exceedingly so once we go beyond assuming a linear function for $y$, or once there is more than one independent variable.
But that aside, the question is - given $C = C(y)$, how can we find the $y$ for which $C$ is minimum? If we, for a moment, forget about the complexity of $y$, this is actually one of the basic use-cases of differentiation. Let me explain. $C = C(y)$ means that $C$ changes with $y$, but the magnitude of the change in $C$ (for a small change in $y$) can be different for different values of $y$. That is, the rate of change of $C$ w.r.t. $y$ varies with $y$. And at the values of $y$ for which $C$ is minimum, the rate of change of $C$ w.r.t. a small change in $y$ will be 0. This is because the rate of change of $C$ at such places has to change its direction (from +ve to -ve or vice versa) and hence is bound to pass through zero (we are assuming $C$ to be a smoothly differentiable function). This is simply stated by $dC/dy = 0$.
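Just to see this condition in action in a toy setting where the relationship happens to be fully known (a made-up example, not our house-price problem):

$$ C(y) = (y - 3)^2 + 2 \implies \frac{dC}{dy} = 2(y - 3) = 0 \implies y = 3, \quad C_{min} = 2 $$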
So, we can theoretically use the equation $dC/dy = 0$ to... find $y$? Eh, I know you can see the problem now. If we don't know $y$ in the first place, how can we even find $dC/dy$ and equate it to $0$? No, we can't. So we have to come back to guessing. This is what we'll have to admit - we have to guess the shape of the function, and then we can do something to adjust its coefficients.
It turns out that except for very simple choices of $y$, we cannot use $dC/dy = 0$ for finding the values of the coefficients. But the good news is, we can actually do that for a handful of simple functions!
At this point, note that if we start with a $y$ with pre-determined terms and to-be-adjusted coefficients $a_i$, we can write $C$ as $C = C(a_1, a_2, a_3, \ldots, a_n)$.
This is what is done in traditional, prediction-oriented, non-machine-learning programming - the coder restricts himself (he has to, for the reasons explained above) to a specific function (or a handful of them), fine-tunes its shape (i.e. adjusts its coefficients) using the training data, and then uses this refined function for prediction. The function and its general shape are known to the coder (since he starts by assuming one), and he uses the training data only for adjusting this function (i.e. finding the values of the coefficients) so that its results align as closely as possible with those of the training set. This is the realization of the first of the strategies mentioned at the beginning of this chapter.
The adjustment (rather, calculation) of the coefficients is done using formulae derived in advance - by equating $\partial{C}/\partial{a_i} = 0$. This kind of programming comes under the umbrella term Data Science. Note the use of $\partial{}$ instead of $d$ - we can do that safely since the $a_i$'s are independent of one another, which means we can independently change any (and all) of the $a_i$'s in order to minimize $C$.
We call this procedure regression, prefixed with the term linear if the assumed underlying function is linear, and so on. For a quick refresher, see Linear Regression and Quadratic Regression (coming soon) on this page.
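For instance (the numbers below are made up), numpy's polyfit routine performs exactly this kind of least-squares coefficient fitting when we hand it a degree-1 (linear) function:

```python
import numpy as np

# Hypothetical training set: carpet areas (sq. ft.) and prices (in thousands)
areas  = np.array([500, 750, 1000, 1250, 1500])
prices = np.array([150, 210,  285,  340,  410])

# Fit y = a*x + b by minimizing the sum of squared errors
a, b = np.polyfit(areas, prices, deg=1)

print(f"price ≈ {a:.3f} * area + {b:.3f}")
print("predicted price for 1100 sq. ft.:", a * 1100 + b)
```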
What if there were a way of adjusting the coefficients without needing to solve the equations $\partial{C}/\partial{a_i} = 0$? It turns out there is! And it is called Gradient Descent. This is our second helper.
Gradient descent means moving towards a minimum (or maximum) of a function (here, $C(a_i)$) by making small adjustments to the independent variables (the $a_i$'s) and watching how that affects the outcome of the function. This idea in itself, even if it sounds promising, is not of much value on its own. Random adjustments to the $a_i$'s will not take $C$ to a minimum before the end of the universe, especially when there are quite many $a_i$'s. We can do better - the adjustments need not be random.
Remember - a partial derivative of $C$ w.r.t. $a_i$ tells us how $C$ changes by changing $a_i$ slightly at any given value of $a_i$.
So, if we start with any random value of $a_i$, calculate $\partial{C}/\partial{a_i}$ (for this $a_i$), and choose to change $a_i$ by an amount that is both opposite and proportional to $\partial{C}/\partial{a_i}$, we can say that we have moved to a better value of $a_i$. How? If $\partial{C}/\partial{a_i}$ is +ve, our change to $a_i$ will be negative, so that $C$ decreases. If $\partial{C}/\partial{a_i}$ is -ve (which means $C$ decreases on increasing $a_i$), our change to $a_i$ will be positive, so that again $C$ decreases. Similarly, if $C$ is not sensitive to the current value of $a_i$, it suggests that $C$ might already be close to a minimum w.r.t. the current value of $a_i$, and so we should not be changing $a_i$ by a lot. And if $C$ is changing a lot w.r.t. the current value of $a_i$, we can change $a_i$ by a fair amount. This is the reason for changing $a_i$ proportionally to $\partial{C}/\partial{a_i}$.
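Here is a minimal sketch of this update rule for the linear house-price model (the data, scaling and proportionality constant are my own choices for illustration):

```python
import numpy as np

# Hypothetical training data (areas scaled down to keep the updates stable)
x = np.array([0.5, 0.75, 1.0, 1.25, 1.5])   # carpet areas, scaled
y = np.array([150, 210, 285, 340, 410.0])   # prices

a, b = 0.0, 0.0     # start with arbitrary coefficients
lr = 0.05           # the proportionality constant

for step in range(5000):
    y_hat = a * x + b
    # Partial derivatives of C = sum((y - y_hat)**2) w.r.t. a and b
    dC_da = -2 * np.sum((y - y_hat) * x)
    dC_db = -2 * np.sum(y - y_hat)
    # Change each coefficient opposite and proportional to its partial derivative
    a -= lr * dC_da
    b -= lr * dC_db

# For this data the analytically derived least-squares answer is a = 260, b = 19
print(a, b)
```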
Ok, let's pause for a second and review what we have gained. Earlier, we were solving $\partial{C}/\partial{a_i} = 0$ for finding the exact values of the $a_i$'s. Now we are not solving $\partial{C}/\partial{a_i} = 0$ (we are just calculating the values of $\partial{C}/\partial{a_i}$ - multiple times, true, but computers are made exactly for this!), and finding very good values of the $a_i$'s which may or may not be exact (i.e. $C$ may be very close to, but not exactly at, its minimum for these values of the $a_i$'s). This is a big relief. Depending upon the function (and for most of them), solving $\partial{C}/\partial{a_i} = 0$ can be practically impossible, but evaluating $\partial{C}/\partial{a_i}$ is not! The $\partial{C}/\partial{a_i}$ taken together for all the $a_i$'s are known as the gradient of $C$, denoted by $\nabla C$.
So, how to use this big breakthrough? It's simple to see - now we can exercise more freedom in choosing our function $y = y(x)$. But can we? There is still an endless variety of functions to choose from. Let's move on to two other phenomena that we can use to our benefit.
The first one is the method employed by the animal kingdom for doing predictions. If we look at the neurons in the brains of animals, we see that each of them is fairly simple in itself - it doesn't do any complicated calculation. It basically gets activated or not depending upon the connections coming from some of its connected neighbors, and if it is activated, it contributes to the activation of some of the other neurons (connected to it directly or indirectly). In a simplified sense, it basically goes by a linear function ${act} = z + b$, where $z$ is the net input coming to it, and $b$ is some inherent factor. This $act$ is then passed on to other neurons. Having so many of these working in this fashion actually results in the animal's ability to make very good predictions. So, what's going on here?
It turns out that a function that is made up of multiple linear functions can be used to approximate any function of any complexity (shape). This is the second phenomenon. Stated as such, this statement is actually trivially simple to understand. More on this a little while later.
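As a quick taste of it (a made-up sketch - the supplement at the end of this chapter treats the idea properly), here is a curve being mimicked by straight-line pieces joined at a handful of points:

```python
import numpy as np

# Approximate sin(x) on [0, 2*pi] with straight-line segments between a few sample points
knots_x = np.linspace(0, 2 * np.pi, 8)      # boundaries of the subdomains
knots_y = np.sin(knots_x)                   # value of the "complex" function at those points

x = np.linspace(0, 2 * np.pi, 1000)
piecewise_linear = np.interp(x, knots_x, knots_y)   # linear pieces joining the knots

print("max approximation error:", np.max(np.abs(piecewise_linear - np.sin(x))))
```

Adding more knots makes the error as small as we like.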
What are these two mutually reinforcing ideas asking us to do? You got it. Between using simple, general-purpose linear (or quadratic, etc.) functions and using overly-complex, completely-guessed, very-specific-to-the-given-training-data functions, there may be a middle way - using a general-purpose combination of linear functions that can be attuned (reshaped) to fit closely to the given training data. Buddha wins! But not so easily.
While mimicking a complex function using a set of disconnected, piecewise linear functions for the given training data is easily possible, it won't do the trick that easily. What we can do instead is compose the linear functions - feed the output of one into another.
By bringing in composition, we have made a deep philosophical change in our approach (of mimicking a complex function by linear ones). With simple addition of linear functions (in non-overlapping subdomains), the impact of each linear function was completely limited to the subdomain it was applied on. And, those functions were all disconnected from one another.
Now with composition, the linear functions become interconnected. And that means that each coefficient in each linear function now has a global impact. So, the resulting function, though still comprised of linear functions, is actually now a complex function with its shape sensitive to each of the coefficients - not region-wise but in totality. Any coefficient can now affect the shape of the function anywhere across the training domain.
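A tiny sketch (with made-up coefficients) of this interconnection:

```python
# Two linear functions composed: f(x) = a2 * (a1 * x + b1) + b2
def composed(x, a1, b1, a2, b2):
    inner = a1 * x + b1        # first linear function
    return a2 * inner + b2     # second linear function applied on top of the first

# Tweaking the single inner coefficient a1 changes the slope of the whole composite
# function, so its effect shows up across the entire domain - unlike the piecewise
# scheme, where a coefficient only mattered in its own subdomain.
for x in [0.5, 1.5, 2.5, 3.5]:
    print(x,
          composed(x, a1=1.0, b1=0.5, a2=2.0, b2=-1.0),
          composed(x, a1=1.5, b1=0.5, a2=2.0, b2=-1.0))
```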
This suggests that if we are able to find a set of coefficients for which our composite function mimics the actual function (the theoretical perfect function) closely, these coefficients, by not being local, are taking into account the changes happening (to the shape) everywhere across the domain - or in other words - these coefficients have learnt the shape of the function. Welcome! You have officially entered the realm of Machine Learning. If the learning part is true, using these coefficients to interpolate should work quite satisfactorily, and, if we are lucky, even extrapolation may work as well to some extent. The idea is promising, but it needs one final touch.
Let's pick one last piece from our biological-neuron analogy - the activation of each neuron, though related to a linear combination of its inputs, is not linear in itself. First of all, linear values can go from $-\infty$ to $+\infty$, but activation is more like having values between 0 and 1 (a loose argument). More importantly, the activation is generally not simply proportional to the input in a linear fashion. It may change a lot for some values of the input, but may flatten out near the extreme values. On the other hand, in the model we have developed so far, we hope to mimic the actual shape by using a bunch of linear functions. Linear functions are after all linear - they don't appreciate curves 😊.
Technically speaking, at present, our model is, by and large, a complex-but-still linear regression. Is there a chance we can introduce non-linearity into our model and breathe the freedom of non-linear regression (and thus make our model even more analogous to the working of biological neurons)? There, of course, is. We can feed the linear input + inherent bias into any non-linear function and take the output of that function as the activation of our artificial neuron. These functions are called, any guesses? Activation functions (our third helper). $\frac{1}{1+e^{-z}}$ is a common choice for such an activation function, and it is denoted by $\sigma(z)$. We'll use $\sigma$ in our analysis henceforth just to keep ourselves reminded about the non-linearity of the function (but note that there could be any such function in its place).
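A minimal sketch of such an artificial neuron (the names and numbers are mine):

```python
import numpy as np

def sigma(z):
    # Sigmoid activation: squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # Linear combination of the inputs plus the inherent bias, fed into the non-linearity
    z = np.dot(w, x) + b
    return sigma(z)

print(neuron(np.array([0.5, -1.2]), w=np.array([2.0, 0.7]), b=0.1))
```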
You might have noticed how slickly I glossed over an important detail. We saw how easy it is to mimic any function by using piecewise linear functions (although, to be honest, at the cost of losing continuity) - this phenomenon in general (and, in a modified form) is called universal approximation. And then we moved on to composite linear functions and even introduced non-linearity by using activation functions. Does the idea of universal approximation still hold? This is a very important aspect not to be overlooked, because if it doesn't, the complex, composite, convoluted function that we have arrived at will lose its significance, as it may not work in most cases. The good news is that the idea still holds! In fact, it now holds without losing continuity.
Have a quick look at the supplemental section Approximating Any Function using Linear and Squashing Functions.
So, in machine learning, we are using gradient descent over a highly convoluted composite function with numerous coefficients in order to find the values of those coefficients for which the cost function is at a minimum. The coefficients are, by convention, called weights, and we shall denote them by $w$ henceforth. And the proportionality constant used in gradient descent is called the learning rate - because it decides the magnitude of our movement towards the minimum in each iteration. A little reflection will tell us that we cannot simply make the learning rate as high as possible and quickly move to the minimum. Let's say the current value of $C$ is 50, while the minimum value (that we don't know yet) is 23. For simplicity, assume $\frac{\partial{C}}{\partial{w}}$ is close to 1 everywhere. If we choose a learning rate of 5, initially we'll see good progress, but we'll not be able to settle on any value between 20 and 25. On the other hand, if we have the learning rate as 1, our initial progress will be slow, but we shall be able to get to 23 - though, what's worse, with such small steps we may unwantedly settle in a local minimum.
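A small sketch of this trade-off on a deliberately simple one-coefficient cost, $C(w) = w^2$ (not our actual model - just to show the effect of the learning rate):

```python
def gradient_descent(lr, steps=50, w=10.0):
    # Minimize C(w) = w**2, whose derivative is 2*w
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(gradient_descent(lr=0.1))    # converges smoothly towards the minimum at w = 0
print(gradient_descent(lr=0.99))   # overshoots back and forth, creeping in only slowly
print(gradient_descent(lr=1.1))    # too large: every step overshoots more, and w blows up
```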
This is it. We have landed ourselves successfully into the world of machine learning. This is nothing else but a formal continuation of the second strategy mentioned in the beginning of this chapter. Remember our three helpers - the cost function $C$, the gradient descent $\nabla$ and the activation function $\sigma$.
In the next chapter we shall adorn our philosophical ideas with mathematical ornaments. We will also tackle some of the theoretical and practical hurdles that come up along the way.
In the meantime, if you want to engage ML into doing some predictions for you, try this.
This is an ongoing effort which means more chapters are being added and the current content is being enhanced wherever needed.
For attribution or use in other works, please cite this book as: Manisar, "Neural Networks - What's Happening? An Intuitive Introduction to Machine Learning", SKKN Press, 2020.
Supplement to section C1.4 The Traditional Predicting Methods aka Regression.
If we take the cost function to be the sum of squared errors, then for linear regression ($\hat{Y}_i = ax_i + b$) it becomes $C = \sum_{i=1}^n{(Y_i - \hat{Y}_i)^2}$ $ = \sum_{i=1}^n{(Y_i - ax_i - b)^2}$.
We minimize $C$ by equating $\frac{\partial{C}}{\partial{a}}$ and $\frac{\partial{C}}{\partial{b}}$ to $0$. From the latter, we get:
$$ \begin{align} 0 & = -\frac{1}{2}\frac{\partial{C}}{\partial{b}} = \sum_{i=1}^n{(Y_i - ax_i - b)} \\
& = \sum_{i=1}^n{Y_i} - a\sum_{i=1}^n{x_i} - \sum_{i=1}^n{b} \\[7pt]
& = \sum_{i=1}^n{Y_i} - a\sum_{i=1}^n{x_i} - nb \\[7pt]
\therefore b & = \frac{\sum_{i=1}^n{Y_i}}{n} - \frac{a\sum_{i=1}^n{x_i}}{n} \\[7pt]
& \boxed{b = \bar{Y} - a\bar{X}}
\end{align} $$
From the former:
$$ \begin{align} 0 & = -\frac{1}{2}\frac{\partial{C}}{\partial{a}} = \sum_{i=1}^n{x_i(Y_i - ax_i - b)} \\
& = \sum_{i=1}^n{(x_iY_i - ax_i^2 - bx_i)} \\[7pt]
& = \sum_{i=1}^n{(x_iY_i - ax_i^2 - (\bar{Y} - a\bar{X})x_i)} \\[7pt]
& = \sum_{i=1}^n{(x_iY_i - \bar{Y}x_i)} - a\sum_{i=1}^n{(x_i^2 - \bar{X}x_i)} \\[7pt]
\therefore \enspace & \boxed{a = \frac{\sum_{i=1}^n{(x_iY_i - \bar{Y}x_i)}}{\sum_{i=1}^n{(x_i^2 - \bar{X}x_i)}}}
\end{align} $$
So, here we have analytical formulae for $a$ and $b$ which we can evaluate once we have the examples. No further adjustment is needed, simply because it was possible to derive such formulae for $a$ and $b$.
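A short sketch (with made-up numbers) evaluating the boxed formulae directly:

```python
import numpy as np

# A made-up training set
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.4, 10.1])

x_bar, Y_bar = x.mean(), Y.mean()

# Boxed formulae from the derivation above
a = np.sum(x * Y - Y_bar * x) / np.sum(x**2 - x_bar * x)
b = Y_bar - a * x_bar

print(a, b)   # slope and intercept of the least-squares line
```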
Supplement to section C1.7 The Matrix of Globally Interconnected (Linear) Functions.
The paper titled Multilayer Feedforward Networks are Universal Approximators by K. Hornik et al. shows that
...multilayer feedforward networks with as few as one hidden layer using arbitrary squashing functions are capable of approximating any... function...
Another one is a paper by G. Cybenko that proves the above for sigmoidal functions - Approximation by Superpositions of a Sigmoidal Function. I'm including it below for quick reference.
Return to Neural Networks - What's Happening? An Intuitive Introduction to Machine Learning