"Whenever you can, share. You never know who all will be able to see far away standing upon your shoulders!"
I write mainly on topics related to science and technology.
Sometimes, I create tools and animation.
July 14, 2020
Author - manisar
Welcome to one of the simplest yet most intriguing and still-ongoing debates in the world of statistics.
On a cursory look, MAD seems perfect – we want to know, on average, how far each of the numbers in a set of observations is from their mean (M), and MAD tells us exactly that. Then what is the problem? Why don't we use MAD everywhere instead of σ?
Before moving further, if you wanted to revisit the formulae, here they are:
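In the (population) form used throughout this article, for \(n\) observations \(x_1, x_2, \dots, x_n\) with mean \(M\):

\[
M = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\text{MAD} = \frac{1}{n}\sum_{i=1}^{n} \left|x_i - M\right|, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(x_i - M\right)^2}
\]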
Well, one reason for the widespread use of σ is that it is algebraically easier to work with: the square function – \(value^2\) – fits in everywhere more easily than the absolute-value function – \(|value|\) – and it is smoothly differentiable. But, just as my wife, whenever I give her a reason for loving her, comes back with "just for this one reason?", one is tempted to ask the same question here – "only because of this reason?". No, this shouldn't be the deciding reason.
Let’s talk numbers.
The two sets A = {1, 1, 7} and B = {0, 2, 7} show very beautifully the significance of standard deviation.
The video below shows the two sets. We can clearly see that as {1, 1, 7} transitions to {0, 2, 7}, while the mean and MAD remain the same, σ increases, and this rightly reflects the difference in the spatial arrangement of the two sets – {0, 2, 7} is indeed more widespread than {1, 1, 7}.
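To keep ourselves honest with the numbers, here is a minimal Python sketch (not part of the video, just a quick check) that computes the mean, MAD and population σ for the two sets:

```python
from statistics import mean, pstdev

def mad(xs):
    """Mean absolute deviation from the mean."""
    m = mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

for label, xs in (("A", [1, 1, 7]), ("B", [0, 2, 7])):
    print(label, "mean =", mean(xs),
          " MAD =", round(mad(xs), 3),
          " sigma =", round(pstdev(xs), 3))

# A mean = 3  MAD = 2.667  sigma = 2.828
# B mean = 3  MAD = 2.667  sigma = 2.944
```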
So, we see that within a set, its numbers (or some of them) can conspire to produce the same MAD for the set (or the same contribution to MAD) while rearranging themselves in different ways – close together, or far apart. This conspiracy is revealed by standard deviation!
σ gives us an idea about the arrangement of the numbers within the set – at a cost, though: we do not get the true average of deviations (what we get is a biased average) – but this benefit, along with the fact that σ is smoothly differentiable, overrides the shortcoming!
Read on to go deeper and deeper...
So, how does σ manage to depict the spatial arrangement of a set in a way that MAD cannot?
Let's start to look at things from a geometrical point of view.
We need to first understand the definition of the word – far. Let’s elaborate.
For a moment, forget about A and B, and think of these two points in 2D: P(2,2) and Q(3,1).
If I ask – how far is P from the axes, on average – the answer would be \(\frac{2+2}{2} = 2\), and it's the same – \(\frac{3+1}{2} = 2\) – for Q. So, if we move only parallel to the axes, and since there are two axes, to reach either P or Q from, say, the origin O, we'll have to move 2 (average distance per axis) × 2 (number of axes) = 4 units.
But in the real world, in order to reach point P or Q (from O), we don't move along the axes (unless there is some constraint, which comes under taxicab geometry); we go straight from O to P or Q, and that distance is not the same for the two points: \(OP = \sqrt8 \approx 2.83\), and \(OQ = \sqrt{10} \approx 3.16\)!
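A tiny sketch of that comparison in Python (the helper name taxicab is mine, purely illustrative):

```python
from math import hypot

def taxicab(p):
    # distance from the origin moving only parallel to the axes
    return sum(abs(c) for c in p)

P, Q = (2, 2), (3, 1)
print(taxicab(P), taxicab(Q))   # 4 4               -> identical
print(hypot(*P), hypot(*Q))     # 2.828... 3.162... -> different
```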
Both A and B are just sets of numbers, or observations; they are not points in an n-D world. But suppose we visualize each deviation as a distance from the mean, and further, instead of picturing all the deviations of a given set as distances along the same dimension, we let each of them denote the length along a unique dimension in an n-D world, so that each of A and B becomes a distinct point in an n-D world (here 3D). Then we at once see that B is farther away from its mean than A is from its own mean, even though MAD is the same for both A and B. This is what is denoted by standard deviation.
Even if MAD is the same for two sets of observations, by evaluating σ we get to see which set of observations is, as a whole, farther away from its mean – or, in other words, which set has more widespread observations*.
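Here is a short Python sketch of this picture (the helper name dist_from_mean_point is mine): each set becomes a point in 3D, its mean becomes the point (M, M, M), and the straight-line distance between the two is exactly \(\sigma\sqrt{n}\) – larger for B than for A even though MAD is identical.

```python
from math import dist, sqrt
from statistics import mean, pstdev

def dist_from_mean_point(xs):
    # the set as a point in n-D space vs. its mean repeated n times
    m = mean(xs)
    return dist(xs, [m] * len(xs))

for xs in ([1, 1, 7], [0, 2, 7]):
    d = dist_from_mean_point(xs)
    print(xs, round(d, 3), round(pstdev(xs) * sqrt(len(xs)), 3))

# [1, 1, 7] 4.899 4.899
# [0, 2, 7] 5.099 5.099
```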
Mean \(\implies\) around which number the observations are centered. But a set can have its observations quite far from the mean, on average, compared to another set having the same mean. In order to get that information (i.e. the average distance of observations from the mean), we move to MAD.
MAD \(\implies\) how far each observation, on average, is from the mean of all observations, but it doesn't tell us how the observations are arranged in relation to one another. Two sets can have the same MAD, which means that observations in both sets are, on average, equally far from their mean. But there may be more disparity in the deviations (distances from the mean) in one set than in the other. To get that information (i.e. the distance of the set itself from its mean, which depends upon how the observations are arranged in relation to one another), we move to σ.
σ \(\implies\) how far the complete set is from its mean (or, how disparate the individual deviations are which, in turn, depends upon how far the observations are from each other).
*Why is a widespread set, as a whole, farther from its mean? Simply because if two points are far apart, so will their distances be from any reference point (including the mean of the set)! See the section Mean of Squares − Square of Mean = Variance! for an elaborate discussion on this.
And two sets having their individual observations equally distant (on an average) from their respective means (i.e. having the same MAD) can be, as a whole, at different distances from their respective means (i.e. can have different σ’s)!
In the sets A(1,1,7) and B(0,2,7) given above, each observation in A is, on average, as distant from its mean as each observation in B is from its own mean (which happens to be the same as A's). This is shown by MAD being the same for the two sets. But observations in B are farther apart from each other than those in A are, and this fact is indicated by their different σ's.
Thus, we see that by looking at these two statistical concepts (MAD and σ) from a geometrical point of view, we get some insight into the arrangement of observations - not only in relation to the mean, but also in relation to one another.
Note that in telling how far the observations are from each other, σ also tells how far the observations are from their mean (not in the exact sense like MAD, because here we have the (rooted) average of squares instead of the average of the real distances, but it does). So σ includes, more or less, the information provided by MAD, plus it has something extra to tell. Hence, between MAD and σ, it's wiser to choose σ as a uniform indicator of the deviation of observations.
We must clarify σ's meaning in terms of the distance of observations from their mean and from one another – it is actually slightly different from the distance concept explained above, because with σ we average the squared orthogonal distances (rather than adding them straight up). But the basic relationship between MAD and σ remains similar to the one between the distance added up by going parallel to the axes from the mean to the point and the distance registered by going straight from the mean to the point.
The difference is because by squaring a set of numbers, we change their mutual proportions (squaring each number is not the same as multiplying all of them by a constant, which keeps their mutual proportions unchanged), and this change is not cancelled by averaging (i.e. by dividing each squared number by a constant). Hence, σ, only kind of, represents the distance of the (pictured) point from its mean.
One last point – by calculating σ, we actually give outliers (big numbers) more weightage in a set of observations. Check the description in the section Mean of Squares ≥ Square of Mean below.
So, between two given sets, if the observations are arranged such that they are equally far (on average) from their mean, i.e. the MAD is the same for both, the set containing bigger outliers will have a greater σ.
Finally, if you are fanatically inquisitive, you would ask – why stop at squaring? Why not cube, take absolute values, average them out and then take the cube root? Why not go to the fourth power? The higher the power, the more weightage is given to the outliers.
I don't have a fool-proof answer to this, but it's easy to see that by defining yet another deviation – say, a cubic absolute deviation – we would not get any information about the set that we haven't already got by calculating M, MAD and σ. Further, in our real, flat, Pythagorean world, distances are calculated by squaring only (and not by cubing or going to the fourth power and so on). And though we don't really calculate σ in order to find the distance of the (visualized) point represented by the set from its mean (all the explanation above is just a loose analogy), we just stop at squaring.
σ does denote the spread of a set of observations in a loose sense of distance, which does give more weight-age to the outliers. But that happens in the real world’s flat geometry as well – (3,1) is farther from origin as compared to (2,2), despite both being equally distant from the axes – only because (3,1) contains a bigger number. This is a direct side-effect of the Pythagoras’ principle which the Euclidean geometry follows. We just decided to conceptualize the spread in statistics on the same lines as we have distance in geometry, and hence we have squaring. While not absolutely necessary (in my view), this does prove to be beneficial in the cases where the spread is related to actual distances.
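To see the weightage effect concretely, here is a small sketch of such higher-power deviations (the name p_deviation and the formula \(D_p = \big(\frac{1}{n}\sum_i |x_i - M|^p\big)^{1/p}\) are my own illustrative generalization; \(D_1\) is MAD and \(D_2\) is σ). As p grows, \(D_p\) creeps up towards the single largest deviation, i.e. the outliers get more and more of the say:

```python
from statistics import mean

def p_deviation(xs, p):
    """p-th root of the mean of |x - M|**p; p=1 gives MAD, p=2 gives sigma."""
    m = mean(xs)
    return (sum(abs(x - m) ** p for x in xs) / len(xs)) ** (1 / p)

for xs in ([1, 1, 7], [0, 2, 7]):
    print(xs, [round(p_deviation(xs, p), 2) for p in (1, 2, 3, 4)])

# [1, 1, 7] [2.67, 2.83, 2.99, 3.13]
# [0, 2, 7] [2.67, 2.94, 3.13, 3.26]
```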
Remember "absence of evidence is not evidence of absence"? Well, the statement we are tackling below – mean of squares ≥ square of mean – is less confusing. Let's quickly clarify it for ourselves once and for all. (Applied to the set of absolute deviations themselves, it is also the reason why σ (standard deviation) is always \(\ge\) the MAD (mean absolute deviation).)
Let’s consider a set of two numbers S(3, 5). Their M(mean) = 4.
Now, let's inflate – i.e. multiply each number by a constant c, say 10 – so that S' = (30, 50) with M' = 40.
Here we see that M’ = M x c. That is, mean of each-number-times-c is equal to original-mean-times-c.
Then what happens when we square? When we square each number in a set, the numbers are not all inflated by the same factor – each is inflated disproportionately: the bigger the number, the more it is inflated (each number gets multiplied by itself). Hence, the mutual ratios of the numbers change, and the new mean tilts towards the bigger numbers (as compared to the square of the original mean).
Let’s prove it for two numbers a and b.
Mean of squares = \(\frac{a^2+b^2}{2}\)
Square of mean = \(\left(\frac{a}{2}+\frac{b}{2}\right)^2\)
Let’s find out their difference…
\[ \begin{align} &~\small{\text{Mean of squares}} - \small{\text{Square of mean}} \\\\
&~=\frac{a^2+b^2}{2}-\left(\frac{a}{2}+\frac{b}{2}\right)^2\\
&~=\frac{1}{4}\left[2a^2+2b^2-a^2-b^2-2ab\right]\\
&~=\frac{1}{4}\left(a-b\right)^2 \\\\
&~\ge 0
\end{align}\]
Finally, let’s seal the deal with this visualization. If we draw squares corresponding to the two numbers mentioned above and their mean, we get this picture.
Let B denote the square of the original mean (4), and A and C the squares of the two numbers (3 and 5, respectively).
We see clearly that though 4 lies exactly between 3 and 5, \(4^2\) does not lie in the middle of \(3^2\) and \(5^2\).
\(4^2\) is 16, whereas the middle (mean) of \(3^2\) and \(5^2\) is 17! So, we see that the new mean has moved towards the bigger number. As is clear from the given figure as well.
In the figure,
C - B = \(5^2 - 4^2 = 9\), and
B - A = \(4^2 - 3^2 = 7\).
(C - B) is clearly greater than (B - A).
If these were equal, B would have been the mean of A and C, but that is not the case. The bigger number (5) has been inflated so much by squaring that it now sits further above the square of the original mean than the square of the smaller number (3) sits below it.
The difference between MoS and SoM turns out to be nothing else but variance, i.e. the mean of squares of individual deviations.
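A quick numerical check of that claim – a minimal sketch using Python's statistics module (pvariance is the population variance, matching the division by n used throughout):

```python
from statistics import mean, pvariance

for xs in ([3, 5], [1, 1, 7], [0, 2, 7]):
    mos = mean(x * x for x in xs)   # mean of squares
    som = mean(xs) ** 2             # square of mean
    # MoS - SoM and the population variance agree for every set,
    # e.g. 1 for {3, 5} and 8 for {1, 1, 7}.
    print(xs, mos - som, pvariance(xs))
```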
Let's put it more intuitively.
We have seen that when we square each number in a set of observations, their new mean shifts away from the square of the original mean (towards the larger numbers).
Now, this shift, in general, is proportional to how big the bigger numbers are - in relation to the smaller numbers. Here the italicized part is of utmost importance.
For example, for {1,3}, the shift is \( (1^2 + 3^2)/2 - 2^2 = 1\), and for {3,5}, again the shift is \( (3^2 + 5^2)/2 - 4^2 = 1\). Here, even though, {3,5} has bigger numbers compared to {1,3}, both 3 and 5 are bigger than 1 and 3 respectively in such a way that their relative difference is the same - i.e. 5 is only as much bigger than 3 as 3 is bigger than 1. Hence, the shift (of their mean on squaring) is just the same.
This happens because if the smaller numbers are also bigger, the original mean is already bigger (compared to what it would have been had the smaller numbers been smaller). So the original mean in this case, when squared, turns up close to the new mean (and the shift becomes small). Hence, it makes sense to compare sets having the same mean when talking about this shift.
So we see that it's the relative distances between numbers in a set (i.e. how much the set is spread-apart) that matter for the amount of shift, and not where the set (i.e. its mean) lies on the number line. Let's call it the first picture (or first visualization).
But the phenomenon of relative-distances-between-the-numbers-in-a-set has a corollary – a set with its numbers lying farther away from each other (in comparison to another set) will also have this characteristic: some of its numbers will be individually farther away from the mean of the set (compared to how far any of the numbers in the other set are from that set's mean), and some will be closer.
This can be simply understood by considering the fact that the distance between any two points, say \(x_1\) and \(x_2\), in a set can be broken into two by taking a reference point and looking at the difference of the distances of \(x_1\) and \(x_2\) from this reference point. This reference point can be common for all points in the set and can conveniently be the mean – from which the deviation of \(x_1\) and \(x_2\) is calculated anyway! So, if \(x_1\) and \(x_2\) are far apart, so will their deviations be (from any third point, including the mean of the set). This can also be understood by considering the fact that, algebraically, the set of deviations (from any point, not just the mean) is just the original set translated by a certain amount.
For example, in these two sets that have the same mean and the same MAD – {1,1,7} and {0,2,7} – it's the latter which is more spread apart, and the distances of their members from their mean (= 3) are {2,2,4} and {3,1,4} respectively. We see that the latter has deviations that are more disproportionate (ignore the 4 in each and compare {2,2} with {3,1}).
In other words, between two sets that have the same mean and even the same MAD, the set that is more spread-apart will have its individual deviations more disparate. But we have already seen that the more spread-apart set gets more shift in its mean (on being squared). So, now we are starting to see how individual deviations (which are related to Variance) can be responsible for the shift. Let's call it the second picture.
This means that for two sets having the same mean and the same MAD (average distance of each number from the mean), it's the set that has outliers and inliers – observations that are individually farther from, and nearer to, the mean than any observation in the other set – that will have the bigger shift in its mean on being squared. Note that more outliers automatically mean more inliers as well (in order to preserve the values of the mean and MAD).
The numbers arrange themselves beautifully, and this shift turns out to be proportional to how much each number was originally away from the original mean (and not just the outliers or inliers). It's actually the bigger numbers that contribute to this shift, but the amount by which they contribute depends upon the smaller numbers (the spread of the set – the first picture); or, in other words, the presence of smaller numbers makes the deviations (from the mean) more disparate (the second picture). Hence all numbers come into play democratically!
It turns out that the relation is as simple as it could be - the shift in mean on squaring a set (MoS - SoM) is simply equal to the average (mean) of the squares of how much each number was originally shifted away from their (original) mean - which is just the Variance!
So, on squaring two sets of numbers having the same mean, the set in which some of the numbers are farther away from their mean (and, as a result, some other numbers closer to the mean) – not on average, but individually, i.e. the set having more disparate deviations (or, in other words, the set in which the numbers are farther apart from each other) – will have its new mean shifted away by a larger amount.
And, for each number, its distance from the original mean (its deviation) contributes to this shift by an amount equal to the square of this distance – not in full, but deflated by the number of observations (equivalent to taking the mean).
The above is just verbosity for:
\[ \text{Mean of Squares} - \text{Square of Mean} = \text{Variance!} \]
Check Wikipedia for a formal proof of this equivalence.
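Or, if you'd rather see it right here, a sketch of the general derivation takes only a few lines (using the same population definitions as above, with \(\frac{1}{n}\sum_{i} x_i = M\)):

\[ \begin{align}
\text{Variance} &= \frac{1}{n}\sum_{i=1}^{n}\left(x_i - M\right)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n} x_i^2 \;-\; 2M\cdot\frac{1}{n}\sum_{i=1}^{n} x_i \;+\; M^2 \\
&= \frac{1}{n}\sum_{i=1}^{n} x_i^2 - 2M^2 + M^2 \\
&= \frac{1}{n}\sum_{i=1}^{n} x_i^2 - M^2 \\
&= \text{Mean of Squares} - \text{Square of Mean}
\end{align}\]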