Statistics for Programmers - Measures of Dispersion
Measures of central tendency like the Mean, Median and Mode summarize the data into a single value. However, they also lose a lot of information. For example, if I tell you that the average salary of a given sample of workers is $100,000, you might think that most workers make around $100,000. But in reality, the salary distribution might look like this:
// Salaries in $
[200000, 10000, 100000, 180000, 10000]
The mean is $100,000, but two of the five workers make only $10,000. This is why we need measures of dispersion. They tell us how spread out the data is. The most common measures of dispersion are the Range, Variance and Standard Deviation.
Range
The range is the simplest measure of dispersion. It measures the spread of the values in a dataset and is calculated as the difference between the maximum and minimum values in the set. Mathematically, the range (\( R \)) is expressed as:
\[ R = \text{Max} - \text{Min} \]
Where:
- \( R \) is the range,
- \(\text{Max}\) is the maximum value in the set, and
- \(\text{Min}\) is the minimum value in the set.
Applying this formula to the set of numbers above, we get:
200000 - 10000 = 190000
Representing this in code is very simple:
const arr = [200000, 10000, 100000, 180000, 10000];
const max = Math.max(...arr);
const min = Math.min(...arr);
const range = max - min;
console.log(range);
// 190000
The range is very easy to calculate, but it is also very sensitive to outliers because it only depends on the largest and smallest values in the dataset.
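To make that sensitivity concrete, here is a small sketch using two hypothetical datasets made up for illustration. A single extreme value changes the range dramatically even though the rest of the data is unchanged:
const range = (arr) => Math.max(...arr) - Math.min(...arr);
// Values clustered between 10 and 20
console.log(range([10, 12, 13, 14, 15, 15, 16, 18, 20]));
// 10
// The same values plus a single outlier
console.log(range([10, 12, 13, 14, 15, 15, 16, 18, 20, 500]));
// 490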
Variance
The variance is a measure of how far each value in a dataset is from the mean. It is calculated by taking the average of squared deviations from the mean.
There are two types of variance: population variance and sample variance. The population variance is used when the dataset includes every member of the population; the sample variance, covered below, is used when the dataset is only a sample drawn from a larger population.
Mathematically, the population variance (\( \sigma^2 \)) (pronounced sigma squared) is expressed as:
\[ \sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \mu)^2} {n} \]
Where:
- \( \sigma^2 \) is the variance,
- \( \sum_{i=1}^{n}(x_i - \mu)^2 \) is the sum of the squared differences between each value \( x_i \) in the dataset and the population mean \( \mu \) (pronounced mu or mew),
- \( n \) is the number of values in the dataset.
The variance is calculated in two steps. First, we calculate the difference between each value and the mean. Then we square the differences and take the average of all the squared differences. Let's see how this works with an example. Considering our salaries dataset:
[200000, 10000, 100000, 180000, 10000]
The mean is \( \mu = 100000 \). The differences between each value and the mean are:
[100000, -90000, 0, 80000, -90000]
Squaring the differences gives us:
[10000000000, 8100000000, 0, 6400000000, 8100000000]
Finally, we sum the squared differences and divide by the number of values in the dataset (effectively taking the average):
(10000000000 + 8100000000 + 0 + 6400000000 + 8100000000) / 5 = 6520000000
Representing this in code is very simple. We also define two small mean and sum helpers that the rest of the examples reuse:
const sum = (arr) => arr.reduce((acc, x) => acc + x, 0);
const mean = (arr) => sum(arr) / arr.length;

function populationVariance(population) {
  const populationMean = mean(population);
  const differences = population.map((x) => x - populationMean);
  const squaredDifferences = differences.map((x) => x * x);
  const sumOfSquaredDifferences = sum(squaredDifferences);
  return sumOfSquaredDifferences / population.length;
}
const arr = [200000, 10000, 100000, 180000, 10000];
console.log(populationVariance(arr));
// 6520000000
If you have only a sample of the population, you should use the sample variance instead. The sample variance is calculated in the same way as the population variance, except that we divide by \( n - 1 \) instead of \( n \). This adjustment, known as Bessel's correction, compensates for the fact that deviations measured from the sample mean tend to underestimate the spread of the full population.
Mathematically, the sample variance (\( s^2 \)) is expressed as:
\[ s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2} {n - 1} \]
Where:
- \( s^2 \) is the sample variance,
- \( \sum_{i=1}^{n}(x_i - \bar{x})^2 \) is the sum of the squared differences between each value \( x_i \) in the dataset and the sample mean \( \bar{x} \) (pronounced x-bar),
- \( n \) is the number of values in the dataset.
To see it in action, we only need to change the denominator in our previous calculation:
(10000000000 + 8100000000 + 0 + 6400000000 + 8100000000) / (5 - 1) = 8150000000
Updating our code is just as simple:
function sampleVariance(arr) {
  const meanOfArr = mean(arr);
  const differences = arr.map((x) => x - meanOfArr);
  const squaredDifferences = differences.map((x) => x * x);
  const sumOfSquaredDifferences = sum(squaredDifferences);
  return sumOfSquaredDifferences / (arr.length - 1);
}
const arr = [200000, 10000, 100000, 180000, 10000];
console.log(sampleVariance(arr));
// 8150000000
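The \( n - 1 \) denominator can look arbitrary, so here is a minimal simulation sketch of why it helps, reusing the two functions above. (The population size, sample size, and trial count are arbitrary values chosen for illustration.)
// Build an artificial population of 1,000 uniform random values in [0, 100)
const population = Array.from({ length: 1000 }, () => Math.random() * 100);
const trueVariance = populationVariance(population);

const trials = 10000;
const sampleSize = 5;
let dividedByN = 0;
let dividedByNMinusOne = 0;

for (let i = 0; i < trials; i++) {
  // Draw a random sample (with replacement, for simplicity)
  const sample = Array.from(
    { length: sampleSize },
    () => population[Math.floor(Math.random() * population.length)]
  );
  dividedByN += populationVariance(sample); // divides by n
  dividedByNMinusOne += sampleVariance(sample); // divides by n - 1
}

console.log(trueVariance); // roughly 833 for this population
console.log(dividedByN / trials); // consistently too low, by a factor of about (n - 1) / n
console.log(dividedByNMinusOne / trials); // much closer to trueVariance
On average, dividing by \( n \) underestimates the true variance because the sample mean is always the value that minimizes the squared differences within the sample; dividing by \( n - 1 \) corrects for this.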
Standard Deviation
The variance is a very useful measure of dispersion, but it is not very intuitive because it is expressed in the square of the original unit. (Looking back at our salary example, the variance is expressed in squared dollars.)
The standard deviation addresses this by expressing dispersion in the same units as the original data. It is simply the square root of the variance.
Mathematically, the population and sample standard deviations \( \sigma \) and \( s \) are expressed as:
\[ \sigma = \sqrt{\sigma^2} \]
and
\[ s = \sqrt{s^2} \]
Where:
- \( \sigma \) is the population standard deviation, and
- \( \sigma^2 \) is the population variance.
- \( s \) is the sample standard deviation, and
- \( s^2 \) is the sample variance.
Applying this formula to the population variance from our previous example, we get:
\[ \sqrt{6520000000} \approx 80746.52 \]
Representing this in code is very simple:
function populationStandardDeviation(arr) {
  return Math.sqrt(populationVariance(arr));
}

function sampleStandardDeviation(arr) {
  return Math.sqrt(sampleVariance(arr));
}
const arr = [200000, 10000, 100000, 180000, 10000];
console.log(populationStandardDeviation(arr));
// 80746.5169527454
console.log(sampleStandardDeviation(arr));
// 90277.35042633895
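Because the standard deviation shares the data's units, it gives us a natural yardstick for individual values. As a quick sketch (reusing the mean helper and arr from earlier), we can express how far each salary sits from the mean, measured in standard deviations:
const m = mean(arr);
const sd = populationStandardDeviation(arr);

arr.forEach((salary) => {
  const distance = ((salary - m) / sd).toFixed(2);
  console.log(`${salary}: ${distance} standard deviations from the mean`);
});
// 200000: 1.24 standard deviations from the mean
// 10000: -1.11 standard deviations from the mean
// 100000: 0.00 standard deviations from the mean
// 180000: 0.99 standard deviations from the mean
// 10000: -1.11 standard deviations from the mean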