Statistics for Programmers - Measures of Dispersion
Measures of central tendency like the Mean, Median and Mode summarize the data into a single value. However, they also lose a lot of information. For example, if I tell you that the average salary of a given sample of workers is $100,000, you might think that most workers make around $100,000. But in reality, the salary distribution might look like this:
// Salaries in $
[200000, 10000, 100000, 180000, 10000]
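For reference, here is a quick JavaScript sketch that computes the mean of this sample (the variable names are just for illustration):
const salaries = [200000, 10000, 100000, 180000, 10000];
// Sum all values and divide by the count to get the mean
const mean = salaries.reduce((sum, x) => sum + x, 0) / salaries.length;
console.log(mean);
// 100000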
The mean is $100,000, yet two of the five workers make only $10,000. This is why we need measures of dispersion. They tell us how spread out the data is. The most common measures of dispersion are the Range, Variance and Standard Deviation.
Range
The range is the simplest measure of dispersion. It is the difference between the largest and smallest values in a dataset. Mathematically, the range (R) is expressed as:
\[ R = \text{Max} - \text{Min} \]
Where:
- \( R \) is the range,
- \(\text{Max}\) is the maximum value in the set, and
- \(\text{Min}\) is the minimum value in the set.
Applying this formula to the set of numbers above, we get:
200000 - 10000 = 190000
Representing this in code is very simple:
const arr = [200000, 10000, 100000, 180000, 10000];
// Find the largest and smallest values in the dataset
const max = Math.max(...arr);
const min = Math.min(...arr);
// The range is the difference between them
const range = max - min;
console.log(range);
// 190000
The range is very easy to calculate, but it is also very sensitive to outliers because it only depends on the largest and smallest values in the dataset.
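To illustrate this sensitivity, here is a small sketch with a made-up dataset in which a single outlier dominates the range:
// Most values are tightly clustered, but a single outlier stretches the range
const withOutlier = [100, 102, 98, 101, 99, 10000];
console.log(Math.max(...withOutlier) - Math.min(...withOutlier));
// 9902

// Removing the outlier shrinks the range dramatically
const withoutOutlier = [100, 102, 98, 101, 99];
console.log(Math.max(...withoutOutlier) - Math.min(...withoutOutlier));
// 4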
Variance
The variance is a measure of how far each value in a dataset is from the mean. It is calculated by taking the average of squared deviations from the mean.
There are two types of variance: population variance and sample variance. The population variance is used when the dataset includes every member of the population, while the sample variance is used when the dataset is only a sample drawn from a larger population.
Mathematically, the population variance (\( \sigma^2 \)) (pronounced sigma squared) is expressed as:
\[ \sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \mu)^2} {n} \]
Where:
- \( \sigma^2 \) is the variance,
- \( \sum_{i=1}^{n}(x_i - \mu)^2 \) is the sum of the squared differences between each value \( x_i \) in the dataset and the population mean \( \mu \) (pronounced mu or mew),
- \( n \) is the number of values in the dataset.
The variance is calculated in two steps. First, we calculate the difference between each value and the mean. Then we square the differences and take the average of all the squared differences. Let's see how this works with an example. Considering our salaries dataset:
[200000, 10000, 100000, 180000, 10000]
The mean is \( \mu = 100000 \). The differences between each value and the mean are:
[100000, -90000, 0, 80000, -90000]
Squaring the differences gives us:
[10000000000, 8100000000, 0, 6400000000, 8100000000]
Finally, we sum the squared differences and divide by the number of values in the dataset (effectively taking the average):
(10000000000 + 8100000000 + 0 + 6400000000 + 8100000000) / 5 = 6520000000
Representing this in code is very simple:
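// One possible sketch of the steps described above
const arr = [200000, 10000, 100000, 180000, 10000];
// Step 1: compute the mean
const mean = arr.reduce((sum, x) => sum + x, 0) / arr.length;
// Step 2: square the difference between each value and the mean
const squaredDiffs = arr.map((x) => (x - mean) ** 2);
// Step 3: average the squared differences to get the population variance
const variance = squaredDiffs.reduce((sum, x) => sum + x, 0) / arr.length;
console.log(variance);
// 6520000000
Note that this divides by \( n \), matching the population variance formula above.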