Statistics for Programmers - Frequency Distributions

A Frequency Distribution is a common way to understand a trend in a dataset. It's a tabular representation of the number of times each value appears in a dataset. If we denote the unique values in a dataset as \(x_1, x_2, \ldots, x_n\), their corresponding frequencies can be denoted as \(f_1, f_2, \ldots, f_n\). This relationship can be expressed as a table.

\[ \begin{array}{|c|c|} \hline \text{Value (}x\text{)} & \text{Frequency (}f\text{)} \\ \hline x_1 & f_1 \\ x_2 & f_2 \\ \vdots & \vdots \\ x_n & f_n \\ \hline \end{array} \]

Applying this practically, let's consider a dataset of 10 users who were asked to rate a product on a scale of 1 to 5. The dataset can be represented as an array of reviews.

[3, 1, 5, 5, 2, 4, 5, 3, 1, 5]

We can construct a frequency distribution table for this dataset by counting the number of times each unique element appears in the array.

Value (x) | Frequency (f)
-------------------------
1         | 2
2         | 1
3         | 2
4         | 1
5         | 4

This can be expressed in code using a Map (or Dictionary, depending on your language of choice) that associates each unique value with the number of times it appears in the dataset.

Once again considering our array of reviews,

const arr = [3, 1, 5, 5, 2, 4, 5, 3, 1, 5];

We can construct a function that counts the number of times each unique element appears in the array.

function frequencyDistribution(arr) {
  const map = {};
  for (let i = 0; i < arr.length; i++) {
    const item = arr[i];
    // Increment the count if we've seen this value before, otherwise start at 1
    if (map[item]) {
      map[item] += 1;
    } else {
      map[item] = 1;
    }
  }
  return map;
}

Applying this function to our dataset gives us the following output,

console.log(frequencyDistribution(arr));

// { '1': 2, '2': 1, '3': 2, '4': 1, '5': 4 }

The distribution map makes it clear that 5 has the highest frequency, appearing 4 times in the array, while 2 and 4 have the lowest, appearing only once.
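
If you'd rather use JavaScript's built-in Map (mentioned above) instead of a plain object, the same counting logic can be sketched with reduce. The name frequencyDistributionMap is just an illustrative choice for this alternative, not part of the functions used in the rest of the post.

function frequencyDistributionMap(arr) {
  // A Map preserves the numeric type of the keys instead of coercing them to strings
  return arr.reduce(
    (map, item) => map.set(item, (map.get(item) || 0) + 1),
    new Map()
  );
}

console.log(frequencyDistributionMap(arr));

// Map(5) { 3 => 2, 1 => 2, 5 => 4, 2 => 1, 4 => 1 }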

Relative Frequency Distribution

Now that we know the frequency distribution of our data, we might want a measure of how often each value appears relative to the size of the dataset. We can accomplish this by dividing the frequency of each unique value by the total number of observations.

\[ \text{Relative Frequency (} rf \text{)} = \frac{\text{Frequency (} f \text{)}}{\text{Total number of observations in the dataset}} \]

Considering our dataset of reviews, the relative frequency of a 5-star rating would be

\[ \text{Relative Frequency (5 star)} = \frac{4}{10} = 0.4 \]

We can express this as code by building on the frequencyDistribution function we wrote earlier.

function relativeFrequencyDistribution(values) {
  const distribution = frequencyDistribution(values);
  // Divide each frequency by the total number of observations
  for (const prop in distribution) {
    distribution[prop] /= values.length;
  }
  return distribution;
}

Invoking this on our dataset gives us the following output,

console.log(relativeFrequencyDistribution(arr));

// { '1': 0.2, '2': 0.1, '3': 0.2, '4': 0.1, '5': 0.4 }

Since the sum of all relative frequencies always equals 1, multiplying each value by 100 effectively gives us percentages. Our output shows that 1-star ratings account for 20% of the dataset (0.2), while 5-star ratings account for 40% (0.4).
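
As a quick illustration of that point, we could build on relativeFrequencyDistribution to express each value as a percentage. The helper name toPercentages is just an illustrative choice here.

function toPercentages(values) {
  const relative = relativeFrequencyDistribution(values);
  const percentages = {};
  for (const prop in relative) {
    // Multiply by 100 and round to sidestep floating-point artifacts
    percentages[prop] = Math.round(relative[prop] * 100);
  }
  return percentages;
}

console.log(toPercentages(arr));

// { '1': 20, '2': 10, '3': 20, '4': 10, '5': 40 }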

Cumulative Frequency Distribution

A cumulative frequency distribution is the sum of frequencies up to and including a given value. This assumes that the dataset is quantitative or ordinal and, as a result, has a natural sort order. For example, considering our array of ratings, we can assume that a 2-star rating ranks higher than a 1-star rating. Therefore the cumulative frequency of a 2-star rating would be

\( \text{Cumulative Frequency (2 star)} = \text{Frequency (1 star)} + \text{Frequency (2 star)} \)

In other words, we could generalize this as,

\[ \text{Cumulative Frequency (} cf_i \text{)} = f_1 + f_2 + \ldots + f_i \]

We can express this as code by keeping a running total as we walk through the frequency distribution.

function cumulativeFrequencyDistribution(values) {
  const distribution = frequencyDistribution(values);
  let cumulativeValue = 0;
  // Integer-like object keys are iterated in ascending numeric order in JavaScript,
  // so the running total accumulates in the natural sort order of the values
  for (const prop in distribution) {
    cumulativeValue += distribution[prop];
    distribution[prop] = cumulativeValue;
  }
  return distribution;
}

Running this against our data gives us the following result,

console.log(cumulativeFrequencyDistribution(arr));

// { '1': 2, '2': 3, '3': 5, '4': 6, '5': 10 }
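
One way to read this output is that each entry answers the question "how many observations are at or below this value?". For example,

const cumulative = cumulativeFrequencyDistribution(arr);

// 5 of the 10 users rated the product 3 stars or lower
console.log(cumulative['3']); // 5

// The final entry always equals the total number of observations
console.log(cumulative['5'] === arr.length); // true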

Classes and Intervals

The frequency distributions we've looked at so far work well for small datasets. However, in the real world, we'll likely be working with large datasets that have hundreds, if not thousands, of unique values. As a result, it's often helpful to construct classes as a way to group the data and keep the presentation compact. Each class defines an upper and lower bound, and values that fall within those bounds contribute to that class's frequency.

We can create classes for our dataset by defining upper and lower bound values that correspond to a sentiment.

const intervals = {
  negative: {
    lower: 1,
    upper: 2
  },
  neutral: {
    lower: 3,
    upper: 3
  },
  positive: {
    lower: 4,
    upper: 5
  }
};

We can then capture the frequencies of the values and decide which class each one belongs to.

function frequencyDistributionWithInterval(values, intervals) {
  const distribution = frequencyDistribution(values);

  // Initialize the map to store the frequency distribution within each interval
  const map = {};

  // Loop through each interval and calculate the frequency within that interval
  for (const [intervalName, { upper, lower }] of Object.entries(intervals)) {
    let frequency = 0;

    // Loop through the frequency distribution and sum up the frequencies within the interval
    for (const [value, valueFrequency] of Object.entries(distribution)) {
      // Object keys are strings, so convert the value back to a number before comparing
      if (Number(value) >= lower && Number(value) <= upper) {
        frequency += valueFrequency;
      }
    }

    // Store the frequency within the interval in the map
    map[intervalName] = frequency;
  }

  return map;
}

Running this function against our data gives us a nice compact view of our data with our class groupings.

console.log(frequencyDistributionWithInterval(arr, intervals))

// { negative: 3, neutral: 2, positive: 5 }

While there are no fixed rules for how to define your classes, it's important to note that they can influence how your data is interpreted. So here are a few guidelines that can help define good intervals:

  • have between 5 and 15 classes - Having too few classes may oversimplify the data, while too many classes can obscure meaningful patterns and insights.

  • have equal width - This means that the difference between the upper and lower bounds should be the same across all class intervals. This facilitates accurate comparisons and interpretations across the dataset (see the sketch after this list).

  • be mutually exclusive - Each data point should be placed in only one class interval, without overlapping between adjacent intervals. This ensures that data is accurately and uniquely represented within the distribution.

  • be exhaustive - The class intervals must encompass the entire range of data points. No data point should fall outside the defined intervals, ensuring that all observations are accounted for.
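
As a rough sketch of the equal-width guideline, here's one way to generate class intervals of the same width for a numeric dataset. The function name makeEqualWidthIntervals and its labelling scheme are illustrative assumptions rather than part of the functions above.

function makeEqualWidthIntervals(values, numberOfClasses) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  // Width of each class so that numberOfClasses intervals cover the full range
  const width = (max - min) / numberOfClasses;

  const intervals = {};
  for (let i = 0; i < numberOfClasses; i++) {
    const lower = min + i * width;
    // Make the last interval end exactly at the maximum so the classes are exhaustive
    const upper = i === numberOfClasses - 1 ? max : lower + width;
    intervals[`${lower} - ${upper}`] = { lower, upper };
  }
  return intervals;
}

console.log(makeEqualWidthIntervals(arr, 2));

// { '1 - 3': { lower: 1, upper: 3 }, '3 - 5': { lower: 3, upper: 5 } }

In practice you would typically treat each interval as half-open (lower bound inclusive, upper bound exclusive, except for the last interval) so that adjacent classes stay mutually exclusive.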

Frequency distributions are a great way to summarize and extract useful information from data at a glance. While they are useful on their own, they truly shine when paired with graphing techniques to visualize the results.
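
As a very rough illustration of that last point, a frequency distribution can be turned into a quick text-based histogram right in the console. The renderHistogram helper below is just a sketch, not a substitute for a proper charting library.

function renderHistogram(values) {
  const distribution = frequencyDistribution(values);
  for (const [value, frequency] of Object.entries(distribution)) {
    // One '*' per occurrence of the value
    console.log(`${value} | ${'*'.repeat(frequency)}`);
  }
}

renderHistogram(arr);

// 1 | **
// 2 | *
// 3 | **
// 4 | *
// 5 | ****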
