Unraveling the Concept of Variability

Sandeep Bansal
Published in Analytics Vidhya
3 min read · Dec 25, 2021

When I began learning data science, one of the first things people emphasized was the importance of describing the center (mean and median) of a data set. Although reporting the center is crucial to data analysis, it’s equally important to remember that the mean and median do NOT describe the spread of a data set. If that’s the case, which metrics do?

The most common measures of variability describe the extent to which the sample observations deviate from the sample mean x̄. These measures include:

  1. Range
  2. Deviation / Squared Deviation
  3. Standard Deviation (s for a sample, σ for a population)
  4. Variance (s² for a sample, σ² for a population)

Range

The range, often referred to as the simplest numerical measure of variability, is defined as the difference between the largest and smallest observations. Although the range is the simplest measure, it is not an ideal way of describing the dispersion of a data set. Remember: measures of variability describe the spread of observations around the sample mean (x̄), and to quantify that spread accurately, every observation must be used, not just the largest and smallest values.
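To make that limitation concrete, here is a minimal Python sketch. The two data sets are hypothetical values made up for illustration (they are not from this article), and `value_range` is just a helper name chosen here. Both sets share the same smallest and largest observation, so the range cannot tell them apart even though one is tightly clustered and the other is not.

```python
# Hypothetical data sets (not from the article): same min and max,
# but very different spread in between.
tight = [10, 49, 50, 50, 51, 90]
wide = [10, 20, 35, 65, 80, 90]


def value_range(observations):
    """Range = largest observation minus smallest observation."""
    return max(observations) - min(observations)


range_tight = value_range(tight)
range_wide = value_range(wide)

# Both ranges are identical because only two observations are ever used.
print(range_tight, range_wide)
```

Because only the two extreme values enter the calculation, everything that happens between them is invisible to the range.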

Deviation

Deviation is a metric that many people overlook when studying statistics for data science. Instead of memorizing the term “standard deviation” as a whole, it is much easier to take a step back and understand each of the two words in isolation before putting them together.

Remember: measures of variability describe the spread of observations around the sample mean (x̄). Unlike the range, the deviation metric uses every observation by calculating the signed distance of each data point from the sample mean (xᵢ − x̄). Let’s do a quick example:

Determine the variability by calculating the deviation metric:

First we need to determine the sample mean:

x̄ = 105

Now we can calculate the deviation for each observation:

Results:

Notice the negative answers.
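The same steps can be sketched in a few lines of Python. The observations below are hypothetical: they are illustrative values chosen only so that their mean comes out to 105, matching x̄ above, and they are not the original observations from this example.

```python
# Hypothetical observations (NOT the article's original data);
# chosen so that the sample mean equals 105.
observations = [100, 102, 105, 108, 110]

# Step 1: sample mean = sum of observations / number of observations.
sample_mean = sum(observations) / len(observations)

# Step 2: deviation of each observation from the sample mean (x_i - x̄).
deviations = [x - sample_mean for x in observations]

print(f"sample mean = {sample_mean}")
for x, d in zip(observations, deviations):
    print(f"{x:>5}  {d:>+6.1f}")
```

Observations below the mean produce negative deviations and observations above it produce positive ones.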

Does our answer make sense?

According to the math, it makes sense. The process was simple: first we calculated the sample mean by summing the observations and dividing by the number of observations. We then subtracted the sample mean from each observation, and some of the results came out negative. Two questions need to be addressed at this point:

1. How do we combine the data into a single numerical measurement to present to stakeholders?

2. How do we deal with the negative signs?

You’re probably thinking: can’t we just calculate the average of the deviation column? Unfortunately not. Averaging would give us a single numerical measurement, which addresses the first question, but it does not remove the negative signs. Worse, the positive and negative deviations always cancel each other out, so the deviations sum to zero and their average is always 0, no matter how spread out the data is. Therefore we must do the next best thing: square each deviation and take the sum.
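Here is a quick check of why averaging the deviations fails and what squaring buys us, as a minimal Python sketch. The observations are hypothetical values picked for illustration (mean of 105), not data from this article.

```python
# Hypothetical observations (not from the article), mean = 105.
observations = [100, 102, 105, 108, 110]

sample_mean = sum(observations) / len(observations)
deviations = [x - sample_mean for x in observations]

# Averaging the raw deviations: negatives and positives cancel out.
mean_deviation = sum(deviations) / len(deviations)

# Squaring makes every term non-negative, so nothing cancels.
squared_deviations = [d ** 2 for d in deviations]
sum_of_squares = sum(squared_deviations)

print("mean of deviations:", mean_deviation)
print("squared deviations:", squared_deviations)
print("sum of squared deviations:", sum_of_squares)
```

The sum of squared deviations is a single number, and every term in it is zero or positive, which takes care of both questions at once.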

Well, that answers both of our questions, but what’s the point? Take a look at the formula for the sum of squared deviations. Now compare it to the sample variance formula. See anything similar? How exciting! It’s slowly coming together!
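For reference, here are the two formulas in their standard textbook form, for a sample of n observations x₁, …, xₙ with sample mean x̄. The label SS for the sum of squared deviations is shorthand introduced here, not notation from this article.

```latex
% Sum of squared deviations
SS = \sum_{i=1}^{n} (x_i - \bar{x})^2

% Sample variance
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{SS}{n-1}
```

The numerator of the sample variance is exactly the sum of squared deviations.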

If you don’t see the similarities between the two, don’t feel bad. Feel free to drop a comment and I’ll be happy to help!

Thanks for reading! The next article will cover variance and standard deviation.
