One of the first things I learned when taking my first statistical programming course was how to calculate summary statistics.
Summary statistics are used to summarize the distribution of data. Depending on the type of data, the statistics used can vary, but examples include the average, the median, and the coefficient of variation, to name a few.
Another example is the 5-number summary that makes up a boxplot:
- Minimum: the smallest number in your data set
- The first quartile: 25% of the data is below this
- The second quartile (the median): 50% of the data is below this
- The third quartile: 75% of the data is below this
- Maximum: the largest number in your data set
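As a rough illustration, the five values above can be computed with Python's standard library. This is a minimal sketch: the function name `five_number_summary` and the sample values are made up for the example, and the "inclusive" quartile method is only one of several conventions, so other tools may report slightly different quartiles for the same data.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a sequence of numbers."""
    # method="inclusive" treats the data as the full population, so the
    # quartiles always fall between the observed minimum and maximum.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Hypothetical sample, not taken from any real data set.
sample = [7, 1, 9, 3, 5, 2, 8, 4, 6]
print(five_number_summary(sample))
```

These are the same five values a boxplot draws, although many plotting libraries also mark outliers separately.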
While summarizing the distribution can help us process large amounts of data, the exploration of the data shouldn’t stop there.
To illustrate why, we’ll be using Anscombe’s Quartet.
Anscombe’s Quartet – with identical summary statistics
What is Anscombe’s Quartet? It comprises four data sets with nearly *identical* summary statistics: the means, variances, and correlations of x and y all match closely.
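One way to check this claim yourself is to compute the statistics directly. The sketch below uses the values from Anscombe's 1973 paper, transcribed here rather than taken from this post, and the helper names `pearson_r` and `summarize` are invented for the example. It uses only the standard library.

```python
import math
import statistics

# Sets I, II, and III share the same x values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

ANSCOMBE = {
    "I": (X_SHARED,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


def summarize(xs, ys):
    """Collect the summary statistics that the four sets have in common."""
    return {
        "mean_x": statistics.mean(xs),
        "var_x": statistics.variance(xs),  # sample variance
        "mean_y": statistics.mean(ys),
        "var_y": statistics.variance(ys),
        "r": pearson_r(xs, ys),
    }


for name, (xs, ys) in ANSCOMBE.items():
    stats = summarize(xs, ys)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The four printed rows should look almost interchangeable, which is exactly why the numbers alone are not enough to tell the data sets apart.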
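But when the data sets are graphed, their distributions look very different: the first is roughly linear with some scatter, the second follows a curve, the third is a tight line with a single outlier, and the fourth has every point but one at the same x value.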
So what should you do?
As a visual person, I try to explore my data in different views by making visualizations. Examples include scatterplots, histograms, heatmaps, correlation matrices, and more.
With visualization software and libraries being so common and accessible, it’s much easier for us to create many types of visualizations, sometimes with just the click of a button. So don’t hesitate to make multiple variations and versions of visualizations.
Depending on the data you are working with and your project agenda, you can run additional inferential statistics.
Throughout this process, review the data critically. This includes being aware of the limitations and biases embedded in the data, and making sure you document them and call them out when sharing any insight from the data with others.
Based on what you find, it’s possible you will decide to pursue a different data source for your project. That means you will have to start this exploratory analysis from the beginning again, and that’s okay. This is all part of the data analysis process.