At Spry Group, we like to use the right visualization tool for the right project. Sometimes charts like Sankey diagrams or sunburst charts are needed to visualize relationships between intricate systems. Other times, it’s best to go simple.
When it comes to understanding the key findings from statistical datasets, we recommend the box plot. Also known as the box chart or box-and-whisker plot, this chart visually lays out data by quartile and plots outliers. When comparing results across several experiments, it’s key for showing how the distributions differ from one another.
Let’s take a look at how box plots are used today.
How the Box Plot Works
Box plots lay out the distribution of data sets so users can visualize medians, spread, and outliers. Data is broken into quartiles, with the 25th and 75th percentiles forming the lower and upper edges of the box component of the chart.
Let’s walk through an example to see the box plot come together. Below we have a chart of January temperature lows for City A:
Sample of Data Set
The resulting box plot looks like this:
The lower border of the box represents the 25th percentile (4.75 degrees) while the upper border represents the 75th percentile (19.25 degrees). Dividing the box is a line representing the median, or 50th percentile (9.5 degrees). In our chart, the mean is noted by the “X” (11.5 degrees).
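As a rough sketch of how those four numbers can be computed, the Python snippet below uses the standard library's statistics module. The temperature values are hypothetical and are not the City A data, and the "inclusive" quartile method is an assumption; other tools use slightly different quartile conventions and may return slightly different box edges.

```python
import statistics

# Hypothetical daily lows in degrees (NOT the City A data from the chart).
lows = [-5, 0, 4, 6, 9, 12, 18, 21, 25]

def box_summary(values):
    """Return the statistics drawn in the box: Q1, median, Q3, and mean."""
    # quantiles(n=4) returns the three cut points between the four quartiles.
    # method="inclusive" is an assumed convention, not the only valid one.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "q1": q1,          # lower border of the box (25th percentile)
        "median": median,  # line dividing the box (50th percentile)
        "q3": q3,          # upper border of the box (75th percentile)
        "mean": statistics.mean(values),  # the "X" marker
    }

summary = box_summary(lows)
print(summary)
```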
The outer lines (whiskers) in this box plot represent the minimum and maximum data points, excluding outliers (-5 and 25 degrees). Whiskers can also represent other data points, including:
- 2nd and 98th percentiles
- 9th and 91st percentiles
- One standard deviation above and below the data’s mean.
Finally, box plots can also include outliers. In our chart, we can see one January day had a low temperature of 42 degrees.
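How a point gets flagged as an outlier depends on the whisker convention in use. The sketch below assumes one common convention that the chart above may or may not follow: any point more than 1.5 times the interquartile range beyond the box is treated as an outlier, and the whiskers stop at the most extreme points that remain. The data and the function name are hypothetical.

```python
import statistics

def whiskers_and_outliers(values, k=1.5):
    """Return (lower whisker, upper whisker, outliers) using k * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Points inside the fences determine where the whiskers end.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    # Points outside the fences are plotted individually as outliers.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return min(inside), max(inside), outliers

# Hypothetical daily lows with one unusually warm day (NOT the City A data).
lows = [-5, 0, 4, 6, 9, 12, 18, 21, 25, 60]
result = whiskers_and_outliers(lows)
print(result)
```

Changing k, or swapping the fences for fixed percentiles, gives the other whisker styles listed above.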
Multiple Box Plots
The advantage of box plots comes when comparing similar sets of data to each other, such as findings from replicated experiments or test scores across multiple schools.
Let’s see how January lows for City A compare to Cities B and C:
Rather than look at roughly 90 days’ worth of temperature data, the box plots allow us to visualize basic statistical information to draw conclusions. City C had the least volatile temperature lows of the three cities. While City B had wide fluctuations in daily lows, its temperatures overall were higher than City A’s. Although City A had the coldest temperatures, it was also the only one to have an unusually warm day.
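The same kind of comparison can be approximated numerically. The sketch below uses made-up values for three cities (not the data behind the chart) and treats the interquartile range as a stand-in for "volatility", which is an assumption about how to quantify what the eye reads from the box heights.

```python
import statistics

def spread(values):
    """Summarize a group by its median, interquartile range, and full range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"median": median, "iqr": q3 - q1, "range": max(values) - min(values)}

# Hypothetical daily lows per city (illustrative only).
cities = {
    "A": [-5, 0, 4, 6, 9, 12, 18, 21, 25],
    "B": [2, 8, 12, 15, 20, 24, 30, 33, 38],
    "C": [10, 11, 12, 13, 14, 15, 16, 17, 18],
}

stats = {name: spread(values) for name, values in cities.items()}

# The city with the smallest IQR has the shortest box, i.e. the least spread.
least_volatile = min(stats, key=lambda name: stats[name]["iqr"])
print(stats)
print(least_volatile)
```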
The Evolution of the Box Plot
The box plot’s origins go back to the range-bar, first used in the 1950s. This bar graph variation shows the range of a dependent variable across a data set:
Source: NSCU
In the above chart, we can see the temperature ranges (the dependent variable) across the entire month.
Mathematician John Wilder Tukey took this visualization one step further to incorporate statistical references for the entire data set. He first used the box plot as one of his tools for exploratory data analysis during the 1970s and formally introduced it in his 1977 book “Exploratory Data Analysis.” Since then, the box plot has become one of the most common statistical graphics.
Despite its popularity, Tukey’s box plot does have its shortcomings. What if the data sets being compared vary widely in size? How do we know if the differences between their medians are statistically significant?
The variable width box plot answers the first question. The width of each box is adjusted in proportion to the size of its data group. Typically the width is proportional to the square root of the data set’s size (for example, a data set of 100 would be represented by a width of 10).
Source: OriginLab
In the above chart, the width of each box corresponds to the number of vehicles in each region. Given the high population density of the East Coast, it makes sense to have the highest vehicle volume.
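The square-root rule can be sketched in a few lines of Python. The group sizes, the function name, and the max_width scaling parameter below are all hypothetical; a charting tool would normally handle this scaling internally.

```python
import math

def box_widths(group_sizes, max_width=0.8):
    """Scale box widths in proportion to sqrt(n); the largest group gets max_width."""
    roots = {name: math.sqrt(n) for name, n in group_sizes.items()}
    largest = max(roots.values())
    return {name: max_width * root / largest for name, root in roots.items()}

# Hypothetical group sizes (not the vehicle counts from the chart).
sizes = {"large": 400, "medium": 100, "small": 25}
widths = box_widths(sizes)
print(widths)
```

Because widths follow the square root, a group with four times as many observations gets a box only twice as wide.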
This chart also has another difference: the narrowing of the width around each median. That’s because this chart is also a notched box plot and answers our second question by showing the confidence intervals around the medians of each data set. The 95 percent confidence interval is most commonly used.
The notch range is typically calculated by the following:
- Median + 1.57 × interquartile range / square root of n
- Median - 1.57 × interquartile range / square root of n
Let’s look at the Midwest chart as an example. If the notch were calculated using the above formulas, we can be about 95 percent confident the Midwest’s true median falls within the notch.
By convention, if the notches of two box plots do not overlap, there is evidence that their medians are statistically different. In the above chart, we can infer that the Gulf Coast and West Coast medians are statistically different from all other regions.
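A minimal Python sketch of the notch formula and the overlap check follows. The sample groups are hypothetical (not the regional vehicle data), and the "inclusive" quartile method is an assumed convention that affects the interquartile range slightly.

```python
import math
import statistics

def notch(values):
    """Return (lower, upper) notch bounds: median -/+ 1.57 * IQR / sqrt(n)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / math.sqrt(len(values))
    return median - half_width, median + half_width

def notches_overlap(a, b):
    """True if the notch intervals of two groups intersect."""
    a_low, a_high = notch(a)
    b_low, b_high = notch(b)
    return a_low <= b_high and b_low <= a_high

# Hypothetical groups for illustration only.
group_a = [-5, 0, 4, 6, 9, 12, 18, 21, 25]
group_b = [10, 11, 12, 13, 14, 15, 16, 17, 18]
group_c = [30, 31, 32, 33, 34, 35, 36, 37, 38]

print(notch(group_b))
print(notches_overlap(group_a, group_b))  # overlapping notches: no clear evidence
print(notches_overlap(group_b, group_c))  # separated notches: medians likely differ
```

Overlapping notches do not prove the medians are equal; they only mean this rule of thumb finds no evidence of a difference.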
Turn to Spry Group for Your Data Visualization Needs
While box plots are great for high-level overviews of data sets, they only provide a glimpse into the insights you can pull from your data. The key is using the right combination of visualization tools for the job. That’s where the Spry Group team comes in. From carbon emissions tracking to power generation management, our expertise in data visualization and sustainability makes it easy to align green efforts with financial impact. Together we can help your clients and employees better visualize your big data. Let us help you next.