How to read a histogram, min, max, median & mean
Datawrapper offers powerful tools to understand the numeric data you uploaded. If you are in step 2: Describe, you can click on the header of a column with numbers to display a histogram, the min, max, median, mean, and the number of potentially invalid values. Here's a quick explanation of what all of these mean. Scroll down for more in-depth explanations with examples.
- Value distribution (histogram): Shows how the values in your column are distributed. The higher the bar, the more values fall in a range.
- Min & Max: Shows you the lowest (Min) and the highest (Max) value in your column.
- Mean: Also called "average": Sums up all the values in your column and divides them by the number of values.
- Median: Gives you the value that would be in the middle of an ordered list of your values. Ignores outliers.
Let's introduce some sample data. We have 101 data rows in each column. "Berries" is filled with the numbers from 0 to 100. "Lemons" is filled with the numbers from 0 to 95 and with five "5000"s. "Apples" is filled with random data between 0 and 100.
Value distribution (histogram)
The histogram is a chart that tells us how our values are distributed in the column we selected. This is a great way to understand which ranges of values occur most and least often: which salaries are most common, which survey replies were chosen the least, or which range of unemployment rates most counties have to deal with.
Here's how it works: A histogram automatically creates even ranges between our lowest and highest values and tells us how many of our values fall within these ranges. The higher the bar, the more values fall in a range.
That's the histogram of our random data (Apples):
We can see that the range between 80 and 90 has the highest bar, so more values fall into it than into any other range.
Attention: The values 80 and 89.99999 count as part of the range between 80 and 90. But the value 90 counts as part of the last range, the one between 90 and 100. Every range works this way: it includes its lower boundary but not its upper one. The only exception is the highest value, 100. To avoid creating a new range just for this one value, 100 counts as part of the 90 to 100 range. We can see this better when we ask Datawrapper to show us the histogram for Berries:
Berries contains 101 values. The values 0 to 9.999 are in the first range; the values 10 to 19.999 are in the second range and the values 90 to 99.9999 plus the value 100 are in the last range. You can hover over the individual bars to check how many values fall in the range.
To understand that a histogram creates even ranges between the lowest and the highest value, let's look at the Lemons column. Because of its five outliers (5 times 5000), all the smaller values between 0 and 95 are in a single 0-500 range:
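The binning rule described in this section can be sketched in a few lines of Python. This is a minimal illustration, not Datawrapper's actual implementation: the function name `histogram_counts` and the choice of ten ranges are assumptions made to match the Berries and Lemons examples.

```python
def histogram_counts(values, bins=10):
    """Count how many values fall into each of `bins` even ranges.

    Each range includes its lower boundary but not its upper one,
    except the last range, which also includes the highest value.
    The default of 10 ranges is an assumption for this sketch.
    """
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / bins
    counts = [0] * bins
    if width == 0:
        # All values are identical, so they all share the first range.
        counts[0] = len(values)
        return counts
    for value in values:
        index = int((value - lowest) / width)
        # The highest value would land in a range of its own,
        # so it is folded into the last range instead.
        index = min(index, bins - 1)
        counts[index] += 1
    return counts


berries = list(range(101))              # 0, 1, 2, ..., 100
lemons = list(range(96)) + [5000] * 5   # 0 to 95, plus five 5000s

print(histogram_counts(berries))  # [10, 10, 10, 10, 10, 10, 10, 10, 10, 11]
print(histogram_counts(lemons))   # [96, 0, 0, 0, 0, 0, 0, 0, 0, 5]
```

For Berries, every range holds ten values except the last one, which holds eleven because it also takes the value 100. For Lemons, the five outliers stretch the ranges to a width of 500, so all 96 small values end up in the first bar.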
Min & Max
The Min and the Max tell you the lowest and the highest value in your column. This is pretty straightforward: The Min for Berries and Lemons is zero; the Max for Berries is 100 and the Max for Lemons is 5000.
Mean & Median
The Mean is the average that most of us are familiar with. It gets calculated like this: First, we sum up all our values. In the case of Berries, that's 0+1+2+3+4+5+...+99+100, so that's 5050. Then, we divide that sum by the number of values. We have 101 values in our Berries column, and 5050 divided by 101 is 50. That's our mean: 50.
Calculating the Median is even simpler: We sort all our values in our head from low to high (0, 1, 2, 3, 4, 5, ..., 99, 100). Then we check the value in the middle. This value is our median. In the case of our Berries, that's 50: There are 50 values before it and 50 values after it.
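The two calculations above can be written out as a short Python sketch. The function names `mean` and `median` are chosen for illustration. Averaging the two middle values for a column with an even number of rows is the standard convention, but it is not something this article covers.

```python
def mean(values):
    # Sum up all values, then divide by the number of values.
    return sum(values) / len(values)


def median(values):
    # Sort the values from low to high and pick the one in the middle.
    ordered = sorted(values)
    count = len(ordered)
    middle = count // 2
    if count % 2 == 1:
        return ordered[middle]
    # With an even number of values there is no single middle value,
    # so the usual convention is to average the two middle ones.
    return (ordered[middle - 1] + ordered[middle]) / 2


berries = list(range(101))  # 0, 1, 2, ..., 100

print(sum(berries), len(berries))  # 5050 101
print(mean(berries))               # 50.0
print(median(berries))             # 50
```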
For our Berries column, the mean and the median are the same number: 50. That's because the values in this column are evenly spaced, with exactly the same interval between one number and the next. Let's look at the Lemons column instead:
Our Mean got almost six times bigger (!), but the Median is still 50. That's because it doesn't matter for the median if the values above the middle value are very close to it or thousands of numbers away. The median just sorts the values, counts how many there are, and then checks what the one in the middle is. That's why the Median is a useful measure: The Median ignores outliers.
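To check these numbers yourself, here is a small sketch using Python's built-in `statistics` module. The lists `berries` and `lemons` are rebuilt from the description of the sample data, so treat them as an approximation of the actual columns.

```python
import statistics

berries = list(range(101))              # 0, 1, 2, ..., 100
lemons = list(range(96)) + [5000] * 5   # 0 to 95, plus five 5000s

berries_mean = statistics.mean(berries)
berries_median = statistics.median(berries)
lemons_mean = statistics.mean(lemons)
lemons_median = statistics.median(lemons)

print(berries_mean, berries_median)          # both are 50
print(round(lemons_mean, 2), lemons_median)  # about 292.67, and still 50
```

The sum of the Lemons column is 29560, and 29560 divided by 101 is roughly 292.67. The five outliers pull the mean far away from the bulk of the data, while the value in the middle of the sorted list stays at 50.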
This can be useful for many reasons, e.g. when we look at salaries: Let's imagine that a company has 101 people on its payroll: 96 employees, each one earning between nothing and 95 dollars, and five bosses, earning 5000 dollars each. It wouldn't be fair to say that a typical person in this company makes about 300 dollars (the mean). It would be closer to the truth to say that a typical person earns 50 dollars (the median). When it comes to salaries, the Median can literally tell us what the "man in the middle" makes.
ProTip: You can hover over the values for Min, Max, Mean and Median to display them in the histogram:
"Invalid values" is a number that we only show you when you actually have invalid values. Most often, these are letters or words hidden in your column of numbers. Datawrapper tells you the absolute number of invalid values. It also shows you what percentage of the whole column is made up of invalid values. This share can help you decide whether your data is usable or not.
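As a rough illustration of the count and the share, here is a Python sketch. How Datawrapper decides that a cell is invalid is not described here, so the rule in `is_valid_number` (anything that cannot be read as a finite number) is an assumption, and the column contents are hypothetical.

```python
import math


def is_valid_number(cell):
    """Assumed rule: a cell is valid if it can be read as a finite number."""
    try:
        return math.isfinite(float(cell))
    except (TypeError, ValueError):
        return False


# Hypothetical column: the Berries values as text,
# with six cells overwritten by words or symbols.
column = [str(number) for number in range(101)]
replacements = {3: "n/a", 17: "none", 42: "ten", 58: "?", 76: "missing", 99: "tbd"}
for position, word in replacements.items():
    column[position] = word

invalid_count = sum(not is_valid_number(cell) for cell in column)
invalid_share = invalid_count / len(column) * 100

print(invalid_count)            # 6
print(round(invalid_share, 1))  # 5.9
```

In this made-up column, six of the 101 cells are invalid, which is a share of about 5.9 percent.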
Here's the Berries column with six invalid values:
We hope that this tutorial and the information we show in and below the histogram help you understand your data better. If you still have questions, don't hesitate to get in touch with us at firstname.lastname@example.org.