The misadventures of a junior Data Analyst

In this article, we’d like to highlight how getting relevant data from your system is not the end of the journey for a complete Data Analysis. Once you have your data, what do you do with it? How do you organize it and present it in a meaningful way, so that you and other people can gain some actual value from it?

Data Visualization is a vital part of the Data Science process, but it is, unfortunately, often overlooked and underestimated. Even if the data we are representing is correct, the way we choose to build a graph or chart affects the way the data is perceived. Favoring one representation over another can also lead the observer to a conclusion that is not entirely correct.

As an example, today we’d like to follow the work of an inexperienced data analyst: Bob.

Introducing Bob

Bob works for a made-up hospital in Boston, Massachusetts.

The hospital has collected data on its patients and the reasons for admission (encounters) and medical treatment (procedures) they received at the Hospital.

Task: Bob is given the dataset* and asked to present this data to the management board.

His main objectives are:

  • Give an overview of the patients’ age, gender and County/City
  • Analyze the type of encounters
  • Analyze the number of procedures done in 2020 to highlight any downward trend
  • See if there is any evidence to support the investment in an Asthma Program at the Hospital

*The dataset is available here (Hospital Patients Records), and all analyses and visualizations have been done using Excel, as this is a tool generally available to everyone.

Bob’s Analysis

Data Analysis – Patients’ data

Let’s see what Bob does and how he chooses to present the data.


Patients divided by gender

Bob starts by organizing his data; we assume that this part is done correctly. He then begins his analysis with the patients’ gender and produces this chart:
He is happy to notice immediately that female patients are much more numerous than male ones.

Patients divided by age and gender

He decides to investigate this further and see whether there is any correlation between the patients’ gender and age. He produces this chart:
Bob is a little dejected: he doesn’t really understand this graph and the relation between age and gender it depicts. He decides not to include it, as he doesn’t know how to explain it.

Patients divided by county and city

After gender and age, he then decides to take into consideration the patients’ geographical data. He splits patients based on their home County and City. This is the graph he puts in his presentation:
He is happy with how the graph clearly shows that the vast majority of patients come from Boston.

Data Analysis – Encounters and procedures

Encounter types

After providing this initial data, he focuses his analysis on the patients’ encounters, which are the ways patients access the Hospital’s services. He divides the encounters by type:

Number of procedures

Satisfied with this chart, he then moves to the analysis of the number of procedures the Hospital did in 2020. His supervisor asked him to particularly focus on this year, as the Hospital experienced a decrease in the number of procedures and management wants to know how the trend is changing.

Bob looks at the data for 2020 and then decides to focus on Q2 and Q3, as he believes this highlights the downward trend that is worth investigating:


Investigating Asthma Trends and Fabrics

Finally, Bob tackles the topic of the new Asthma Program. He starts looking at data and trends online, deciding to investigate asthma prevalence in US children. Bob thinks that the fabrics in the clothes patients wear every day may be connected to the onset of asthma, and after days of research he finds online data that, in his view, show how the decline of GMO use in cotton crops in Mississippi is reducing the prevalence of asthma in children (Chart by Tyler Vigen).


At this point, he presents this data to his supervisor to get some initial feedback.

What do you think of Bob’s findings so far?

TeamPeaks Analysis

The key point we want to highlight is that Bob’s charts are, formally, correct: the data presented is accurate and has not been altered. But are these charts useful? Are they easy to read, and do they offer useful and correct insights?

Let’s look again at them and use some critical thinking:

Data Analysis – Patients’ data


Patients divided by gender

This chart has two main issues:
  1. It uses a truncated scale: the vertical axis does not start from a 0 baseline, as it should. Omitting the baseline and truncating the scale can suggest patterns or even trends that do not exist.
    In this specific example, the graph exaggerates the difference between male and female patients.

  2. Using the same color for the two bars means that whoever looks at this chart has to read the data labels (which are just letters) to understand which is which. Using a familiar color coding, especially where the colors are well established (blue/pink), makes the chart much easier to read.
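The distortion described in the first point can be put into numbers. The following is a minimal sketch in Python rather than Excel; the patient counts and the helper `drawn_height_ratio` are hypothetical and are not taken from the dataset. It compares how tall one bar is drawn relative to the other when the axis starts at zero and when it is truncated.

```python
def drawn_height_ratio(a: float, b: float, baseline: float = 0.0) -> float:
    """Ratio of the drawn bar heights for values a and b
    when the vertical axis starts at `baseline`."""
    if baseline >= min(a, b):
        raise ValueError("baseline must be below both values")
    return (a - baseline) / (b - baseline)


# Hypothetical counts, chosen only for illustration (not from the dataset).
female, male = 520, 480

true_ratio = drawn_height_ratio(female, male)            # axis starts at 0
truncated_ratio = drawn_height_ratio(female, male, 450)  # axis starts at 450

print(f"zero baseline:   female bar is {true_ratio:.2f}x the male bar")
print(f"baseline at 450: female bar is {truncated_ratio:.2f}x the male bar")
```

With these invented numbers, the real difference is about 8%, but on the truncated axis the female bar is drawn more than twice as tall as the male one.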

Let’s see how this chart can be improved:

This chart is easier to read at a glance, and the difference between male and female patients is represented more truthfully. It could also be argued that a chart is not really needed: presenting the data in text form is enough. After all, not everything needs to be a chart! But, at least, the second chart is better than the initial one.


Patients divided by age and gender

What about the relationship between age and gender, which Bob was not able to decipher?

This is an example where you really need to pick the correct chart to interpret your data.

Let’s see how the same data looks when using two chart types more suited for the analysis Bob is doing: a scatterplot or, if the target audience of the presentation has some familiarity with charts, a box plot.

While the representation initially chosen by Bob (Don’t) is hard to read, both of the visualizations proposed here (Do) highlight a clear trend: women tend to go to the hospital regularly at all ages while men tend to go to the Hospital more as their age increases. This is useful information, which was lost in the initial presentation. Data like this could suggest a campaign aimed at men, to encourage them to visit the Hospital more frequently.
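For readers less familiar with box plots: each box is drawn from five numbers per group (minimum, first quartile, median, third quartile, maximum). The sketch below computes them with Python’s standard `statistics` module. The ages are invented for illustration, `five_number_summary` is a hypothetical helper, and the "inclusive" quartile method is one of several conventions, so a spreadsheet may report slightly different quartiles.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the five values a box plot draws."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical patient ages, invented for illustration only.
ages = {
    "F": [5, 12, 21, 30, 38, 47, 55, 63, 72, 80],
    "M": [34, 41, 52, 58, 61, 66, 70, 74, 79, 85],
}

for gender, values in ages.items():
    low, q1, median, q3, high = five_number_summary(values)
    print(f"{gender}: min={low} Q1={q1} median={median} Q3={q3} max={high}")
```

In this made-up sample the female ages are spread across the whole range while the male ages are concentrated in the upper part, which is the kind of pattern the two boxes make visible side by side.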


Patients divided by county and city

Let’s now look at the analysis on the patients’ county and city.

While it is true that the pie chart clearly shows that the majority of patients come from Boston, the rest of the chart is quite unreadable. There are too many sections, and they are too small to distinguish clearly. Also, given the number of sections, the chart ends up repeating colors, which further complicates its interpretation. Another thing that is lost is the hierarchical relationship between County and City. One can only grasp that by reading the legend, which is made harder by the sections being so many and so close in color.

Let’s look at some more efficient ways to display data for these two analyses. Starting with the County/City representation, a bar chart is easier to read while still maintaining all the categories and the County/City hierarchy:

Have a look at Chart A:

Still, the difference in size between Boston and the other cities is too big and the chart is difficult to read. What about splitting this information? Remember, not everything needs to be represented with a chart!

As long as this is clearly stated, the chart can leave out the Boston data, which can instead be reported in a separate blurb (Chart B).

Separating this information also allows for other types of graphs, like a map chart. Again, it is important to clearly state that Boston has been excluded (Chart C).

Data Analysis – Encounters and procedures


Encounter types

As a general rule, pie charts are not optimal, as it is hard to compare the sizes of angles by eye. This is made worse by 3D representations. Take the second pie chart Bob presented: even though there are only 6 sections, can you easily tell the difference between the categories “wellness”, “inpatient”, “urgentcare” and “emergency”?

As for Bob’s second chart (encounter types), most of the time a bar chart is much easier to read than a pie chart. In addition, as a general rule, avoid 3D representations, as they tend to distort charts.
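To see why bars are easier to compare than slices, here is a small sketch that renders counts as sorted horizontal bars on a common zero-based scale. The counts are hypothetical, only the four categories named above are included, and `text_bar_chart` is an illustrative helper, not part of any charting tool.

```python
def text_bar_chart(counts, width=40):
    """Render counts as horizontal bars, largest first,
    all measured from the same zero baseline."""
    largest = max(counts.values())
    lines = []
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    for label, value in ordered:
        bar = "#" * round(value / largest * width)
        lines.append(f"{label:<12}{bar} {value}")
    return lines


# Hypothetical encounter counts, invented for illustration only.
encounters = {"wellness": 190, "inpatient": 170, "urgentcare": 150, "emergency": 210}

print("\n".join(text_bar_chart(encounters)))
```

Values that would produce nearly identical slices in a pie chart end up as bars of visibly different lengths, and sorting them makes the ranking explicit.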


Number of procedures

Let’s now consider Bob’s next chart: he wanted to highlight a decrease in the number of procedures performed at the end of 2020 Q3.
While the chart was correct, restricting the time period too much may lead to incorrect assumptions about trends and performance.
Let’s look at the same chart for all of 2020, not only Q2 and Q3: while the decrease at the end of Q3 is still present, the chart clearly shows a big increase in the number of procedures right after that. Excluding this information may not have been intentional, but it easily leads to wrong assumptions about the Hospital’s performance and future outlook.
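The effect of the chosen time window can be checked with simple arithmetic. The sketch below uses hypothetical monthly counts, invented to mimic the shape described above (a decline through Q3 followed by a recovery), and a hypothetical helper `percent_change` that compares the first and last value of a series.

```python
def percent_change(series):
    """Percent change from the first to the last value of a series."""
    return (series[-1] - series[0]) / series[0] * 100


# Hypothetical monthly procedure counts for 2020 (January to December),
# invented for illustration; they are not taken from the dataset.
monthly = [400, 390, 380, 370, 340, 310, 280, 250, 220, 330, 410, 450]

q2_q3 = monthly[3:9]  # April through September

print(f"Q2-Q3 only: {percent_change(q2_q3):+.1f}%")
print(f"Full year:  {percent_change(monthly):+.1f}%")
```

With these made-up numbers, the restricted window shows a steep fall while the full year ends higher than it started: same data, opposite impressions.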

Investigating Asthma Trends and Fabrics

And what about the last chart? Bob wanted to show a link between GMO use in cotton crops and asthma prevalence. This was a complete error: there is no scientific evidence of a causal connection between the two. Bob made a very basic mistake for a junior Data Analyst: correlation does not mean causation! In fact, there are thousands of similar correlations that could be built. Check, for instance, how closely the distance between Uranus and Earth and asthma prevalence in American children correlate with each other:

Both correlation charts, courtesy of Tyler Vigen, are available at Spurious Correlations, a site dedicated to collecting data that seem to be related but are, in fact, independent from each other.
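It is easy to reproduce this effect: any two quantities that happen to move in the same direction over the same period will show a strong correlation. The sketch below computes the Pearson correlation coefficient for two made-up yearly series that have nothing to do with each other; the numbers are invented and are not the data behind the charts above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Two made-up yearly series with no causal link; both simply decline over time.
series_a = [9.6, 9.4, 9.3, 9.0, 8.8, 8.5, 8.4, 8.1]
series_b = [88, 85, 84, 80, 79, 75, 72, 70]

print(f"r = {pearson(series_a, series_b):.2f}")
```

The coefficient comes out very close to 1, yet it says nothing about one series influencing the other: a shared trend over time is enough to produce it.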

Let’s wrap up

To summarize, here are 10 important points to keep in mind to avoid errors in data visualization (and to help Bob perform better next time):

  1. Always focus on your target audience and what your point is.
  2. Baseline and scale: it is generally best to use visualizations with a zero-baseline y-axis.
  3. Focus on a selection of colors and don’t ignore established visual associations.
  4. Use charts only when they really add value to your analysis.
  5. Select the best chart type for each visualization.
  6. Avoid 3D representations: most of the time 2D is enough.
  7. Be mindful of the volume of data used in a chart and remember that multiple visualizations can help communicate data more efficiently.
  8. Use clear and unbiased text.
  9. Be honest and avoid bending the charts to suit your expectations.
  10. Correlation is not causation.