How to Create an Age Distribution Graph Using Python, Pandas and Seaborn

Have you ever wondered how to create an age distribution graph using Python, Pandas and Seaborn? If so, keep reading to find out how!

Figure 1: Here is the graph we'll learn to build in this tutorial

Setup

First, here is the GitHub repo for this tutorial: Kaggle Titanic Project

We'll be working with the contents of the file age-distribution-graph.ipynb for this tutorial.

Note: We'll be working with Jupyter Notebook for this tutorial, so if you don't have it installed, you can get it from the official Jupyter website.

Development

After opening up age-distribution-graph.ipynb you'll notice that the code is divided up into blocks that can be run individually.

Let's go through each code block one by one:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
warnings.filterwarnings("ignore")

Here we import the libraries needed to build the histogram. We'll use Seaborn's histplot function to draw it (see the Seaborn docs page for more on histplot). The numpy and os imports are not used by the code blocks shown in this tutorial. The warnings.filterwarnings("ignore") line adds a filter that matches every warning and silences it, which keeps the notebook output clean (see the official Python docs page for more on warnings.filterwarnings()).
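To make the effect of warnings.filterwarnings("ignore") concrete, here is a minimal sketch using only the standard library. The noisy() helper is hypothetical and only stands in for library code that emits a warning. The last case shows a narrower filter that hides a single category, which may be preferable if you still want to see other warnings.

```python
import warnings


def noisy():
    # Hypothetical helper standing in for library code that emits a warning
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


# 1. Record every warning so we can count them
with warnings.catch_warnings(record=True) as caught_all:
    warnings.simplefilter("always")
    noisy()

# 2. The same call the notebook uses: every warning is silenced
with warnings.catch_warnings(record=True) as caught_ignored:
    warnings.filterwarnings("ignore")
    noisy()

# 3. A narrower filter: hide only DeprecationWarning, keep everything else
with warnings.catch_warnings(record=True) as caught_narrow:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    noisy()
    warnings.warn("something else", UserWarning)

print(len(caught_all), len(caught_ignored), len(caught_narrow))  # 1 0 1
```

Filters added later are checked first, which is why the category filter in the third case wins over the earlier "always" filter.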

Next, we add the following code block:

def read_data():
    train_data = pd.read_csv("data/train.csv")
    test_data = pd.read_csv("data/test.csv")
    return train_data, test_data

train_data, test_data = read_data()

Here we define the read_data() function, which loads data/train.csv and data/test.csv into two Pandas DataFrame objects (see the official Pandas docs for more on DataFrame). After the call on the last line, the train_data variable contains the training data and the test_data variable contains the testing data.
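If you'd like to see what pd.read_csv() produces without downloading the dataset, here is a small sketch that reads a CSV from an in-memory string. The three rows are made up for illustration and only mimic a few column names; the real train.csv has more columns and rows. It also shows that an empty Age cell is loaded as a missing value, which is worth checking before plotting.

```python
import io

import pandas as pd

# Made-up rows for illustration only; not taken from the real train.csv
sample_csv = io.StringIO(
    "PassengerId,Survived,Age\n"
    "1,0,22\n"
    "2,1,38\n"
    "3,1,\n"  # empty Age cell
)

sample = pd.read_csv(sample_csv)

print(sample.shape)                # (3, 3)
print(sample["Age"].isna().sum())  # 1 missing age
print(sample.dtypes)               # column types inferred by pandas
```

On the real data, the same isna().sum() call on train_data["Age"] tells you how many passengers have no recorded age.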

Next we can add the following code:

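def survived_age_table(feature):
    # Draw one overlaid histogram of `feature` per value of the Survived column
    ax = sns.histplot(data=train_data, x=feature, hue='Survived', palette=['yellow', 'green'])
    ax.set_title(f"{feature} Vs Survived")
    plt.legend(labels=['Died', 'Survived'])
    plt.show()

survived_age_table('Age')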

This function is responsible for creating the age distribution graph. Here are some more details about it:

  • First we create the histogram by calling sns.histplot() (more on this function can be found in the official Seaborn docs).
  • The data parameter takes an input data structure, which is a pandas.DataFrame in our case.
  • The x parameter names the variable whose values are binned and counted, which in this case is Age.
  • The hue parameter, Survived in our case, splits the data into subsets, one per distinct value. Each subset gets its own histogram and color, and all of them are rendered in the same graph.
  • The palette parameter chooses the colors to use when mapping the hue variable.
  • sns.histplot() returns a Matplotlib Axes object, and calling set_title() on it sets the title of the histogram.
  • The plt.legend() function customizes the labels displayed in the legend box located in the top right of the histogram.
  • Finally, plt.show() displays our histogram.

And here is our finished histogram:

Figure 2: Our Finished Histogram

Thanks for following along and I hope this article was helpful to you.

Conclusion

Well, that's it for this post! If you have any questions or concerns, please feel free to leave a comment on this post and I will get back to you when I find the time.

If you found this article helpful, please share it, and make sure to follow me on Twitter and GitHub, connect with me on LinkedIn, and subscribe to my YouTube channel.