EDA: Penguins
Getting setup
For a more realistic example of shellplot in a data science usecase, we will walk through an exploratory data analysis (EDA) of the penguins data.
We first import pandas and shellplot and set the pandas plotting backend:
>>> import pandas as pd
>>> import shellplot as plt
>>> pd.set_option("plotting.backend", "shellplot")
For convenience, the penguins dataset can be directly loaded from shellplot:
>>> df = plt.load_dataset("penguins")
>>> df.sample(5)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
Gentoo Biscoe 47.5 15.0 218.0 4950.0 FEMALE
Adelie Dream 38.9 18.8 190.0 3600.0 FEMALE
Gentoo Biscoe 47.2 15.5 215.0 4975.0 FEMALE
Gentoo Biscoe 46.4 15.6 221.0 5000.0 MALE
Adelie Biscoe 36.4 17.1 184.0 2850.0 FEMALE
Exploring features
Histograms offer a nice way to explore numeric features. For example, we can plot the distribution of penguin body masses:
>>> df["body_mass_g"].hist(bins=10)
counts
72┤ -----
| | |
| | |
| | |
| | |
| | |
54┤ | |-----
| | | |
| | | |
| -----| | |
| | | | |----- -----
| | | | | | |
36┤ | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |----- -----
| | | | | | | | |
| | | | | | | | |
18┤ | | | | | | | |
| -----| | | | | | | |-----
|| | | | | | | | | |
|| | | | | | | | | |
|| | | | | | | | | |-----
|| | | | | | | | | | |
0┤| | | | | | | | | | |
└┬-----------┬-----------┬-----------┬-----------┬-----------┬---------
2700 3420 4140 4860 5580 6300
body_mass_g
Boxplots provide a nice way to visualize multiple distributions at once:
>>> df.boxplot(column=["bill_length_mm", "bill_depth_mm"])
|
|
| ---
|| | | | |
|| | | | |
|| | | | |
bill_depth_mm┤|---| | |---|
|| | | | |
|| | | | |
|| | | | |
| ---
|
|
|
| ------------
| | | | | |
| | | | | |
| | | | | |
bill_length_mm┤ |----------| | |---------------|
| | | | | |
| | | | | |
| | | | | |
| ------------
|
|
└┬------------┬------------┬-------------┬------------┬------------┬---
13 22 31 40 49 58
For categorical features, bar plots can be useful. We can check which penguin species are found on which islands in the data:
>>> df[["species", "island"]].value_counts().plot.barh()
|
|------------------------┐
| |
('Adelie', 'Biscoe')┤ |
| |
|-----------------------------┐
| |
('Adelie', 'Torgersen')┤ |
| |
|-------------------------------┐
| |
('Adelie', 'Dream')┤ |
| |
|--------------------------------------┐
| |
('Chinstrap', 'Dream')┤ |
| |
|--------------------------------------------------------------------┐
| |
('Gentoo', 'Biscoe')┤ |
| |
|--------------------------------------------------------------------
└┬-------------┬-------------┬------------┬-------------┬-------------┬
0 25 50 75 100 125
Multivariate plots
Next, we’ll have a look at how features vary across certain categories in the data.
For example, we can analyse how the distribution of bill lengths varies across the three penguin species:
>>> df.boxplot(column=["bill_length_mm"], by="species")
species
|
|
| ---------
| | | | | |
Gentoo┤ |----------| | |------------------------|
| | | | | |
| ---------
|
|
|
| -----------
| | | | | |
Chinstrap┤ |------------| | |----------------|
| | | | | |
| -----------
|
|
|
| ---------
|| | | | |
Adelie┤|-----------| | |-----------|
|| | | | |
| ---------
|
|
└┬--------------┬--------------┬-------------┬--------------┬----------
32 38 44 50 56
bill_length_mm
We can also explore combinations of features. Let’s start by looking at both the bill and flipper lengths vary across species:
>>> plt.plot(df["bill_length_mm"], df["flipper_length_mm"], color=df["species"])
flipper_length_mm
232┤ o o o o o o o
| ooo o o o
| o ooo
| o oooo
| o oo o oo ooo oooo o
| o o o o o o oo o o
217┤ o o oo oo oo o
| o ooo oooooooooo o
| oo o o o
| + ooooooo ooooo o oo* * *
| o * *
| + o * ** * *
202┤ + + * ** ** *
| + ++++ ++ ++ * * * **** *
| +++ ++++ + *+ ** *****
| + +++++ ++ +++++++ + + * * * *** * **
| + + + ++ + ++++ + *+*** *
|+ + ++ ++++ +++++++++ + **** *
187┤ + + + + ++ +++ +* * * * *
| ++ + +++++ + + + *
| + ++ + ++ * *
| + + + +
| + + + + * + Adelie
| + * Chinstrap
172┤ + o Gentoo
└┬--------------┬--------------┬-------------┬--------------┬----------
32 38 44 50 56
bill_length_mm
To be continued!