Scientific Python intro: Pandas¶

Last update: 30. Jun 2019

Pandas: Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more https://pandas.pydata.org

[https://github.com/pandas-dev/pandas]

Pandas data frames are:

a non-homogeneous tabular data structure,
built using NumPy arrays, so also vectorized and (C) fast, but also lazy,
convenient to transform, summarize, and plot.

Prerequisites¶

Python 3
Python IDE
Project folder w/ virtual environment set up
Pandas + Matplotlib:
```
$ pip install pandas matplotlib
```

import pandas as pd # Pandas import convention

# Inline, sharp plots
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Intro¶

"Tidy"/statistical data = 2D table (matrix), where:

rows contain observations or samples
columns contain attributes, or features, or variables

# Brain size and weight and IQ data (Willerman et al. 1991)
# Downloaded from Scipy Lecture Notes on 30. Jun 2019:
# http://scipy-lectures.org/_downloads/brain_size.csv
#
!head -n 6 data/brain_size.csv

"";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
"1";"Female";133;132;124;"118";"64.5";816932
"2";"Male";140;150;124;".";"72.5";1001121
"3";"Male";139;123;150;"143";"73.3";1038437
"4";"Male";133;129;128;"172";"68.8";965353
"5";"Female";137;132;134;"147";"65.0";951545

df = pd.read_csv("data/brain_size.csv", sep=";", na_values=".", index_col=0)

df.head(5)

print(df.shape)
print()
print(df.columns)
print()
print(df.dtypes)
print()
print(df.index)

(40, 7)

Index(['Gender', 'FSIQ', 'VIQ', 'PIQ', 'Weight', 'Height', 'MRI_Count'], dtype='object')

Gender        object
FSIQ           int64
VIQ            int64
PIQ            int64
Weight       float64
Height       float64
MRI_Count      int64
dtype: object

Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
            18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
            35, 36, 37, 38, 39, 40],
           dtype='int64')

Side note: Pandas has a great date-time support, including indexing using date-time ranges or groups. See, e.g., https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html#Pandas-Time-Series:-Indexing-by-Time

Indexing and slicing¶

Pandas data frames can be used as Python dictionaries, with keys being column names:

print(df['Gender'][:3])
print()
print(df.keys())

1    Female
2      Male
3      Male
Name: Gender, dtype: object

Index(['Gender', 'FSIQ', 'VIQ', 'PIQ', 'Weight', 'Height', 'MRI_Count'], dtype='object')

# access col as attribute
print(df.Gender[:3])
print()
# access multilple cols
print(df[["Gender", "Weight"]][:3])
print()

1    Female
2      Male
3      Male
Name: Gender, dtype: object

   Gender  Weight
1  Female   118.0
2    Male     NaN
3    Male   143.0

Beware: slicing (with integers) using indexing operator [] goes over rows, not columns:

# NOTE: our data (row) index is subsequent integers, but starts at 1, not at 0 as in Python
df[1:3] # 1:3 == [1, 2]

To avoid confusion, for slicing and indexing using row (and columns) names or indices best use, respectively, loc or iloc properties:

print(df.loc[:3, "Gender":"Weight"]) # use "labels", so :3 == [1, 2, 3]
print()
print(df.iloc[:3, :-2]) # use Python indexing, so :3 == [0, 1, 2]
print()
print(all(df.loc[1] == df.iloc[0]))

   Gender  FSIQ  VIQ  PIQ  Weight
1  Female   133  132  124   118.0
2    Male   140  150  124     NaN
3    Male   139  123  150   143.0

   Gender  FSIQ  VIQ  PIQ  Weight
1  Female   133  132  124   118.0
2    Male   140  150  124     NaN
3    Male   139  123  150   143.0

True

Fancy indexing¶

Works as in NumPy:

print(df.loc[df.Weight > 150, "Gender"])

4       Male
8     Female
10      Male
12      Male
13      Male
14    Female
18      Male
20      Male
22      Male
26      Male
28      Male
30    Female
32      Male
33      Male
34      Male
36    Female
40      Male
Name: Gender, dtype: object

Note: w/o loc, bool masks index by rows, same as in slicing

len( df[(df.Weight > 150) & (df.Gender == "Male")] )

13

Pandas objects¶

Series represents a single column/vector/data series.
DataFrame consists of multiple Series (is a dictionary/list of row/column Series).
DataFrameGroupBy is created by grouping DataFrame rows by values of one of the column Series.

df_short = df.iloc[:2, 4:6]

print(df_short)
print(type(df))

   Weight  Height
1   118.0    64.5
2     NaN    72.5
<class 'pandas.core.frame.DataFrame'>

print(df_short["Weight"])
print(type(df_short["Weight"]))
print()
print(df_short.loc[1])
print(type(df_short.loc[1]))

1    118.0
2      NaN
Name: Weight, dtype: float64
<class 'pandas.core.series.Series'>

Weight    118.0
Height     64.5
Name: 1, dtype: float64
<class 'pandas.core.series.Series'>

df_bygender = df[["Gender", "Weight", "Height"]].groupby("Gender")
df_bygender.describe()

Transforming and summarizing¶

See Data Wrangling with pandas Cheat Sheet for a great visual overview of ways of how tidy data frames can be transformed and summarized.

Notably, transfromations are lazy - they won't execute until actually needed.

Grouping¶

df_bygender = df.groupby("Gender")
print(df_bygender)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11436fa20>

Compare mean brain weight with mean weight per gender:

print(df.Weight.mean())
print()
print(df_bygender.Weight.mean())

151.05263157894737

Gender
Female    137.200000
Male      166.444444
Name: Weight, dtype: float64

Grouping is simply a collection of splitted data frames:

for gender, gender_df in df_bygender:
    print(gender)
    print(gender_df.describe())
    print()

Female
             FSIQ         VIQ         PIQ      Weight     Height     MRI_Count
count   20.000000   20.000000   20.000000   20.000000  20.000000      20.00000
mean   111.900000  109.450000  110.450000  137.200000  65.765000  862654.60000
std     23.686327   21.670924   21.946046   16.953807   2.288248   55893.55578
min     77.000000   71.000000   72.000000  106.000000  62.000000  790619.00000
25%     90.250000   90.000000   93.000000  125.750000  64.500000  828062.00000
50%    115.500000  116.000000  115.000000  138.500000  66.000000  855365.00000
75%    133.000000  129.000000  128.750000  146.250000  66.875000  882668.50000
max    140.000000  136.000000  147.000000  175.000000  70.500000  991305.00000

Male
             FSIQ         VIQ         PIQ      Weight     Height     MRI_Count
count   20.000000   20.000000   20.000000   18.000000  19.000000  2.000000e+01
mean   115.000000  115.250000  111.600000  166.444444  71.431579  9.548554e+05
std     24.986312   25.640993   23.540335   20.047656   3.283131  5.591135e+04
min     80.000000   77.000000   74.000000  132.000000  66.300000  8.799870e+05
25%     89.750000   95.250000   86.000000  148.750000  68.900000  9.195292e+05
50%    118.000000  110.500000  117.000000  172.000000  70.500000  9.472415e+05
75%    139.250000  145.000000  128.000000  180.750000  73.750000  9.734960e+05
max    144.000000  150.000000  150.000000  192.000000  77.000000  1.079549e+06

Plotting¶

A lot of default ready-to-go Matplotlib plots are available via plot property of Pandas objects:

df.plot.scatter("Weight", "Height")

<matplotlib.axes._subplots.AxesSubplot at 0x11446bd30>

... or in pandas.plotting module:

import matplotlib.pyplot as plt
from pandas import plotting

df_iqs = df[['PIQ', 'VIQ', 'FSIQ']]

plt.figure();
plotting.scatter_matrix(df_iqs);
#plt.show() # Use in script

plt.figure();
plotting.boxplot(df_iqs);
#plt.show() # Use in script

<Figure size 432x288 with 0 Axes>

.. or directly on the DataFrameGroupBy objects:

df_bygender.boxplot(column=["FSIQ", "VIQ", "PIQ", "Weight", "Height"])

Female         AxesSubplot(0.1,0.15;0.363636x0.75)
Male      AxesSubplot(0.536364,0.15;0.363636x0.75)
dtype: object

	Gender	FSIQ	VIQ	PIQ	Weight	Height	MRI_Count
1	Female	133	132	124	118.0	64.5	816932
2	Male	140	150	124	NaN	72.5	1001121
3	Male	139	123	150	143.0	73.3	1038437
4	Male	133	129	128	172.0	68.8	965353
5	Female	137	132	134	147.0	65.0	951545

	Weight								Height
	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max
Gender
Female	20.0	137.200000	16.953807	106.0	125.75	138.5	146.25	175.0	20.0	65.765000	2.288248	62.0	64.5	66.0	66.875	70.5
Male	18.0	166.444444	20.047656	132.0	148.75	172.0	180.75	192.0	19.0	71.431579	3.283131	66.3	68.9	70.5	73.750	77.0