Scientific Python Introduction¶

The most important libraries for scientific computing with Python are:

  • numpy (https://numpy.org/) for vectors, matrices and higher dimensional arrays: datastructures. mathematical functions and some basic algorithms for linear algebra and fft.

  • scipy (https://docs.scipy.org/doc/scipy-1.7.1/reference/) collection of data structures and algorithms for numerical mathematics, builds upon numpy:

    • Special functions (scipy.special)
    • Numerical Integration (scipy.integrate)
    • Optimization and curve fitting (scipy.optimize)
    • Interpolation (scipy.interpolate)
    • Fourier Transforms (scipy.fft)
    • Signal Processing (scipy.signal)
    • Linear Algebra (scipy.linalg)
    • Sparse eigenvalue problems with ARPACK
    • Compressed Sparse Graph Routines (scipy.sparse.csgraph)
    • Spatial data structures and algorithms (scipy.spatial)
    • Statistics (scipy.stats)
    • Multidimensional image processing (scipy.ndimage)
    • File IO (scipy.io)
  • matplotlib (https://matplotlib.org/) plotting

  • seaborn: (https://seaborn.pydata.org/) statistical data visualization

  • pandas (https://pandas.pydata.org/): tabular data structures (data frames)

  • statsmodels (https://www.statsmodels.org/stable/index.html): statistics

  • scikit-image (https://scikit-image.org/): image processing

  • scikit-learn (https://scikit-learn.org/stable/): machine learning (no deep neural networks)

  • TensorFlow (https://www.tensorflow.org/) and PyTorch (https://pytorch.org/): deep learning

  • GeoPandas (https://geopandas.org/index.html): data frames for geo spatial data.

  • and many more

These packages don't ship with Python, so you need to install them. In case you are using Anaconda/Miniconda, you need conda install for this, in all other cases you need to use the pip command line utility, most IDEs such as PyCharm also offer this functionality within their user interface.

numpy¶

Arrays¶

!pwd
import numpy as np
/private/var/folders/k8/zfp7dvcs1m326gz1brql1tv80000gn/T/tmp.7gVLKvoQ
  • numpy arrays are used as data containers for vectors, matrices and higher order tensors.
  • all elements of an array are of the same type

We can create arrays from lists:

l = list(range(100))
print(type(l))
print(l)
<class 'list'>
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
a = np.array(l)
print(type(a))
print(a)
<class 'numpy.ndarray'>
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]

Here we create a matrix from a nested list:

# 2d array
b = np.array(
    [
        [1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
    ],
)

print(b)
[[1. 2. 3.]
 [4. 5. 6.]]

The .shape attribute prints the dimensions of the array, and the .dtype the internal data type of the elements:

print(a.shape, a.dtype)
(100,) int64
print(b.shape, b.dtype)
(2, 3) float64

The dtype can be specified during constrution:

b2 = np.array(
    [
        [1, 2, 3],
        [4, 5, 6],
    ],
    dtype=np.float16,
)
print(b2)
print(b2.dtype)
[[1. 2. 3.]
 [4. 5. 6.]]
float16

Accessing elements is done using [...] providing the indices (starting with zero):

# reading
print(b[1, 2])
6.0
# writing
b[1, 2] = 333.0
print(b)
[[  1.   2.   3.]
 [  4.   5. 333.]]

Extracting rows or columns: You can read : as "all values":

# extract column
b[:, 1]
array([2., 5.])
# extract row:
b[1, :]
array([  4.,   5., 333.])

You can add and multiply single values to/with arrays and this is done element-wise:

print(2 * b + 1)
[[  3.   5.   7.]
 [  9.  11. 667.]]

This also works for two arrays of the same shape:

# element wise:
print(b + b)
[[  2.   4.   6.]
 [  8.  10. 666.]]

Be careful: * is element-wise multiplication:

print(b * b)
[[1.00000e+00 4.00000e+00 9.00000e+00]
 [1.60000e+01 2.50000e+01 1.10889e+05]]

Instead we use @ for matrix-vector and matrix-matrix multiplication:

m = np.array(
    [
        [1, 2, 3],
        [2, 3, 4],
        [4, 5, 6],
    ]
)
x = np.array(
    [1, 1, 1],
)
print(m @ x)
[ 6  9 15]
print(m @ m)
[[17 23 29]
 [24 33 42]
 [38 53 68]]

Array creation methods¶

np.arange works like range:

# from 0 to 7 (excluded)
x = np.arange(7)
print(x)
[0 1 2 3 4 5 6]
# from 3 to 7 (excluded)
y = np.arange(3, 7)
print(y)
[3 4 5 6]
# from 7 down to 3 (excluded)
# by using negative step size:
z = np.arange(7, 3, -1)
print(z)
[7 6 5 4]

np.linspace create equi-spaced entries:

# from 1 to 2 (included!), result has 11 points:
print(np.linspace(1, 2, 11))
[1.  1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2. ]

For matrix creation numpy offers:

# 3x4 matrix
np.ones((3, 4))
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])
# 2x3 matrix
np.zeros((2, 3))
array([[0., 0., 0.],
       [0., 0., 0.]])
# 4x4 identity matrix
ii = np.eye(4)
print(ii)
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]

Vectorized functions¶

numpy offers replacements for the functions from the math module which work elementwise on arrays:

np.sin(b)
array([[ 0.84147098,  0.90929743,  0.14112001],
       [-0.7568025 , -0.95892427, -0.00882117]])

Filtering / Fancy indexing¶

print(b)
[[  1.   2.   3.]
 [  4.   5. 333.]]
print(b[b % 2 != 0])
[  1.   3.   5. 333.]

Explanation:

The expression b % 2 != 0 returns an array of same shape with boolean values:

print(b % 2 != 0)
[[ True False  True]
 [False  True  True]]

And this is then passed to b[..] to filter the values. Please be aware that we get a one dimensional array back, since this filtering disrupts the original shape:

print(b[b % 2 != 0])
[  1.   3.   5. 333.]

But we can also use this to overwrite values:

b[b % 2 != 0] = 17.0
print(b)
[[17.  2. 17.]
 [ 4. 17. 17.]]

Broadcasting¶

broadcasting is a stragey in numpy to extend operations between arrays with different dimentions:

a = np.arange(1, 4)
b = 100 * np.eye(3)
print(a.shape, b.shape)
(3,) (3, 3)
a
array([1, 2, 3])
b
array([[100.,   0.,   0.],
       [  0., 100.,   0.],
       [  0.,   0., 100.]])
a + b
array([[101.,   2.,   3.],
       [  1., 102.,   3.],
       [  1.,   2., 103.]])

As you can see, the vector a was added to each row of b.

More details about broadcasting at: https://numpy.org/doc/stable/user/basics.broadcasting.html

Broadcasting can by a bit tricky, but also very powerful.

Sometimes you need to add an extra dimension, e.g. to turn a vector of size n to an n x 1 or 1 x n matrix:

# shape (3,) to (3, 1)
a[:, None].shape
(3, 1)
# now we add column-wise:
a[:, None] + b
array([[101.,   1.,   1.],
       [  2., 102.,   2.],
       [  3.,   3., 103.]])

Example¶

Finding closesest point from a given set of points. We are going to use some functions from numpy which we did not introduce yet.

We create a set of 100 random points in 3d, coordinates are in the range 0 to 100.

points = 100 * np.random.random(size=(100, 3))
print(points.shape)
(100, 3)
print(points[:4])
[[68.52935386 71.11285728 57.15705092]
 [61.74102221 51.73097121 66.62028454]
 [52.58186402 85.71699966 38.49240055]
 [35.89422478 79.29291599 84.74216476]]

We create a random "query point" for which we want to find the closest point in points.

query_point = 100 * np.random.random(size=(3,))
print(query_point)
[50.67751488 53.24989981  3.75965148]

We use broadcasting to compoute the difference vectors of the given query point to all points from points:

pairwise_diff = points - query_point
print(pairwise_diff.shape)
(100, 3)

We square these entries and sum along the horizontal axis (axis=1) which gives us (according to https://en.wikipedia.org/wiki/Pythagoras) the squared distances to the query point:

distances_squared = np.sum(pairwise_diff**2, axis=1)
print(distances_squared.shape)
(100,)

The smallest distances is

min_distance = np.min(distances_squared) ** 0.5
print(min_distance)
11.529689222318954

And this minimum is achieved for the point i from our collection:

i = np.argmin(distances_squared)
print(i)
15

Which is

closest_point = points[i, :]
print(closest_point)
[61.67676727 55.9303434   1.57667126]

Other helpful functions¶

# reshape

numbers = np.arange(24)
matrix = numbers.reshape(4, 6)
print(matrix)
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]]
# we don't need to compute the last dimension, since
# matrix has 24 entries which only leaves 4 for the last
# dimension, the -1 can also be used at other positions:
cube = matrix.reshape(2, 3, -1)
print(cube)
[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]
# transpose a matrix:
print(matrix.T)
[[ 0  6 12 18]
 [ 1  7 13 19]
 [ 2  8 14 20]
 [ 3  9 15 21]
 [ 4 10 16 22]
 [ 5 11 17 23]]
# stack vectors to a matrix:
v1 = np.arange(3)
v2 = v1 + 3
v3 = v1 + 10

v4 = np.hstack((v1, v2, v3))
print(v4)
[ 0  1  2  3  4  5 10 11 12]
m = np.vstack((v1, v2, v3))
print(m)
[[ 0  1  2]
 [ 3  4  5]
 [10 11 12]]
# array to list:
print(m.tolist())
[[0, 1, 2], [3, 4, 5], [10, 11, 12]]
# flatten matrix:
print(m.flatten())
[ 0  1  2  3  4  5 10 11 12]
# create your own vectorized function:


def transform(x):
    if x <= 1:
        return 0
    return x**0.5


# otypes is the return type, else we will cast to the dtype
# of the input array:
vectorized = np.vectorize(transform, otypes=(np.float64,))
print(vectorized(m))
[[0.         0.         1.41421356]
 [1.73205081 2.         2.23606798]
 [3.16227766 3.31662479 3.46410162]]

Pandas¶

Pandas is a package which provides a tabular data structure called DataFrame (similar to the data.frame in R).

  • This is a matrix data structure but where the columns can have different types, where as a numpy matrix must have the same time in all cells.
  • In addition the rows have an index, which per default is an interger but can be an arbitrary data type, e.g. a date.
# import conventin
import pandas as pd


# Inline, sharp plots on jupyter notebooks on Mac:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

data.frame basics¶

!head -n 6 data/brain_size.csv
"";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
"1";"Female";133;132;124;"118";"64.5";816932
"2";"Male";140;150;124;".";"72.5";1001121
"3";"Male";139;123;150;"143";"73.3";1038437
"4";"Male";133;129;128;"172";"68.8";965353
"5";"Female";137;132;134;"147";"65.0";951545
  • The delimiter in the file is a ;
  • NaN is encoded as a . in the .csv file
  • The file has an index column already
df = pd.read_csv("data/brain_size.csv", sep=";", na_values=".", index_col=0)

df.head(5) 
Gender FSIQ VIQ PIQ Weight Height MRI_Count
1 Female 133 132 124 118.0 64.5 816932
2 Male 140 150 124 NaN 72.5 1001121
3 Male 139 123 150 143.0 73.3 1038437
4 Male 133 129 128 172.0 68.8 965353
5 Female 137 132 134 147.0 65.0 951545
print(df.shape)
(40, 7)
print(df.columns)
Index(['Gender', 'FSIQ', 'VIQ', 'PIQ', 'Weight', 'Height', 'MRI_Count'], dtype='object')
# pandas auto-detects types from the file:
print(df.dtypes)
Gender        object
FSIQ           int64
VIQ            int64
PIQ            int64
Weight       float64
Height       float64
MRI_Count      int64
dtype: object
print(df.index)
Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
            18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
            35, 36, 37, 38, 39, 40],
           dtype='int64')
df.describe()
FSIQ VIQ PIQ Weight Height MRI_Count
count 40.000000 40.000000 40.00000 38.000000 39.000000 4.000000e+01
mean 113.450000 112.350000 111.02500 151.052632 68.525641 9.087550e+05
std 24.082071 23.616107 22.47105 23.478509 3.994649 7.228205e+04
min 77.000000 71.000000 72.00000 106.000000 62.000000 7.906190e+05
25% 89.750000 90.000000 88.25000 135.250000 66.000000 8.559185e+05
50% 116.500000 113.000000 115.00000 146.500000 68.000000 9.053990e+05
75% 135.500000 129.750000 128.00000 172.000000 70.500000 9.500780e+05
max 144.000000 150.000000 150.00000 192.000000 77.000000 1.079549e+06

Indexing¶

# integer based indexing:
print(df['Gender'].iloc[:3])
1    Female
2      Male
3      Male
Name: Gender, dtype: object
# access columns using a list of strings:
df[["Gender", "Weight"]].head(5)
Gender Weight
1 Female 118.0
2 Male NaN
3 Male 143.0
4 Male 172.0
5 Female 147.0
# combined indexing
df.iloc[:3][["Gender", "Weight"]]
Gender Weight
1 Female 118.0
2 Male NaN
3 Male 143.0
df[["Gender", "Weight"]].iloc[:3]
Gender Weight
1 Female 118.0
2 Male NaN
3 Male 143.0

Combined:

df.iloc[:3, [0, 4]]
Gender Weight
1 Female 118.0
2 Male NaN
3 Male 143.0
df.loc[:3, ["Gender", "Weight"]]
Gender Weight
1 Female 118.0
2 Male NaN
3 Male 143.0
# Fancy indexing:
df.loc[df.Weight > 180, ["Gender", "Weight"]]
Gender Weight
22 Male 186.0
28 Male 187.0
32 Male 191.0
33 Male 192.0
34 Male 181.0
# pure filteing, all columns, no `.loc` needed here:
df[(df.Weight > 150) & (df.Gender == "Male")]
Gender FSIQ VIQ PIQ Weight Height MRI_Count
4 Male 133 129 128 172.0 68.8 965353
10 Male 133 114 147 172.0 68.8 955466
12 Male 141 150 128 151.0 70.0 1079549
13 Male 135 129 124 155.0 69.0 924059
18 Male 100 96 102 178.0 73.5 945088
20 Male 80 77 86 180.0 70.0 889083
22 Male 97 107 84 186.0 76.5 905940
26 Male 141 145 131 171.0 72.0 935494
28 Male 103 96 110 187.0 77.0 1062462
32 Male 144 145 137 191.0 67.0 949589
33 Male 103 96 110 192.0 75.5 997925
34 Male 90 96 86 181.0 69.0 879987
40 Male 89 91 89 179.0 75.5 935863

Grouping¶

groups = df.groupby("Gender")
for key, group in groups:
    print(key)
    print(group.head(5))
    print()
Female
   Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
1  Female   133  132  124   118.0    64.5     816932
5  Female   137  132  134   147.0    65.0     951545
6  Female    99   90  110   146.0    69.0     928799
7  Female   138  136  131   138.0    64.5     991305
8  Female    92   90   98   175.0    66.0     854258

Male
   Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
2    Male   140  150  124     NaN    72.5    1001121
3    Male   139  123  150   143.0    73.3    1038437
4    Male   133  129  128   172.0    68.8     965353
9    Male    89   93   84   134.0    66.3     904858
10   Male   133  114  147   172.0    68.8     955466

df.groupby("Gender")["Weight"].mean()
Gender
Female    137.200000
Male      166.444444
Name: Weight, dtype: float64

Computing new colums¶

def is_tall(h):
    # 75 inches is ~190 cm 
    return h > 75

df["Is_Large"] = df.Height.apply(is_tall)
print(df.head())
   Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count  Is_Large
1  Female   133  132  124   118.0    64.5     816932     False
2    Male   140  150  124     NaN    72.5    1001121     False
3    Male   139  123  150   143.0    73.3    1038437     False
4    Male   133  129  128   172.0    68.8     965353     False
5  Female   137  132  134   147.0    65.0     951545     False
for key, group in df.groupby(["Gender", "Is_Large"]):
    print(key)
    print(group.describe())
    print()
('Female', False)
             FSIQ         VIQ         PIQ      Weight     Height     MRI_Count
count   20.000000   20.000000   20.000000   20.000000  20.000000      20.00000
mean   111.900000  109.450000  110.450000  137.200000  65.765000  862654.60000
std     23.686327   21.670924   21.946046   16.953807   2.288248   55893.55578
min     77.000000   71.000000   72.000000  106.000000  62.000000  790619.00000
25%     90.250000   90.000000   93.000000  125.750000  64.500000  828062.00000
50%    115.500000  116.000000  115.000000  138.500000  66.000000  855365.00000
75%    133.000000  129.000000  128.750000  146.250000  66.875000  882668.50000
max    140.000000  136.000000  147.000000  175.000000  70.500000  991305.00000

('Male', False)
             FSIQ        VIQ         PIQ      Weight    Height     MRI_Count
count   16.000000   16.00000   16.000000   14.000000  15.00000  1.600000e+01
mean   119.250000  119.68750  114.937500  160.857143  70.18000  9.496824e+05
std     26.185238   26.80726   24.593953   19.174732   2.40125  5.340396e+04
min     80.000000   77.00000   74.000000  132.000000  66.30000  8.799870e+05
25%     89.750000   95.25000   86.000000  145.000000  68.80000  9.192588e+05
50%    134.000000  126.00000  124.000000  163.000000  70.00000  9.472415e+05
75%    140.000000  145.00000  128.750000  176.500000  72.25000  9.579378e+05
max    144.000000  150.00000  150.000000  191.000000  74.00000  1.079549e+06

('Male', True)
            FSIQ         VIQ         PIQ      Weight  Height     MRI_Count
count    4.00000    4.000000    4.000000    4.000000   4.000  4.000000e+00
mean    98.00000   97.500000   98.250000  186.000000  76.125  9.755475e+05
std      6.63325    6.757712   13.720423    5.354126   0.750  6.946209e+04
min     89.00000   91.000000   84.000000  179.000000  75.500  9.059400e+05
25%     95.00000   94.750000   87.750000  184.250000  75.500  9.283822e+05
50%    100.00000   96.000000   99.500000  186.500000  76.000  9.668940e+05
75%    103.00000   98.750000  110.000000  188.250000  76.625  1.014059e+06
max    103.00000  107.000000  110.000000  192.000000  77.000  1.062462e+06

pd.pivot_table(df, index=["Gender", "Is_Large"])
FSIQ Height MRI_Count PIQ VIQ Weight
Gender Is_Large
Female False 111.90 65.765 862654.600 110.4500 109.4500 137.200000
Male False 119.25 70.180 949682.375 114.9375 119.6875 160.857143
True 98.00 76.125 975547.500 98.2500 97.5000 186.000000

Plotting¶

Just a simple example, more in the upcoming workshop!

df.boxplot(column=["Weight", "Height"])
<AxesSubplot: >
df.groupby("Gender").boxplot(column=["Weight", "Height"])
Female         AxesSubplot(0.1,0.15;0.363636x0.75)
Male      AxesSubplot(0.536364,0.15;0.363636x0.75)
dtype: object

Performance of pandas¶

  • Pandas is fast and optimized for column-wise operations
  • Row wise operations, such as iterating over rows and appending new rows are slow!!!!

Not covered¶

  • E.g. merge, join, concat,
  • time series data, r
  • reshaping
  • working with missing data,
  • Read / write database tables with pandas in different formats: pandas.read_sql and DataFrame.to_sql, pandas.read_excel, DataFrame.to_excel, hdf5, writing LaTeX.
  • A lot more , see https://pandas.pydata.org/docs/user_guide/index.html

Solving the goat problem with scipy*¶

This is an example demonstrationg scipy (https://docs.scipy.org/doc/scipy/) to solve https://en.wikipedia.org/wiki/Goat_problem#Interior_grazing_problem

Let $P$ be the center of a unit circle. A goat/bull/horse is tethered at point $Q$ on the circumference. How long does the rope $r$ need to be to allow the animal to graze on exactly one half of the circle's area?

The $r$ we are looking for fulfils:

import numpy as np


def f(r):
    return (
        r**2 * np.arccos(r / 2)
        + np.arccos(1 - 0.5 * r**2)
        - 0.5 * r * np.sqrt(4 - r**2)
        - np.pi / 2
    )
import scipy.optimize
# find zero of f(r) starting with guess r=0.5:
scipy.optimize.fsolve(f, 0.5)
array([1.15872847])
from matplotlib import pyplot as plt
r = np.linspace(0, 2, 100)
print(r.min(), r.max(), r.shape)
0.0 2.0 (100,)
plt.plot(r, f(r))
plt.plot([0, 2], [0, 0], ":")
plt.grid()

Alternative approach via integration:

# adaptive numerical integration:
from scipy.integrate import quad
def integrand(t, r):
    return np.sqrt(r**2 - t**2) + np.sqrt(1 - t**2) - 1


def f(r):
    integral, error_estimate = quad(
        integrand, 0, np.sqrt(r**2 - r**4 / 4), args=(r,)
    )
    return integral - np.pi / 4
# find zero of f(r) starting with guess r=0.5:
scipy.optimize.fsolve(f, 0.5)
array([1.15872847])

See also https://scipy-lectures.org/ and https://jakevdp.github.io/PythonDataScienceHandbook/

https://www.oreilly.com/library/view/introduction-to-machine/9781449369880/