Python eco system for data analysis

  • numpy: provides data containers for linear algebra, eg vectors and matrices + basic linear algebra algorithms
  • matplotlib: provides plotting routines
  • scipy: common algorithms on vectors and matrices, eg statistics and routines from numerical analysis
  • so called scikits for special topics, eg scikit.learn for machine learning, others for time series analysis, image analysis, statistics, etc
  • pandas tries to mimic the R data.frame type. In contrast to numpy containers, every column has its own type. http://pandas.pydata.org/pandas-docs/stable/10min.html#min

Before we start:

  1. check if import numpy, import scipy and import matplotlib work on your machine.
  2. if this works: fine, skip the next steps !
  3. Else read http://www.scipy.org/install.html

Introduction to numpy

numpy: Python module with efficient data structures for vectors and matrices.

If you are not familar with vectors: like a list of floating point or integer numbers, all having the same type.

Before we start we import numpy and give it a shorter name:

import numpy as np
print(np.__version__)
1.11.0

How to create a vector

One way to create a vector is to construct it from a given Python list. For example we use the list [1, 2, 3, 4] for creating a vector using the array function from numpy:

a = np.array([1, 2, 3, 4])
print(a)
[1 2 3 4]

This looks like a list without commata.

To see what common type the elements have:

print(a.dtype)
int64

So this is a vector with 64 bit integers.

But... in most / all cases I need a vector with floating point values !!

Either start with a list of floating point values:

a = np.array([1.0, 2.0, 3.0, 4.0])
print(a)
print(a.dtype)
[ 1.  2.  3.  4.]
float64

Or instruct array as follows:

a = np.array([1, 2, 3, 4], dtype=float)
print(a)
print(a.dtype)
[ 1.  2.  3.  4.]
float64

Often you need an equispaced vector, so to create a vector of 10 elements starting at 1.0, ending with 3.0 the linspace function will help us:

x = np.linspace(1.0, 3.0, 10)
print(x)
[ 1.          1.22222222  1.44444444  1.66666667  1.88888889  2.11111111
  2.33333333  2.55555556  2.77777778  3.        ]
print(x.dtype)
float64

If you have the starting value, the end value and the distance between two values of the vector arange does the job:

x = np.arange(2, 4, .5)  # upper limit is exclusive !!!
print(x)
[ 2.   2.5  3.   3.5]

ufuncs

A so called ufunc (this is numpy slang for "universal function") works on a vector in an element-by-element fashion. That is it creates a new vector from a given one by transforming every element of the input vector.

This is shorter code than using lists, and much faster for large vectors:

print(x)
y = np.sin(x)
print(y)
[ 2.   2.5  3.   3.5]
[ 0.90929743  0.59847214  0.14112001 -0.35078323]

Using the already introduced math module does not work here:

import math
math.sin(y)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-37fd51d5d58d> in <module>()
      1 import math
----> 2 math.sin(y)

TypeError: only length-1 arrays can be converted to Python scalars

Other ufunc functions: np.exp etc, actually everything you find in math module.

Vector by vector operations

Operations as * involving two vectors are element wise too:

print(x * x)
[  4.     6.25   9.    12.25]
print(x / x)
[ 1.  1.  1.  1.]
print(x ** 3)
[  8.     15.625  27.     42.875]

Broadcasted operations

So called broadcasted operations: One arg is a traditional float, the other one is a numpy array. Operations again are elementwise:

print(3 * x)
[  6.    7.5   9.   10.5]
x -= 0.1
print(x)
[ 1.9  2.4  2.9  3.4]

Aggregation functions

Aggregation functions map a vector to a single number:

print(x)
print(np.min(x), np.max(x))
[ 1.9  2.4  2.9  3.4]
1.9 3.4
print(np.mean(x), np.std(x), np.std(x, ddof=1))
2.65 0.559016994375 0.645497224368
np.sum(x), np.prod(x)
(10.6, 44.96159999999999)

Vector length:

np.linalg.norm(x)
5.4166410255803363

dot product, other names "scalar" or "inner" product:

np.dot(x, y)
2.3805833060471979

Accessing vector elements

Before we start, we print x again, so it is easier to understand the output of the following examples:

print(x)
[ 1.9  2.4  2.9  3.4]

Slicing works as usual:

y = x[1: -1]
print(type(y))
print(y)
<class 'numpy.ndarray'>
[ 2.4  2.9]

Select values at given positions:

print(x[[0, 2, 3]])
[ 1.9  2.9  3.4]

Accessing elements based on logical conditions

Select values with a certain property, eg values which are larger than 2.0:

print(x[x > 2.0])  # read this as "x where x is greater than 2.0"
[ 2.4  2.9  3.4]

How does this work ?

First observation: we can index a numpy array with another numpy array of same size holding boolean values

idx = np.array([True, False, True, False])
print(x[idx])
[ 1.9  2.9]

Second observation: x > 2.0 delivers such an boolean array:

print(x > 2.0)
[False  True  True  True]

This syntax can be used on the left side of = for assigning values:

print(x)
x[x > 2.0] = -1
print(x)
[ 1.9  2.4  2.9  3.4]
[ 1.9 -1.  -1.  -1. ]

Exercises:

  • repeat the examples

What about matrices ?

We handle some matrix stuff at the end of the script....

Introduction to plotting with matplotlib

# IGNORE THE LINE BELOW, IT IS ONLY FOR CREATING THIS SCRIPT!
%matplotlib inline
x = np.linspace(0.0, 2 * np.pi, 20)
y = np.sin(x)
import pylab
pylab.plot(x, y)
pylab.show()

Blue is the standard color. To change this use:

pylab.plot(x, y, "green")
pylab.show()

To plot dots, eg red dots:

pylab.plot(x, y, "r.")
pylab.show()

Overlay multiple plots

pylab.plot(x, y)
pylab.plot(x, y, "ro")   # "o" means: big dots
pylab.plot(x, np.cos(x), "green")
pylab.show()

Adding grids

pylab.plot(x, y)
pylab.grid(True)
pylab.show()

Adding labels to the plot

pylab.plot(x, y, label="wave")
pylab.plot(x, np.cos(x), label="shifted wave")
pylab.grid(True)
pylab.xlabel("time")
pylab.ylabel("amplitude")
pylab.title("sine wave over time")
pylab.legend()  # activates the legent in the upper corner
pylab.show()

Arranging multiple plots

# 1 row, 2 columns ,first plot:
pylab.subplot(1, 2, 1) 
pylab.plot(x, y, label="sin")
pylab.legend()
# 1 row, 2 columns, second plot:
pylab.subplot(1, 2, 2)
pylab.plot(x, np.cos(x), "green")
pylab.grid(True)
pylab.show()

How to store a plot to the file system

pylab.plot(x, y)
pylab.savefig("sin.png")
!ls -l sin.png
-rw-r--r--  1 uweschmitt  staff  9905 27 Jun 20:23 sin.png

Some exampels for other types of plots

N = 50
x = np.random.rand(N)     # N randoms with uniform distribution in range 0 .. 1
y = np.random.rand(N)     # dito

radiuses = 15 * np.random.rand(N)  # N random uniform distributed radiuses in range 0 .. 15
areas = np.pi * radiuses ** 2      # you see numpy here ?
colors = np.random.rand(N)         # and N random colors

pylab.scatter(x, y, s=areas, c=colors, alpha=0.5)
pylab.show()
N = 2500
values_1 = 0.5 * np.random.randn(N)  # 500 normal distributes
values_2 = 0.2 + np.tan(0.2 * np.random.randn(N)) # some strange distribution

# alpha is transparency, "stipfilled" ommits the lines about the single rectangles:
pylab.hist(values_1, bins=50, color="green", alpha=0.3, histtype="stepfilled")
pylab.hist(values_2, bins=50, color="red", alpha=0.6)
pylab.show()

Thats not all folks, many more examples at:

http://matplotlib.org/gallery.html

Beyond matplotlib, there are some new libraries, see:

http://ggplot.yhathq.com/

Excercises:

  • repeat examples

Introduction to scipy

scipy provides stable and high quality algorithms for lots of math related tasks.

For example

  • special functions
  • interpolation 1d, 2d, splines, ...
  • optimization, eg curve fitting
  • statistics
  • ...

overview: http://docs.scipy.org/doc/scipy/reference/tutorial/index.html

Numerical integration

scipy.integrate.quad takes the function to integrate and lower and upper integration limits:

import scipy.integrate
area, error_estimate = scipy.integrate.quad(np.sin, 0.0,  np.pi)
print(area)
print(error_estimate)
2.0
2.220446049250313e-14

An example with a more complicated function:

def strange(x):
    return np.sin(np.cos(x) + 1.0)

print(scipy.integrate.quad(strange, 0.0, 1.0))
(0.9542771279840992, 1.0594604393611143e-14)

Curve fitting

We demonstrate how to fit $$f(x) = a \sin(x) + b \cos(c + x)$$ to given data:

def function(x, a, b, c):
    """x is a vector holding the x values for evaluation
    a, b and c are parameters of the function
    """
    return a * np.sin(2 * x) + b * np.cos (c + x)

Generate test data:

a = 1.0
b = 2.0
c = 0.1
x = np.linspace(0.0, 6.0, 40)

# artificial y(x) for given parameters:
y = function(x, a, b, c)

# we add normal distributed "noise":
y_measured = y + 0.5 * np.random.randn(len(x))
pylab.plot(x, y)
pylab.plot(x, y_measured, "o")
pylab.show()

Now we want to calculate estimates for a, b and c from x and y_measured:

import scipy.optimize
estimated, __ = scipy.optimize.curve_fit(function, x, y_measured)
print(a, b, c)
print(estimated)
1.0 2.0 0.1
[ 1.09731765  2.04016658  0.03714652]
a_est, b_est, c_est = estimated
y_estimated = function(x, a_est, b_est, c_est)
pylab.plot(x, y, label="y")
pylab.plot(x, y_estimated, label="y_estimated")
pylab.plot(x, y_measured, "o")
pylab.legend()
pylab.show()

Exercise

Matrices and linear algebra with numpy

To create a matrix you can start from a nested list, similar to the way we created vectors in the beginning of the script:

m = np.array([[1, 2, 3], [2, 3, 4]], dtype=float)
print(m)
[[ 1.  2.  3.]
 [ 2.  3.  4.]]

m has two rows and three columns:

print(m.shape)
(2, 3)

For special matrices we have functions eye, ones and zeros:

print(np.eye(3))    # identity matrix
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
print(np.ones((3, 2), dtype=float))
[[ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]]
print(np.zeros((2, 3), dtype=float))
[[ 0.  0.  0.]
 [ 0.  0.  0.]]

ufuncs work again:

print(np.sin(m))
[[ 0.84147098  0.90929743  0.14112001]
 [ 0.90929743  0.14112001 -0.7568025 ]]

+ and * too:

print(m + 1)
[[ 2.  3.  4.]
 [ 3.  4.  5.]]
print(2 * m)
[[ 2.  4.  6.]
 [ 4.  6.  8.]]
print(m + m)
[[ 2.  4.  6.]
 [ 4.  6.  8.]]

* is element-by-element-, not matrix-matrix-multiplication:

print(m * m)
[[  1.   4.   9.]
 [  4.   9.  16.]]

.T transposes:

print(m.T)
[[ 1.  2.]
 [ 2.  3.]
 [ 3.  4.]]

The dimentions of m and m.T fit for matrix-matrix multplication. As said * does not work, we have to use dot function:

print(np.dot(m, m.T))
[[ 14.  20.]
 [ 20.  29.]]

Python 3.5 introduced the new operator @ solely for the purpose of simplifying matrix multiplication in numpy code:

print(m @ m.T)
[[ 14.  20.]
 [ 20.  29.]]

This notation is much more readoble for longer expressions and very similar to the mathematical notation:

print(m @ m.T @ m @ m.T)
[[  596.   860.]
 [  860.  1241.]]
np.dot(m, np.dot(m.T, np.dot(m, m.T)))
array([[  596.,   860.],
       [  860.,  1241.]])

Solving linear equations

To solve an equation system

$ x + 2 y + 3 z = 14 $

$ x + y = 3$

$ y + 2 z = 8 $

we start with the coefficient matrix

A = np.array( [[1, 2, 3], [1, 1, 0], [0, 1, 2]], dtype=float)
print(A)
[[ 1.  2.  3.]
 [ 1.  1.  0.]
 [ 0.  1.  2.]]

The right hand side is

y = np.array([14, 3, 8], dtype=float)

And the solution is computed using linalg.solve from numpy:

x = np.linalg.solve(A, y)
print(x)
[ 1.  2.  3.]

To check our result, we "insert x into the equation" and compare it toy`:

print(np.dot(A, x) - y)
[ 0.  0.  0.]
print(A @ x - y)
[ 0.  0.  0.]
# THE LINES BELOW ARE JUST FOR FORMATTING THE INSTRUCTIONS ABOVE !
from IPython import utils
from IPython.core.display import HTML
import os
def css_styling():
    """Load default custom.css file from ipython profile"""
    base = utils.path.get_ipython_dir()
    styles = """<style>
    
    @import url('http://fonts.googleapis.com/css?family=Source+Code+Pro');
    
    @import url('http://fonts.googleapis.com/css?family=Kameron');
    @import url('http://fonts.googleapis.com/css?family=Crimson+Text');
    
    @import url('http://fonts.googleapis.com/css?family=Lato');
    @import url('http://fonts.googleapis.com/css?family=Source+Sans+Pro');
    
    @import url('http://fonts.googleapis.com/css?family=Lora'); 

    
    body {
        font-family: 'Lora', Consolas, sans-serif;
      
    }
    .rendered_html code
    {
        color: black;
        background: #eaf0ff;
        padding: 1pt;
        font-family:  'Source Code Pro', Consolas, monocco, monospace;
    }
    
    .CodeMirror pre {
    font-family: 'Source Code Pro', monocco, Consolas, monocco, monospace;
    }
    
    .cm-s-ipython span.cm-keyword {
        font-weight: normal;
     }
     
     strong {
         background: #ffe7e7;
         padding: 1pt;
     }
     
    
    div #notebook {
        # font-size: 10pt; 
        line-height: 145%;
        }
        
    li {
        line-heigt: 145%;
    }

    div.output_area pre {
        background: #fffdf0;
        padding: 3pt;
    }
    
    h1, h2, h3, h4 {
        font-family: Kameron, arial;
    }
    
    div#maintoolbar {display: none !important;}
    </style>"""
    return HTML(styles)
css_styling()
/Users/uweschmitt/Projects/python3-course/lib/python3.5/site-packages/IPython/utils/path.py:258: UserWarning: get_ipython_dir has moved to the IPython.paths module
  warn("get_ipython_dir has moved to the IPython.paths module")