Python eco system for data analysis¶

numpy: provides data containers for linear algebra, eg vectors and matrices + basic linear algebra algorithms
matplotlib: provides plotting routines
scipy: common algorithms on vectors and matrices, eg statistics and routines from numerical analysis
so called scikits for special topics, eg scikit.learn for machine learning, others for time series analysis, image analysis, statistics, etc
pandas tries to mimic the R data.frame type. In contrast to numpy containers, every column has its own type. http://pandas.pydata.org/pandas-docs/stable/10min.html#min

Before we start:

check if import numpy, import scipy and import matplotlib work on your machine.
if this works: fine, skip the next steps !
Else read http://www.scipy.org/install.html

Introduction to numpy¶

numpy: Python module with efficient data structures for vectors and matrices.

If you are not familar with vectors: like a list of floating point or integer numbers, all having the same type.

Before we start we import numpy and give it a shorter name:

import numpy as np
print(np.__version__)

1.11.0

How to create a vector¶

One way to create a vector is to construct it from a given Python list. For example we use the list [1, 2, 3, 4] for creating a vector using the array function from numpy:

a = np.array([1, 2, 3, 4])
print(a)

[1 2 3 4]

This looks like a list without commata.

To see what common type the elements have:

print(a.dtype)

int64

So this is a vector with 64 bit integers.

But... in most / all cases I need a vector with floating point values !!

Either start with a list of floating point values:

a = np.array([1.0, 2.0, 3.0, 4.0])
print(a)
print(a.dtype)

[ 1.  2.  3.  4.]
float64

Or instruct array as follows:

a = np.array([1, 2, 3, 4], dtype=float)
print(a)
print(a.dtype)

[ 1.  2.  3.  4.]
float64

Often you need an equispaced vector, so to create a vector of 10 elements starting at 1.0, ending with 3.0 the linspace function will help us:

x = np.linspace(1.0, 3.0, 10)
print(x)

[ 1.          1.22222222  1.44444444  1.66666667  1.88888889  2.11111111
  2.33333333  2.55555556  2.77777778  3.        ]

print(x.dtype)

float64

If you have the starting value, the end value and the distance between two values of the vector arange does the job:

x = np.arange(2, 4, .5)  # upper limit is exclusive !!!
print(x)

[ 2.   2.5  3.   3.5]

ufuncs¶

A so called ufunc (this is numpy slang for "universal function") works on a vector in an element-by-element fashion. That is it creates a new vector from a given one by transforming every element of the input vector.

This is shorter code than using lists, and much faster for large vectors:

print(x)
y = np.sin(x)
print(y)

[ 2.   2.5  3.   3.5]
[ 0.90929743  0.59847214  0.14112001 -0.35078323]

Using the already introduced math module does not work here:

import math
math.sin(y)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-37fd51d5d58d> in <module>()
      1 import math
----> 2 math.sin(y)

TypeError: only length-1 arrays can be converted to Python scalars

Other ufunc functions: np.exp etc, actually everything you find in math module.

Vector by vector operations¶

Operations as * involving two vectors are element wise too:

print(x * x)

[  4.     6.25   9.    12.25]

print(x / x)

[ 1.  1.  1.  1.]

print(x ** 3)

[  8.     15.625  27.     42.875]

Broadcasted operations¶

So called broadcasted operations: One arg is a traditional float, the other one is a numpy array. Operations again are elementwise:

print(3 * x)

[  6.    7.5   9.   10.5]

x -= 0.1
print(x)

[ 1.9  2.4  2.9  3.4]

Aggregation functions¶

Aggregation functions map a vector to a single number:

print(x)
print(np.min(x), np.max(x))

[ 1.9  2.4  2.9  3.4]
1.9 3.4

print(np.mean(x), np.std(x), np.std(x, ddof=1))

2.65 0.559016994375 0.645497224368

What is this ddof parameter ? See http://stackoverflow.com/questions/7482205/precision-why-do-matlab-and-python-numpy-give-so-different-outputs/7482413#7482413

np.sum(x), np.prod(x)

(10.6, 44.96159999999999)

Vector length:

np.linalg.norm(x)

5.4166410255803363

dot product, other names "scalar" or "inner" product:

np.dot(x, y)

2.3805833060471979

Accessing vector elements¶

Before we start, we print x again, so it is easier to understand the output of the following examples:

print(x)

[ 1.9  2.4  2.9  3.4]

Slicing works as usual:

y = x[1: -1]
print(type(y))
print(y)

<class 'numpy.ndarray'>
[ 2.4  2.9]

Select values at given positions:

print(x[[0, 2, 3]])

[ 1.9  2.9  3.4]

Accessing elements based on logical conditions¶

Select values with a certain property, eg values which are larger than 2.0:

print(x[x > 2.0])  # read this as "x where x is greater than 2.0"

[ 2.4  2.9  3.4]

How does this work ?

First observation: we can index a numpy array with another numpy array of same size holding boolean values

idx = np.array([True, False, True, False])
print(x[idx])

[ 1.9  2.9]

Second observation: x > 2.0 delivers such an boolean array:

print(x > 2.0)

[False  True  True  True]

This syntax can be used on the left side of = for assigning values:

print(x)
x[x > 2.0] = -1
print(x)

[ 1.9  2.4  2.9  3.4]
[ 1.9 -1.  -1.  -1. ]

Exercises:¶

repeat the examples

What about matrices ?¶

We handle some matrix stuff at the end of the script....

Introduction to plotting with matplotlib¶

# IGNORE THE LINE BELOW, IT IS ONLY FOR CREATING THIS SCRIPT!
%matplotlib inline

x = np.linspace(0.0, 2 * np.pi, 20)
y = np.sin(x)

import pylab
pylab.plot(x, y)
pylab.show()

Blue is the standard color. To change this use:

pylab.plot(x, y, "green")
pylab.show()

To plot dots, eg red dots:

pylab.plot(x, y, "r.")
pylab.show()

Overlay multiple plots¶

pylab.plot(x, y)
pylab.plot(x, y, "ro")   # "o" means: big dots
pylab.plot(x, np.cos(x), "green")
pylab.show()

Adding grids¶

pylab.plot(x, y)
pylab.grid(True)
pylab.show()

Adding labels to the plot¶

pylab.plot(x, y, label="wave")
pylab.plot(x, np.cos(x), label="shifted wave")
pylab.grid(True)
pylab.xlabel("time")
pylab.ylabel("amplitude")
pylab.title("sine wave over time")
pylab.legend()  # activates the legent in the upper corner
pylab.show()

Arranging multiple plots¶

# 1 row, 2 columns ,first plot:
pylab.subplot(1, 2, 1) 
pylab.plot(x, y, label="sin")
pylab.legend()
# 1 row, 2 columns, second plot:
pylab.subplot(1, 2, 2)
pylab.plot(x, np.cos(x), "green")
pylab.grid(True)
pylab.show()

How to store a plot to the file system¶

pylab.plot(x, y)
pylab.savefig("sin.png")
!ls -l sin.png

-rw-r--r--  1 uweschmitt  staff  9905 27 Jun 20:23 sin.png

Some exampels for other types of plots¶

N = 50
x = np.random.rand(N)     # N randoms with uniform distribution in range 0 .. 1
y = np.random.rand(N)     # dito

radiuses = 15 * np.random.rand(N)  # N random uniform distributed radiuses in range 0 .. 15
areas = np.pi * radiuses ** 2      # you see numpy here ?
colors = np.random.rand(N)         # and N random colors

pylab.scatter(x, y, s=areas, c=colors, alpha=0.5)
pylab.show()

N = 2500
values_1 = 0.5 * np.random.randn(N)  # 500 normal distributes
values_2 = 0.2 + np.tan(0.2 * np.random.randn(N)) # some strange distribution

# alpha is transparency, "stipfilled" ommits the lines about the single rectangles:
pylab.hist(values_1, bins=50, color="green", alpha=0.3, histtype="stepfilled")
pylab.hist(values_2, bins=50, color="red", alpha=0.6)
pylab.show()

Thats not all folks, many more examples at:

http://matplotlib.org/gallery.html ¶

Beyond matplotlib, there are some new libraries, see:

http://ggplot.yhathq.com/

Excercises:¶

repeat examples

Introduction to scipy¶

scipy provides stable and high quality algorithms for lots of math related tasks.

For example

special functions
interpolation 1d, 2d, splines, ...
optimization, eg curve fitting
statistics
...

overview: http://docs.scipy.org/doc/scipy/reference/tutorial/index.html

Numerical integration¶

scipy.integrate.quad takes the function to integrate and lower and upper integration limits:

import scipy.integrate
area, error_estimate = scipy.integrate.quad(np.sin, 0.0,  np.pi)
print(area)
print(error_estimate)

2.0
2.220446049250313e-14

An example with a more complicated function:

def strange(x):
    return np.sin(np.cos(x) + 1.0)

print(scipy.integrate.quad(strange, 0.0, 1.0))

(0.9542771279840992, 1.0594604393611143e-14)

Curve fitting¶

We demonstrate how to fit $$f(x) = a \sin(x) + b \cos(c + x)$$ to given data:

def function(x, a, b, c):
    """x is a vector holding the x values for evaluation
    a, b and c are parameters of the function
    """
    return a * np.sin(2 * x) + b * np.cos (c + x)

Generate test data:

a = 1.0
b = 2.0
c = 0.1
x = np.linspace(0.0, 6.0, 40)

# artificial y(x) for given parameters:
y = function(x, a, b, c)

# we add normal distributed "noise":
y_measured = y + 0.5 * np.random.randn(len(x))

pylab.plot(x, y)
pylab.plot(x, y_measured, "o")
pylab.show()

Now we want to calculate estimates for a, b and c from x and y_measured:

import scipy.optimize
estimated, __ = scipy.optimize.curve_fit(function, x, y_measured)

print(a, b, c)
print(estimated)

1.0 2.0 0.1
[ 1.09731765  2.04016658  0.03714652]

a_est, b_est, c_est = estimated
y_estimated = function(x, a_est, b_est, c_est)
pylab.plot(x, y, label="y")
pylab.plot(x, y_estimated, label="y_estimated")
pylab.plot(x, y_measured, "o")
pylab.legend()
pylab.show()

Exercise¶

Repeat examples, play with values for a, b, c.
lookup https://en.wikipedia.org/wiki/Gompertz_function#Growth_of_tumors
play with this function: plot some curves for some parameters
create an artifical "noisy" data set and fit the parameters

Matrices and linear algebra with `numpy`¶

To create a matrix you can start from a nested list, similar to the way we created vectors in the beginning of the script:

m = np.array([[1, 2, 3], [2, 3, 4]], dtype=float)
print(m)

[[ 1.  2.  3.]
 [ 2.  3.  4.]]

m has two rows and three columns:

print(m.shape)

(2, 3)

For special matrices we have functions eye, ones and zeros:

print(np.eye(3))    # identity matrix

[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]

print(np.ones((3, 2), dtype=float))

[[ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]]

print(np.zeros((2, 3), dtype=float))

[[ 0.  0.  0.]
 [ 0.  0.  0.]]

ufuncs work again:

print(np.sin(m))

[[ 0.84147098  0.90929743  0.14112001]
 [ 0.90929743  0.14112001 -0.7568025 ]]

+ and * too:

print(m + 1)

[[ 2.  3.  4.]
 [ 3.  4.  5.]]

print(2 * m)

[[ 2.  4.  6.]
 [ 4.  6.  8.]]

print(m + m)

[[ 2.  4.  6.]
 [ 4.  6.  8.]]

* is element-by-element-, not matrix-matrix-multiplication:

print(m * m)

[[  1.   4.   9.]
 [  4.   9.  16.]]

.T transposes:

print(m.T)

[[ 1.  2.]
 [ 2.  3.]
 [ 3.  4.]]

The dimentions of m and m.T fit for matrix-matrix multplication. As said * does not work, we have to use dot function:

print(np.dot(m, m.T))

[[ 14.  20.]
 [ 20.  29.]]

Python 3.5 introduced the new operator @ solely for the purpose of simplifying matrix multiplication in numpy code:

print(m @ m.T)

[[ 14.  20.]
 [ 20.  29.]]

This notation is much more readoble for longer expressions and very similar to the mathematical notation:

print(m @ m.T @ m @ m.T)

[[  596.   860.]
 [  860.  1241.]]

np.dot(m, np.dot(m.T, np.dot(m, m.T)))

array([[  596.,   860.],
       [  860.,  1241.]])

Solving linear equations¶

To solve an equation system

$ x + 2 y + 3 z = 14 $

$ x + y = 3$

$ y + 2 z = 8 $

we start with the coefficient matrix

A = np.array( [[1, 2, 3], [1, 1, 0], [0, 1, 2]], dtype=float)
print(A)

[[ 1.  2.  3.]
 [ 1.  1.  0.]
 [ 0.  1.  2.]]

The right hand side is

y = np.array([14, 3, 8], dtype=float)

And the solution is computed using linalg.solve from numpy:

x = np.linalg.solve(A, y)
print(x)

[ 1.  2.  3.]

To check our result, we "insert x into the equation" and compare it toy`:

print(np.dot(A, x) - y)

[ 0.  0.  0.]

print(A @ x - y)

[ 0.  0.  0.]

# THE LINES BELOW ARE JUST FOR FORMATTING THE INSTRUCTIONS ABOVE !
from IPython import utils
from IPython.core.display import HTML
import os
def css_styling():
    """Load default custom.css file from ipython profile"""
    base = utils.path.get_ipython_dir()
    styles = """<style>
    
    @import url('http://fonts.googleapis.com/css?family=Source+Code+Pro');
    
    @import url('http://fonts.googleapis.com/css?family=Kameron');
    @import url('http://fonts.googleapis.com/css?family=Crimson+Text');
    
    @import url('http://fonts.googleapis.com/css?family=Lato');
    @import url('http://fonts.googleapis.com/css?family=Source+Sans+Pro');
    
    @import url('http://fonts.googleapis.com/css?family=Lora'); 

    
    body {
        font-family: 'Lora', Consolas, sans-serif;
      
    }
    .rendered_html code
    {
        color: black;
        background: #eaf0ff;
        padding: 1pt;
        font-family:  'Source Code Pro', Consolas, monocco, monospace;
    }
    
    .CodeMirror pre {
    font-family: 'Source Code Pro', monocco, Consolas, monocco, monospace;
    }
    
    .cm-s-ipython span.cm-keyword {
        font-weight: normal;
     }
     
     strong {
         background: #ffe7e7;
         padding: 1pt;
     }
     
    
    div #notebook {
        # font-size: 10pt; 
        line-height: 145%;
        }
        
    li {
        line-heigt: 145%;
    }

    div.output_area pre {
        background: #fffdf0;
        padding: 3pt;
    }
    
    h1, h2, h3, h4 {
        font-family: Kameron, arial;
    }
    
    div#maintoolbar {display: none !important;}
    </style>"""
    return HTML(styles)
css_styling()

/Users/uweschmitt/Projects/python3-course/lib/python3.5/site-packages/IPython/utils/path.py:258: UserWarning: get_ipython_dir has moved to the IPython.paths module
  warn("get_ipython_dir has moved to the IPython.paths module")