# Python eco system for data analysis¶

• numpy: provides data containers for linear algebra, eg vectors and matrices + basic linear algebra algorithms
• matplotlib: provides plotting routines
• scipy: common algorithms on vectors and matrices, eg statistics and routines from numerical analysis
• so called scikits for special topics, eg scikit.learn for machine learning, others for time series analysis, image analysis, statistics, etc
• pandas tries to mimic the R data.frame type. In contrast to numpy containers, every column has its own type. http://pandas.pydata.org/pandas-docs/stable/10min.html#min

Before we start:

1. check if import numpy, import scipy and import matplotlib work on your machine.
2. if this works: fine, skip the next steps !
3. Else read http://www.scipy.org/install.html

# Introduction to numpy¶

numpy: Python module with efficient data structures for vectors and matrices.

If you are not familar with vectors: like a list of floating point or integer numbers, all having the same type.

Before we start we import numpy and give it a shorter name:

import numpy as np
print(np.__version__)

1.11.0


### How to create a vector¶

One way to create a vector is to construct it from a given Python list. For example we use the list [1, 2, 3, 4] for creating a vector using the array function from numpy:

a = np.array([1, 2, 3, 4])
print(a)

[1 2 3 4]


This looks like a list without commata.

To see what common type the elements have:

print(a.dtype)

int64


So this is a vector with 64 bit integers.

But... in most / all cases I need a vector with floating point values !!

Either start with a list of floating point values:

a = np.array([1.0, 2.0, 3.0, 4.0])
print(a)
print(a.dtype)

[ 1.  2.  3.  4.]
float64


Or instruct array as follows:

a = np.array([1, 2, 3, 4], dtype=float)
print(a)
print(a.dtype)

[ 1.  2.  3.  4.]
float64


Often you need an equispaced vector, so to create a vector of 10 elements starting at 1.0, ending with 3.0 the linspace function will help us:

x = np.linspace(1.0, 3.0, 10)
print(x)

[ 1.          1.22222222  1.44444444  1.66666667  1.88888889  2.11111111
2.33333333  2.55555556  2.77777778  3.        ]

print(x.dtype)

float64


If you have the starting value, the end value and the distance between two values of the vector arange does the job:

x = np.arange(2, 4, .5)  # upper limit is exclusive !!!
print(x)

[ 2.   2.5  3.   3.5]


### ufuncs¶

A so called ufunc (this is numpy slang for "universal function") works on a vector in an element-by-element fashion. That is it creates a new vector from a given one by transforming every element of the input vector.

This is shorter code than using lists, and much faster for large vectors:

print(x)
y = np.sin(x)
print(y)

[ 2.   2.5  3.   3.5]
[ 0.90929743  0.59847214  0.14112001 -0.35078323]


Using the already introduced math module does not work here:

import math
math.sin(y)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-37fd51d5d58d> in <module>()
1 import math
----> 2 math.sin(y)

TypeError: only length-1 arrays can be converted to Python scalars

Other ufunc functions: np.exp etc, actually everything you find in math module.

### Vector by vector operations¶

Operations as * involving two vectors are element wise too:

print(x * x)

[  4.     6.25   9.    12.25]

print(x / x)

[ 1.  1.  1.  1.]

print(x ** 3)

[  8.     15.625  27.     42.875]


So called broadcasted operations: One arg is a traditional float, the other one is a numpy array. Operations again are elementwise:

print(3 * x)

[  6.    7.5   9.   10.5]

x -= 0.1
print(x)

[ 1.9  2.4  2.9  3.4]


### Aggregation functions¶

Aggregation functions map a vector to a single number:

print(x)
print(np.min(x), np.max(x))

[ 1.9  2.4  2.9  3.4]
1.9 3.4

print(np.mean(x), np.std(x), np.std(x, ddof=1))

2.65 0.559016994375 0.645497224368

np.sum(x), np.prod(x)

(10.6, 44.96159999999999)

Vector length:

np.linalg.norm(x)

5.4166410255803363

dot product, other names "scalar" or "inner" product:

np.dot(x, y)

2.3805833060471979

### Accessing vector elements¶

Before we start, we print x again, so it is easier to understand the output of the following examples:

print(x)

[ 1.9  2.4  2.9  3.4]


Slicing works as usual:

y = x[1: -1]
print(type(y))
print(y)

<class 'numpy.ndarray'>
[ 2.4  2.9]


Select values at given positions:

print(x[[0, 2, 3]])

[ 1.9  2.9  3.4]


### Accessing elements based on logical conditions¶

Select values with a certain property, eg values which are larger than 2.0:

print(x[x > 2.0])  # read this as "x where x is greater than 2.0"

[ 2.4  2.9  3.4]


How does this work ?

First observation: we can index a numpy array with another numpy array of same size holding boolean values

idx = np.array([True, False, True, False])
print(x[idx])

[ 1.9  2.9]


Second observation: x > 2.0 delivers such an boolean array:

print(x > 2.0)

[False  True  True  True]


This syntax can be used on the left side of = for assigning values:

print(x)
x[x > 2.0] = -1
print(x)

[ 1.9  2.4  2.9  3.4]
[ 1.9 -1.  -1.  -1. ]


## Exercises:¶

• repeat the examples

## What about matrices ?¶

We handle some matrix stuff at the end of the script....

# Introduction to plotting with matplotlib¶

# IGNORE THE LINE BELOW, IT IS ONLY FOR CREATING THIS SCRIPT!
%matplotlib inline

x = np.linspace(0.0, 2 * np.pi, 20)
y = np.sin(x)

import pylab
pylab.plot(x, y)
pylab.show()


Blue is the standard color. To change this use:

pylab.plot(x, y, "green")
pylab.show()


To plot dots, eg red dots:

pylab.plot(x, y, "r.")
pylab.show()


### Overlay multiple plots¶

pylab.plot(x, y)
pylab.plot(x, y, "ro")   # "o" means: big dots
pylab.plot(x, np.cos(x), "green")
pylab.show()


pylab.plot(x, y)
pylab.grid(True)
pylab.show()


### Adding labels to the plot¶

pylab.plot(x, y, label="wave")
pylab.plot(x, np.cos(x), label="shifted wave")
pylab.grid(True)
pylab.xlabel("time")
pylab.ylabel("amplitude")
pylab.title("sine wave over time")
pylab.legend()  # activates the legent in the upper corner
pylab.show()


### Arranging multiple plots¶

# 1 row, 2 columns ,first plot:
pylab.subplot(1, 2, 1)
pylab.plot(x, y, label="sin")
pylab.legend()
# 1 row, 2 columns, second plot:
pylab.subplot(1, 2, 2)
pylab.plot(x, np.cos(x), "green")
pylab.grid(True)
pylab.show()


### How to store a plot to the file system¶

pylab.plot(x, y)
pylab.savefig("sin.png")
!ls -l sin.png

-rw-r--r--  1 uweschmitt  staff  9905 27 Jun 20:23 sin.png


### Some exampels for other types of plots¶

N = 50
x = np.random.rand(N)     # N randoms with uniform distribution in range 0 .. 1
y = np.random.rand(N)     # dito

radiuses = 15 * np.random.rand(N)  # N random uniform distributed radiuses in range 0 .. 15
areas = np.pi * radiuses ** 2      # you see numpy here ?
colors = np.random.rand(N)         # and N random colors

pylab.scatter(x, y, s=areas, c=colors, alpha=0.5)
pylab.show()

N = 2500
values_1 = 0.5 * np.random.randn(N)  # 500 normal distributes
values_2 = 0.2 + np.tan(0.2 * np.random.randn(N)) # some strange distribution

# alpha is transparency, "stipfilled" ommits the lines about the single rectangles:
pylab.hist(values_1, bins=50, color="green", alpha=0.3, histtype="stepfilled")
pylab.hist(values_2, bins=50, color="red", alpha=0.6)
pylab.show()


Thats not all folks, many more examples at:

### http://matplotlib.org/gallery.html¶

Beyond matplotlib, there are some new libraries, see:

http://ggplot.yhathq.com/

## Excercises:¶

• repeat examples

# Introduction to scipy¶

scipy provides stable and high quality algorithms for lots of math related tasks.

For example

• special functions
• interpolation 1d, 2d, splines, ...
• optimization, eg curve fitting
• statistics
• ...

### Numerical integration¶

scipy.integrate.quad takes the function to integrate and lower and upper integration limits:

import scipy.integrate
area, error_estimate = scipy.integrate.quad(np.sin, 0.0,  np.pi)
print(area)
print(error_estimate)

2.0
2.220446049250313e-14


An example with a more complicated function:

def strange(x):
return np.sin(np.cos(x) + 1.0)


(0.9542771279840992, 1.0594604393611143e-14)


### Curve fitting¶

We demonstrate how to fit $$f(x) = a \sin(x) + b \cos(c + x)$$ to given data:

def function(x, a, b, c):
"""x is a vector holding the x values for evaluation
a, b and c are parameters of the function
"""
return a * np.sin(2 * x) + b * np.cos (c + x)


Generate test data:

a = 1.0
b = 2.0
c = 0.1
x = np.linspace(0.0, 6.0, 40)

# artificial y(x) for given parameters:
y = function(x, a, b, c)

# we add normal distributed "noise":
y_measured = y + 0.5 * np.random.randn(len(x))

pylab.plot(x, y)
pylab.plot(x, y_measured, "o")
pylab.show()


Now we want to calculate estimates for a, b and c from x and y_measured:

import scipy.optimize
estimated, __ = scipy.optimize.curve_fit(function, x, y_measured)

print(a, b, c)
print(estimated)

1.0 2.0 0.1
[ 1.09731765  2.04016658  0.03714652]

a_est, b_est, c_est = estimated
y_estimated = function(x, a_est, b_est, c_est)
pylab.plot(x, y, label="y")
pylab.plot(x, y_estimated, label="y_estimated")
pylab.plot(x, y_measured, "o")
pylab.legend()
pylab.show()


# Matrices and linear algebra with numpy¶

To create a matrix you can start from a nested list, similar to the way we created vectors in the beginning of the script:

m = np.array([[1, 2, 3], [2, 3, 4]], dtype=float)
print(m)

[[ 1.  2.  3.]
[ 2.  3.  4.]]


m has two rows and three columns:

print(m.shape)

(2, 3)


For special matrices we have functions eye, ones and zeros:

print(np.eye(3))    # identity matrix

[[ 1.  0.  0.]
[ 0.  1.  0.]
[ 0.  0.  1.]]

print(np.ones((3, 2), dtype=float))

[[ 1.  1.]
[ 1.  1.]
[ 1.  1.]]

print(np.zeros((2, 3), dtype=float))

[[ 0.  0.  0.]
[ 0.  0.  0.]]


ufuncs work again:

print(np.sin(m))

[[ 0.84147098  0.90929743  0.14112001]
[ 0.90929743  0.14112001 -0.7568025 ]]


+ and * too:

print(m + 1)

[[ 2.  3.  4.]
[ 3.  4.  5.]]

print(2 * m)

[[ 2.  4.  6.]
[ 4.  6.  8.]]

print(m + m)

[[ 2.  4.  6.]
[ 4.  6.  8.]]


* is element-by-element-, not matrix-matrix-multiplication:

print(m * m)

[[  1.   4.   9.]
[  4.   9.  16.]]


.T transposes:

print(m.T)

[[ 1.  2.]
[ 2.  3.]
[ 3.  4.]]


The dimentions of m and m.T fit for matrix-matrix multplication. As said * does not work, we have to use dot function:

print(np.dot(m, m.T))

[[ 14.  20.]
[ 20.  29.]]


Python 3.5 introduced the new operator @ solely for the purpose of simplifying matrix multiplication in numpy code:

print(m @ m.T)

[[ 14.  20.]
[ 20.  29.]]


This notation is much more readoble for longer expressions and very similar to the mathematical notation:

print(m @ m.T @ m @ m.T)

[[  596.   860.]
[  860.  1241.]]

np.dot(m, np.dot(m.T, np.dot(m, m.T)))

array([[  596.,   860.],
[  860.,  1241.]])

## Solving linear equations¶

To solve an equation system

$x + 2 y + 3 z = 14$

$x + y = 3$

$y + 2 z = 8$

A = np.array( [[1, 2, 3], [1, 1, 0], [0, 1, 2]], dtype=float)
print(A)

[[ 1.  2.  3.]
[ 1.  1.  0.]
[ 0.  1.  2.]]


The right hand side is

y = np.array([14, 3, 8], dtype=float)


And the solution is computed using linalg.solve from numpy:

x = np.linalg.solve(A, y)
print(x)

[ 1.  2.  3.]


To check our result, we "insert x into the equation" and compare it toy:

print(np.dot(A, x) - y)

[ 0.  0.  0.]

print(A @ x - y)

[ 0.  0.  0.]

# THE LINES BELOW ARE JUST FOR FORMATTING THE INSTRUCTIONS ABOVE !
from IPython import utils
from IPython.core.display import HTML
import os
def css_styling():
"""Load default custom.css file from ipython profile"""
base = utils.path.get_ipython_dir()
styles = """<style>

body {
font-family: 'Lora', Consolas, sans-serif;

}
.rendered_html code
{
color: black;
background: #eaf0ff;
font-family:  'Source Code Pro', Consolas, monocco, monospace;
}

.CodeMirror pre {
font-family: 'Source Code Pro', monocco, Consolas, monocco, monospace;
}

.cm-s-ipython span.cm-keyword {
font-weight: normal;
}

strong {
background: #ffe7e7;
}

div #notebook {
# font-size: 10pt;
line-height: 145%;
}

li {
line-heigt: 145%;
}

div.output_area pre {
background: #fffdf0;
}

h1, h2, h3, h4 {
font-family: Kameron, arial;
}

div#maintoolbar {display: none !important;}
</style>"""
return HTML(styles)
css_styling()

/Users/uweschmitt/Projects/python3-course/lib/python3.5/site-packages/IPython/utils/path.py:258: UserWarning: get_ipython_dir has moved to the IPython.paths module
warn("get_ipython_dir has moved to the IPython.paths module")


`