About the standard library

A standard Python installation not only provides the Python interpreter, but also a huge collection of modules.

The reference documentation can be found at https://docs.python.org/3/library/index.html.

The website "Python Module of the Week" (https://pymotw.com/3/) introduces a curated selection of the standard library with more examples and less technical explanations.

import math

print(math.pi)
print(math.cos(math.atan(1)))
3.141592653589793
0.7071067811865476

Python also supports complex numbers; the imaginary unit is written as j:

1j ** 2
(-1+0j)
import cmath
print(cmath.sqrt(-1))

# exp(pi * i) == -1
print(cmath.exp(math.pi * 1j))
1j
(-1+1.2246467991473532e-16j)

The statistics module offers numerically robust implementations of basic statistics:

import statistics
print(statistics.median(range(12)))
print(statistics.variance(range(12)))
5.5
13.0

random offers pseudo-random number generators for different distributions:

import random
print(random.gauss(mu=1.0, sigma=1.0))
print(random.uniform(2, 3))
1.7450032322906668
2.7788173089790766
# floating point arithmetic is not exact:
print(1.1 + 2.2)

# the fractions module supports exact rational arithmetic:
import fractions

print("using fractions")
f = fractions.Fraction(11, 10) + fractions.Fraction(22, 10)
print(f)
print(float(f))


# the decimal module supports arbitrary-precision decimal arithmetic
import decimal

print("using decimal")
f = decimal.Decimal('1.1') + decimal.Decimal('2.2')
print(f)
print(float(f))
3.3000000000000003
using fractions
33/10
3.3
using decimal
3.3
3.3
float(f)
3.3

Data structures

A defaultdict is a dictionary-like data structure with a specified default value for unknown keys.

The signature is defaultdict(function), where function() delivers the default value.

from collections import defaultdict

d = defaultdict(lambda: 3)

print(d[0])
3

Here are two typical use cases:

# int() results in 0:
int()
0

Thus defaultdict(int) can be used to simplify counting:

data = "adffjjkjwet"

counter = defaultdict(int)
for c in data:
    counter[c] += 1
    
print(counter.items())
dict_items([('a', 1), ('d', 1), ('f', 2), ('j', 3), ('k', 1), ('w', 1), ('e', 1), ('t', 1)])

And defaultdict(list) for grouping data:

list()
[]
grouped_values = defaultdict(list)

values = [1, 2, 3, 2, 1, 3, 4]
groups = [0, 1, 1, 0, 1, 1, 0]

for g, v in zip(groups, values):
    grouped_values[g].append(v)
    
print(grouped_values)
for g, values in grouped_values.items():
    print('average of group', g, 'is', sum(values) / len(values))
defaultdict(<class 'list'>, {0: [1, 2, 4], 1: [2, 3, 1, 3]})
average of group 0 is 2.3333333333333335
average of group 1 is 2.25

The previous example for counting can be simplified further:

from collections import Counter

c = Counter(data)
print(c)
print(c.most_common(1))
Counter({'j': 3, 'f': 2, 'a': 1, 'd': 1, 'k': 1, 'w': 1, 'e': 1, 't': 1})
[('j', 3)]

Python tuples are helpful for grouping data; namedtuple extends this by assigning names to the elements:

from collections import namedtuple

Point = namedtuple("Point", ["x", "y", "z"])

p = Point(1, 2, 3)
print(p.x)
1
print(p[0])
1

Similar to the builtin tuple type, a namedtuple is immutable:

p.y = 2
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-18-738d48599776> in <module>
----> 1 p.y = 2

AttributeError: can't set attribute
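Although immutable, a namedtuple can produce a modified copy via its _replace method. A small sketch (my example, not from the original material):

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y", "z"])
p = Point(1, 2, 3)

# _replace returns a new namedtuple with the given fields changed:
print(p._replace(y=5))  # Point(x=1, y=5, z=3)
print(p)                # the original is unchanged: Point(x=1, y=2, z=3)
```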
# only mentioned here
import queue
import heapq
import itertools  # combinatorics and more
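The cell above only imports these modules. As a brief illustrative taste (my examples, not part of the course material):

```python
import heapq
import itertools

# heapq.nsmallest finds the n smallest elements without sorting everything:
print(heapq.nsmallest(3, [7, 1, 5, 3, 9]))     # [1, 3, 5]

# itertools.combinations enumerates all subsets of a fixed size:
print(list(itertools.combinations("abc", 2)))  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```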

LRU cache

import time

# seconds since 1.1.1970:
print(time.time())
1547825754.387507

The number of function calls grows exponentially for this recursive implementation of the Fibonacci numbers:

def fib(n):
    if n < 2:
        return 1
    return fib(n - 1) + fib(n - 2)
started = time.time()
print(fib(36))
print(time.time() - started)
24157817
5.495759010314941

A cache prevents repeated function evaluations for known pairs of (arguments, return value):

import functools

@functools.lru_cache()
def fib_cached(n):
    if n < 2:
        return 1
    return fib_cached(n - 1) + fib_cached(n - 2)
started = time.time()
print(fib_cached(36))
print(time.time() - started)
24157817
0.00025081634521484375
print(fib_cached.cache_info())
CacheInfo(hits=34, misses=37, maxsize=128, currsize=37)

LRU means "least recently used": when the cache is full, the entry that has not been accessed for the longest time is discarded.

The default lru_cache holds up to 128 entries. This can be changed by passing a different maxsize:

@functools.lru_cache(maxsize=2)
def fib_cached(n):
    if n < 2:
        return 1
    return fib_cached(n - 1) + fib_cached(n - 2)

started = time.time()
print(fib_cached(36))
print(time.time() - started)

print(fib_cached.cache_info())
24157817
0.09180808067321777
CacheInfo(hits=59863, misses=287964, maxsize=2, currsize=2)

Comment: with maxsize=None the cache is unlimited, but this could use up all your memory!

Regular expressions

Regular expressions are helpful for parsing complex strings.

Here we look for a sequence starting with a, followed by 0 or more digits, and finally terminated by a second a:

import re
m = re.search("a([0-9]*)a", "xyza12345abc")
m.group(0)
'a12345a'
m = re.search("a([0-9]*)a", "xyza12345abca111ax")
print(m.group(0))
print(m.group(1))
a12345a
12345
m = re.search("a([0-9]*)a", "xyz12345abc111ax")
print(m)
None

File system operations

import os
import pprint  # pretty print

current = os.getcwd()
os.chdir("/tmp")

print()
print("files in current folder", os.getcwd())
pprint.pprint(os.listdir("."))
print()

os.chdir(current)

if not os.path.exists("abc"):
    print("abc does not exist")

# touch:
open("abc", "w").close()

if os.path.exists("abc"):
    os.remove("abc")
    print("deleted abc")    
files in current folder /private/tmp
['.keystone_install_lock',
 '.X0-lock',
 'com.apple.launchd.6oHJuYQO9p',
 'com.apple.launchd.D08C15ACl9',
 '.ICE-unix',
 '09_proposed_solutions.ipynb',
 'powerlog',
 '.X11-unix',
 'boost_interprocess',
 'com.apple.launchd.OoMSdsou9l',
 'kpxc_server',
 '.font-unix',
 'tmux-501',
 'cvcd']

abc does not exist
deleted abc
print(os.path.abspath("."))
/Users/uweschmitt/Projects/python-course-one-day/src
print(os.path.dirname(os.path.abspath(".")))
print(os.path.basename(os.path.abspath(".")))
print(os.path.join("a", "b", "c.txt"))
/Users/uweschmitt/Projects/python-course-one-day
src
a/b/c.txt
print(os.path.splitext("abc.py"))
('abc', '.py')
if not os.path.exists("/tmp/a/b/c/d"):
    os.makedirs("/tmp/a/b/c/d")

Recent Python 3 versions also include the pathlib module, which makes path manipulations more expressive compared to using os.path.

import pathlib
here = pathlib.Path(".")
print(here)
print(here.resolve())
print(here.resolve().parent.parent)
print(here.resolve().parent.parent / "xyz")
.
/Users/uweschmitt/Projects/python-course-one-day/src
/Users/uweschmitt/Projects
/Users/uweschmitt/Projects/xyz

Iterate over files in folders using a "globbing" pattern:

import glob
for nb in glob.glob("/private/*/*.ipynb"):
    print(nb)
/private/tmp/09_proposed_solutions.ipynb
import shutil
shutil.copytree
shutil.rmtree
shutil.copy
<function shutil.copy(src, dst, *, follow_symlinks=True)>
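The cell above only displays some shutil functions. A minimal sketch of how they could be used, working in a temporary folder so nothing permanent is touched:

```python
import os
import shutil
import tempfile

folder = tempfile.mkdtemp()          # throwaway folder
src = os.path.join(folder, "a.txt")
open(src, "w").close()               # touch an empty file

# shutil.copy copies a single file and returns the destination path:
copied = shutil.copy(src, os.path.join(folder, "b.txt"))
print(os.path.exists(copied))        # True

# shutil.rmtree removes a folder including all of its contents:
shutil.rmtree(folder)
print(os.path.exists(folder))        # False
```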

Date and time handling

from datetime import datetime

n = datetime.now()
print(type(n))
print(n)
print(n.month, n.hour)
<class 'datetime.datetime'>
2019-01-18 16:36:00.155517
1 16
n = datetime.now()
time.sleep(1)
delta = datetime.now() - n
print(type(delta))
print(delta)
<class 'datetime.timedelta'>
0:00:01.003714

Formatting and parsing date and time values:

datetime.strftime
datetime.strptime
<function datetime.strptime>
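A small sketch of both directions; the format codes such as %Y and %H are documented in the datetime reference:

```python
from datetime import datetime

d = datetime(2019, 1, 18, 16, 36)

# datetime -> string:
s = d.strftime("%Y-%m-%d %H:%M")
print(s)                                       # 2019-01-18 16:36

# string -> datetime:
back = datetime.strptime(s, "%Y-%m-%d %H:%M")
print(back == d)                               # True
```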

Data persistence

import pickle

complex_data = {0: [1, 2], 1: {2: [1, 2, (3, 4)]}}

bytestream = pickle.dumps(complex_data)
print(bytestream)
b'\x80\x03}q\x00(K\x00]q\x01(K\x01K\x02eK\x01}q\x02K\x02]q\x03(K\x01K\x02K\x03K\x04\x86q\x04esu.'
back = pickle.loads(bytestream)

print(back)
print(back == complex_data)
{0: [1, 2], 1: {2: [1, 2, (3, 4)]}}
True
pickle.dump
pickle.load
<function _pickle.load(file, *, fix_imports=True, encoding='ASCII', errors='strict')>
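In contrast to dumps/loads, dump and load operate on file objects. A minimal sketch using a temporary file:

```python
import pickle
import tempfile

complex_data = {0: [1, 2], 1: {2: [1, 2, (3, 4)]}}

with tempfile.TemporaryFile() as fh:
    pickle.dump(complex_data, fh)             # write the byte stream to the file
    fh.seek(0)                                # rewind before reading back
    print(pickle.load(fh) == complex_data)    # True
```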

sqlite3 is a database system without management overhead. A sqlite database is just a plain file.

sqlite3 is not efficient for multi-client access, but it is very fast (also for large databases) when accessed from a single client, and as such is often used as an application data format. See also https://en.wikipedia.org/wiki/SQLite#Notable_users

import os
import sqlite3

if os.path.exists("data.db"):
    os.remove("data.db")

db = sqlite3.connect("data.db")
db.execute("CREATE TABLE points (x REAL, y REAL, z REAL);")

points = [(i, i + 1, i + 2) for i in range(10)]

db.executemany("INSERT INTO points VALUES (?, ?, ?)", points)
db.commit()

query = db.execute("SELECT x, y, z, x + y + z FROM points WHERE x > 3 AND z < 8")
for row in query.fetchall():
    print(row)
(4.0, 5.0, 6.0, 15.0)
(5.0, 6.0, 7.0, 18.0)

sqlite3 also shines for spatial data and fuzzy text search.

The copy module

data = (1, [2, 3], 3)
data_copy = data

Assignment using = only creates another name for the existing object. Thus:

data_copy is data
True

The copy module provides shallow and deep copying using copy.copy and copy.deepcopy:

import copy

data_copy = copy.copy(data)
print(data_copy is data)
print(data_copy[1] is data[1])
True
True
data_copy = copy.deepcopy(data)
print(data_copy is data)
print(data_copy[1] is data[1])
False
False

Data compression and special file formats

import zlib, zipfile, gzip, tarfile
import csv
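These modules are only imported above. A brief illustrative sketch (my examples) of gzip for compression and csv for parsing, kept in memory via io.StringIO:

```python
import csv
import gzip
import io

# gzip.compress / gzip.decompress operate on byte strings:
payload = b"hello " * 100
packed = gzip.compress(payload)
print(len(packed) < len(payload))          # True: repetitive data compresses well
print(gzip.decompress(packed) == payload)  # True

# csv.reader parses rows from any file-like object:
rows = list(csv.reader(io.StringIO("x,y\n1,2\n")))
print(rows)                                # [['x', 'y'], ['1', '2']]
```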

Multi-core execution

from concurrent.futures import ProcessPoolExecutor
import multiprocessing
import os
import time


def compute(argument):
    started = time.time()
    print("process", os.getpid(), "starts computation for argument", argument)
    
    time.sleep(argument)
   
    print("process", os.getpid(), "finished computation for argument", argument)
    return (os.getpid(), argument, time.time() - started)



n = multiprocessing.cpu_count()
print("number cores=", n)

started = time.time()
with ProcessPoolExecutor(n - 1) as p_pool:  # using all n cores might freeze the machine until the computations finish
    
    for worker_id, argument, needed in p_pool.map(compute, (3, 1, 2, 1, 2, 3, 1, 1, 2)):
        print(f"worker {worker_id} got argument {argument} and needed {needed:.2f} seconds")
    
print("overall time {:.2f} seconds".format(time.time() - started))
number cores= 8
process 58090 starts computation for argument 1
process 58092 starts computation for argument 1
process 58091 starts computation for argument 2
process 58089 starts computation for argument 3
process 58095 starts computation for argument 1
process 58093 starts computation for argument 2
process 58094 starts computation for argument 3
process 58095 finished computation for argument 1
process 58090 finished computation for argument 1
process 58092 finished computation for argument 1
process 58092 starts computation for argument 2
process 58090 starts computation for argument 1
process 58091 finished computation for argument 2
process 58093 finished computation for argument 2
process 58090 finished computation for argument 1
process 58089 finished computation for argument 3
process 58094 finished computation for argument 3
process 58092 finished computation for argument 2
worker 58089 got argument 3 and needed 3.01 seconds
worker 58090 got argument 1 and needed 1.01 seconds
worker 58091 got argument 2 and needed 2.01 seconds
worker 58092 got argument 1 and needed 1.01 seconds
worker 58093 got argument 2 and needed 2.01 seconds
worker 58094 got argument 3 and needed 3.01 seconds
worker 58095 got argument 1 and needed 1.01 seconds
worker 58090 got argument 1 and needed 1.01 seconds
worker 58092 got argument 2 and needed 2.01 seconds
overall time 3.06 seconds

You can see that the 9 function evaluations were distributed to 7 worker processes, so some workers performed multiple evaluations.

The longest single evaluation (3 seconds) dominates the overall runtime.

Calling external software

The easiest way to call external executables is os.system. The generated output is not captured; the return value is 0 for successful execution.

The following examples assume that you work on Linux or macOS, so you should adapt them if you work on Windows:

print(os.system("ls -al"))
0

The subprocess module is more versatile and allows finer-grained access to stdin and/or stdout of the executable:

import subprocess

p = subprocess.check_output("ls -al *.ipynb", shell=True)
print(str(p, "utf-8"))
-rw-r--r--  1 uweschmitt  staff   11672 Oct 20  2017 check_questions.ipynb
-rw-r--r--  1 uweschmitt  staff   30665 Jan 17 18:50 object_oriented_programming_introduction.ipynb
-rw-r--r--  1 uweschmitt  staff   93458 Oct  4  2017 reference.ipynb
-rw-r--r--  1 uweschmitt  staff  166543 Jan 17 22:07 script.ipynb
-rw-r--r--  1 uweschmitt  staff   38028 Jan 18 16:32 selected_modules_from_the_standard_library.ipynb
-rw-------  1 uweschmitt  staff   31600 Jun 12  2018 solutions.ipynb

Here we start a Python process (-i is crucial to make this work), and remotely "enter" a line of code and capture the output.

p = subprocess.Popen("python -i -u -B", shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

print(p.stdout.readline())
print(p.stdout.readline())
print(p.stdout.readline())

p.stdin.write(b"print(2 ** 10);\n")  # remove this later and try again
p.stdin.flush()

print("read result")
print(p.stdout.readline())

p.terminate()
b'\n'
b'auto complete with tab enabled\n'
b'pretty print display hook installed\n'
read result
b'1024\n'

Such code is fragile and prone to hanging. Just remove the indicated line and you will see that the p.stdout.readline() call hangs.

This does not mean that this approach should be avoided, but you have to think about a communication protocol, including error handling.

Type hints

Although Python is dynamically typed, one can add type information to functions and variables (the typing module arrived in Python 3.5, variable annotations in Python 3.6).

Such type annotations are NOT checked at runtime. They serve to document code and can be used by external tools like mypy to detect potential type conflicts.

There are also external libraries which perform the type checks during runtime: E.g. https://github.com/agronholm/typeguard

PyCharm also offers type checking; see the PyCharm documentation for details.

Values are annotated with a :, except for return values, which use ->:

def add_integers(a: int, b: int) -> int:
    c: int = a + b
    return c

In case you are curious, the annotations are stored in the __annotations__ attribute of the function:

print(add_integers.__annotations__)
{'a': <class 'int'>, 'b': <class 'int'>, 'return': <class 'int'>}

As one can see, the type checks are not performed:

add_integers("a", "b")
'ab'

The typing module from the standard library allows declaring more complex types. In Python versions before 3.9, subscripting the builtin list does not work:

def a(x: list[int]) -> int:
    pass
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-60-1c61f426d91f> in <module>
----> 1 def a(x: list[int]) -> int:
      2     pass

TypeError: 'type' object is not subscriptable

Instead this works:

import typing

def add_many(values: typing.List[float]) -> float:
    return sum(values)

typing also offers more abstract declarations; Sequence is more general and accepts any sequence type, not just lists:

def add_many(values: typing.Sequence[float]) -> float:
    return sum(values)

Another example is typing.Union, which can be read as "or":

Number = typing.Union[bool, int, float]

def add_numbers(a: Number, b: Number) -> Number:
    return a + b