WARNING: This script is work in progress

Many chapters are still empty or incomplete.

More and more people code nowadays.

Learning to program is becoming increasingly important in many curricula, both at school and at university. In addition, the emergence of modern and easy-to-learn programming languages, such as Python, is lowering the entry threshold for taking first steps in programming and attracting novices of all ages from various backgrounds.

As such, programming is not a skill mastered by a small group of experts anymore, but is instead gaining importance in general education, comparable to mathematics or reading and writing.

The idea for this book came up whilst working as a programmer and consultant for ETH Zurich Scientific IT Services. Whether it was during code reviews, code clinics or programming courses for students, researchers, or data scientists, we realized that there is a big need among our clients to advance in programming, and to learn best practices from experienced programmers.

Whilst the first steps in programming focus on getting a problem solved, professional software development is about writing robust, reliable and efficient code, programming in teams, and best practices for writing programs spanning more than a few hundred lines of code.

Although plenty of established books about such advanced topics exist, most of them are comprehensive and detailed, and require programming experience in advanced topics such as object-oriented programming. Beyond that, these books often target programmers already making a living from software development.

On the other hand, introductions to the topics we cover do exist, but they are fragmented across the internet. Regrettably, they are not collected and offered in one single place, nor are they selected and presented uniformly according to our beliefs.

Regrettably, our clients’ motivation to progress is usually constrained by available time, and by other tasks having higher priority. Answers along the lines of "I fully agree that my code would benefit from your suggested changes, but I have no time for this" are no exception. As a result, we often do not recommend existing books, but instead focus on the most important concepts and on pragmatic, efficient and basic instructions to benefit our clients as efficiently as possible in the face of the given constraints.

This book follows this way of thinking. Instead of trying to provide an all-embracing and deep introduction into the principles of professional software development, our attempt is to select topics we regard as being the most important, and to offer sound introductions and explanations to the underlying concepts and principles. The reader should learn enough to get started and be able to dig deeper by learning from more advanced resources. We also attempt to split the book into chapters with minimal interdependence, so that readers can pick parts according to their interests and existing knowledge.

1. About this book

2. The programming process

3. General principles and techniques

4. Clean code

You might not believe it, but the following program is valid Python code and prints Hello world!:

# hello_world.py (lambda _, __, ___, ____, _____, ______, _______, ________: getattr( __import__(True.__class__.__name__[_] + [].__class__.__name__[__]), ().__class__.__eq__.__class__.__name__[:__] + ().__iter__().__class__.__name__[_:][_____:________] )( _, (lambda _, __, ___: _(_, __, ___))( lambda _, __, ___: bytes([___ % __]) + _(_, __, ___ // __) if ___ else (lambda: _).__code__.co_lnotab, _ << ________, (((_____ << ____) + _) << ((___ << _____) - ___)) + (((((___ << __) - _) << ___) + _) << ((_____ << ____) + (_ << _))) + (((_______ << __) - _) << (((((_ << ___) + _)) << ___) + (_ << _))) + (((_______ << ___) + _) << ((_ << ______) + _)) + (((_______ << ____) - _) << ((_______ << ___))) + (((_ << ____) - _) << ((((___ << __) + _) << __) - _)) - (_______ << ((((___ << __) - _) << __) + _)) + (_______ << (((((_ << ___) + _)) << __))) - ((((((_ << ___) + _)) << __) + _) << ((((___ << __) + _) << _))) + (((_______ << __) - _) << (((((_ << ___) + _)) << _))) + (((___ << ___) + _) << ((_____ << _))) + (_____ << ______) + (_ << ___) ) ) )( *(lambda _, __, ___: _(_, __, ___))( (lambda _, __, ___: [__(___[(lambda: _).__code__.co_nlocals])] + _(_, __, ___[(lambda _: _).__code__.co_nlocals:]) if ___ else [] ), lambda _: _.__code__.co_argcount, ( lambda _: _, lambda _, __: _, lambda _, __, ___: _, lambda _, __, ___, ____: _, lambda _, __, ___, ____, _____: _, lambda _, __, ___, ____, _____, ______: _, lambda _, __, ___, ____, _____, ______, _______: _, lambda _, __, ___, ____, _____, ______, _______, ________: _ ) ) )

If you want to understand how this program works, continue reading here.

Writing such obfuscated code may be fun and challenging. It may also increase your reputation as a coding wizard, but if you want to be regarded as a professional software developer, you should definitely invest your mental energy in something better than writing such code.

Why clean code?

One common rule is: You read code more often than you write it. Be it because you want to find a bug, because you want to extend or improve the code, or because others want to learn from it.

So always consider that you might want to understand your own code weeks or months from now and, as a non-selfish and team-oriented programmer, that the same holds for others.

Another idea behind good code is that your code should communicate the ideas and concepts of your solution. So code not only instructs your computer to solve a problem, it is also a means of communication with humans.

So, write code for humans, not for computers!

About clean code principles

What we present in this chapter should not be considered hard rules. As with almost all rules, reasonable exceptions exist in real life.

4.1. What does clean code look like?

What are the hallmarks of readable code? The following principles form the basis of clean code. Although we list a few examples of how to realize these principles, more elaborate explanations and techniques follow in the subsequent sections:

  • Little scrolling needed: Understanding how a "unit" in your code works should require no or little scrolling. This can be supported, e.g., by writing short functions and by keeping interdependent parts together.

  • Minimal level of indentation: Loops and branches introducing more than two levels of indentation can be hard to understand, especially if they don't fit on a single page on your screen. Writing functions can circumvent this. This is also mentioned in the Zen of Python. Newspapers with their narrow columns follow the same principle: many short rows are preferred over a few long ones.

  • Clear naming: Well-chosen names are kept up-to-date with the code, support the understandability of your code and do not confuse the reader. Good names can also avoid unneeded comments and thus contribute to shorter code.

  • Directly understandable: This includes avoiding programming tricks which result in short one-liners but require extra mental work when you or others read them. The code layout should also follow your solution strategy.

  • No surprises: As an example, names and comments should never contradict your approach, confuse the reader, or cause deep wrinkles on the reader's forehead.

  • Consistency: Amongst other things, naming of variables and functions should be consistent. So either use index as a variable name in all places or idx, but not both!

  • Don't litter: Keeping commented-out lines of code or unused functions in your code only increases code size and decreases understandability. Use a Version Control System instead.

4.2. General comments about code layout

4.2.1. The usefulness of empty lines

Don't underestimate the effect of using empty lines to structure your code. An empty line separating two blocks of code guides your eye and makes it easier to focus on such a block. So group related fragments together and separate them with single empty lines. But don't overuse this!

Try to structure your code in a way that the flow of operations is as linear as possible. Also try to reduce scattering of related operations, such as initializing and using a specific object.

Example

The following example already uses empty lines to structure the code. But operations such as opening the result file and writing to it, as well as initializing lists and appending data, are scattered:

# grouping_bad.py
with open("results.txt", "w") as fh:

    filtered_numbers = []
    numbers = [2, 4, 6, 8, 9, 10]   (1)
    results = []

    for number in numbers:
        if number % 2 == 0:
            filtered_numbers.append(number)   (2)
        else:
            break

    for number in filtered_numbers:
        results.append(number + 1)   (3)
        results.append(number ** 2)

    for result in results:
        print(result, file=fh)   (4)
1 The actual data we want to process is defined in the middle of the script.
2 Your eyes have to move up multiple lines to check the initial value and type of filtered_numbers.
3 Even more lines to scan to find the initial value and type of results.
4 You have to read the top of the file to check the used file name and opening mode. Opening and writing is scattered over the script.

The distinct steps

  • define input data

  • filter data

  • compute results

  • write results

are clearly visible in the reordered code now:

# grouping_good.py
numbers = [2, 4, 6, 8, 9, 10]   (1)

filtered_numbers = []   (2)
for number in numbers:
    if number % 2 == 0:
        filtered_numbers.append(number)
    else:
        break

results = []   (3)
for number in filtered_numbers:
    results.append(number + 1)
    results.append(number ** 2)

with open("results.txt", "w") as fh:   (4)
    for result in results:
        print(result, file=fh)
1 The data to process now stands out at the beginning of the script.
2 Initialization of filtered_numbers is now closer to the following filtered_numbers.append.
3 Same for results.
4 The chosen file name and file opening mode are now also close to the usage of the file handle.

4.2.3. PEP 8

A code style guide gives general advice on how to format your code to improve readability. If all developers involved in a project follow the same style the code is more consistent, and readable for all people involved.

The preferred style used by Python programmers is called PEP 8. More about PEPs in general here.

The most important rules of PEP 8 are:

  • Use multiples of four spaces for indentation.

  • Don't mix tabs and spaces for indentation. Most IDEs and code editors allow automatic conversion of tabs to spaces; you may have to enable this in your editor's settings.

  • Only use lowercase letters, underscores and digits for variable and function names. This is also called snake case, e.g. input_file_folder. For class names use camel case, i.e. an uppercase first character for every word and lowercase letters otherwise, e.g. InputFileHandler. For constants use uppercase characters and underscores, e.g. DAYS_PER_WEEK.

  • Use spaces around assignment =, comparisons like != and algebraic operations like +. But no spaces around = for named function arguments or declaring functions with default values.

  • Use one space after a comma.

  • No spaces after [ or before ]; the same holds for ( and ).

  • Maximum 79 characters per line. This is often weakened to 99 characters, but not more!

  • Two empty lines between functions and classes, one empty line between methods.

This is an example for non PEP 8 compliant code:

# non_pep8.py
class adder:
    def __init__(self,x,y):
        self.x=x
        self.y=y

def addNumbers(numberOne,numberTwo):
    return numberOne+numberTwo

And this is the same code after reformatting for PEP 8 compliance:

# pep8.py
class Adder:

    def __init__(self, x, y):
        self.x = x
        self.y = y


def add_numbers(number_one, number_two):
    return number_one + number_two
As code in a book reads differently than in a code editor, some of the examples in this book are not PEP 8 compliant, e.g. we often use only one empty line to separate function definitions.
Use tools like pep8 (meanwhile renamed to pycodestyle) to check your code from time to time. Some IDEs also indicate PEP 8 violations or offer tools to format your code for PEP 8 compliance.
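For example (assuming the tool is installed, e.g. via pip install pep8 or its newer name pycodestyle), you can run it on the file from above; it prints one line per violation with the file name, line and column number, an error code and a short description:

$ pep8 non_pep8.py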

4.2.4. Don’t postpone naming decisions and renaming

Don't postpone choosing good names when you first write code. Don't postpone renaming when names get out of date during development. Not adapting names to introduced changes can turn your names into lies!

4.3. About good names

This chapter introduces some basic rules for good names, be it for variables, functions or classes.

Motivating example

Can you spot the error in the following example?

# bad_name.py
def compute(a, b):
    return a + b

Compare this to the following code where we use expressive names for the function and the variables:

# good_name.py
def rectangle_area(height, width):
    return height + width
Good names reveal intention and minimize comments.

Giving functions and variables unambiguous names expressing the programmer's intention can reduce the need for comments. You could try to fix the first example above by adding a comment that compute computes the area of a rectangle. The second version does not require a comment, is shorter and directly understandable. So prefer names like first_names over the_list or fn.

Encoding the unit of a physical quantity in a variable name makes sense to reduce errors and to avoid comments. So time_since_event is not the worst name, but time_since_event_in_seconds is much clearer (if the unit is fixed and does not change based on other factors).

Consistent names

Within the same project, I use either idx or index as a name or as part of names, but not both. This reduces mental mapping and also searching for already used names. I also use singular vs. plural to distinguish a single element from a collection of elements, e.g. for first_name in first_names:.

Prefer pronounceable names.

Your brain prefers pronounceable names when reading code. So prefer full words over inconsistent abbreviations.

Prefer names which relate to existing subjects.

The term "iteration index" is a term you can use in spoken language when you communicate with humans, thus prefer the variable name iteration_index over index_iteration.

Avoid encoding types in names.

If you encode the data type of a variable in its name, you have to change the names as soon as the data type changes. Else your code will cause confusion. So better use names instead of name_list, or name_to_phone_number instead of phone_number_dict.

In the olden days, when Fortran 77 was a modern programming language and the length of variable and function names was restricted, the so-called Hungarian notation was a commonly used convention for encoding types into names. As said, this was in the olden days.

Remove noise

Why use a_name if name fulfills the same purpose? Why income_amount if income is sufficient? If a spelling error sneaks into a name, fix it instead of reproducing the same error all over the code; after all, you might want to search for that name later.

General names support re-usability

Choose names as general as possible. In the improved example above we could have used def window_area(window_height, window_width) instead, because this is how we use this function in our current house construction project. The choice of more general names here makes this function reusable within other projects without modification.

Using window in the names here also introduces noise as the computation here is not specific to a window, and the computation will also only be correct if the window is rectangular.

Nouns or verbs?

As classes in most cases represent entities, and methods often are considered as actions, one should try to choose nouns for class names, and verbs for method names. For functions verbs should be preferred, but sometimes nouns are also ok. E.g. we used rectangle_area above and not compute_rectangle_area, as area = rectangle_area(h0, w0) reads well and adding compute_ would introduce noise.

Exceptions

A common convention is to use i, j, and k as variable names in counting loops, and names like n and m for natural numbers, e.g. for the length of a list or the dimensions of a matrix. This is ok for local variables which are only used within a small range of lines of code.
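A small, contrived illustration of this convention (the matrix is made up for demonstration):

matrix = [[1, 2, 3],
          [4, 5, 6]]

n = len(matrix)       # number of rows
m = len(matrix[0])    # number of columns

for i in range(n):
    for j in range(m):
        print(matrix[i][j])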

4.4. Explanatory variables

Introducing extra variables can enhance readability considerably. You can name sub-expressions of larger mathematical formulas and avoid error-prone duplication. If the names are well chosen, the extra variables add information for the reader and thus enhance understandability.

The following code assumes that config is a nested dictionary holding the configuration settings for some data analysis:

# repeated_expressions.py
def analyze(config, data):
    eps = config['numerics']['eps']
    max_iteration_count = config['numerics']['max_iteration_count']
    max_relative_error = config['numerics']['max_relative_error']
    ...

By introducing a meaningful variable numerics for a repeated sub-expression we reduce code duplication, thus the probability of typing mistakes. The result is also more pleasant to read. In case you rename the key for the numerical parameters section, only one change is required:

# with_sub_expressions.py
def analyze(config, data):
    numerics = config['numerics']

    eps = numerics['eps']
    max_iteration_count = numerics['max_iteration_count']
    max_relative_error = numerics['max_relative_error']
    ...

Another example using an explanatory variable to simplify mathematical expressions:

# quadratic_before.py
import math


def solve_quadratic(p, q):
    """solves x^2 + p x + q == 0"""
    if p * p - 4 * q < 0:
        return None, None
    x1 = -p / 2 + math.sqrt(p * p - 4 * q) / 2
    x2 = -p / 2 - math.sqrt(p * p - 4 * q) / 2
    return x1, x2


assert solve_quadratic(-3, 2) == (2, 1)

The term $p^2 - 4q$, the discriminant of the quadratic equation, has a mathematical meaning and appeared three times. This is the improved version using a variable discriminant:

# quadratic_after.py
import math


def solve_quadratic(p, q):
    """solves x^2 + p x + q == 0"""
    discriminant = p * p - 4 * q
    if discriminant < 0:
        return None, None
    x1 = -p / 2 + math.sqrt(discriminant) / 2
    x2 = -p / 2 - math.sqrt(discriminant) / 2
    return x1, x2


assert solve_quadratic(-3, 2) == (2, 1)
Explanatory constants

Literal values like 24 or 60 in the context of time calculations may be understandable, but the intention of mem_used / 1073741824 requires a comment:

mem_used /= 1073741824 # bytes to gb

Using a constant variable for such magic values is the better solution, especially when such calculations appear multiple times in different parts of your code. Usually constants are declared at the top of a script after the import statements.

BYTES_PER_GB = 1073741824
...
mem_used /= BYTES_PER_GB

4.5. About good comments

The fewer comments you have to write, the better. This is a bold statement, and the idea of reducing the amount of comments might be contrary to your current coding practice.

The reason for this is that comments can turn into lies if you change your code but don’t adapt your comments. This is also an example for avoidable redundancy.

The trick is to reduce the amount of comments needed by choosing good names and decomposing your code into small functions.

Comment the unexpected

Only comment what you cannot directly express in your code. Clean code only comments the unexpected or non-obvious. E.g. add comments to tricks used to speed up code, or to solutions copy-pasted from Stack Overflow. In the latter case I add a comment containing the full link to the corresponding solution. This honours the one who spent time to solve your problem, and the comment also adds a link to the full documentation of the underlying problem and its solution.

Comments should describe intention

This comment is redundant and describes what we already know:

tolerance = tolerance / 60 # divide by 60

But this comment explains the intention:

tolerance = tolerance / 60 # seconds to minutes
Will I myself or others understand my comments later?

When writing comments you can be deep in the programming process and comments will appear understandable. Often this is a misjudgement, because your mind is familiar with the current context. So review your comments and consider if you or others will be able to understand them weeks or months later.

Use docstrings. No comments before function declarations!

Don't comment your functions by writing comment lines before the def line. Python offers a better mechanism: if you declare a string after the def line and before the first actual statement of the function body, this string is called a docstring and automatically contributes to Python's built-in help system:

# docstring_example.py
def gcd(m, n):
    """computes greatest common divisor of two numbers   (1)

    args:
        m (int) - first number
        n (int) - second number

    returns:
        gcd (int) - greatest common divisor of m and n   (2)
    """
    while m != n:
        if m > n:
            m -= n
        else:
            n -= m
    return m


assert gcd(24, 15) == 3   (3)
assert gcd(1024, 17) == 1

help(gcd)
1 First line of docstring
2 Last line of docstring
3 Quick checks if function actually works

And this is the output from help(gcd):

$ python docstring_example.py
Help on function gcd in module __main__:

gcd(m, n)
    computes greatest common divisor of two numbers

    args:
        m (int) - first number
        n (int) - second number

    returns:
        gcd (int) - greatest common divisor of m and n
No dead code

Often we deactivate code by turning it into comments. This is fine for experiments, but unused code should be removed as early as possible. The same holds for unused functions. The idea of Version Control Systems is to track changes of your source code over time and to allow restoring files from older versions.

4.6. About good functions

Functions are fundamental building blocks for larger programs, and we already demonstrated the beneficial effects of using functions to avoid comments. Here we discuss a few recommendations for writing good functions in addition to choosing good names.

Look at your functions from the caller's perspective

Deciding whether a function name is well chosen, and whether the way you decompose your problem into functions makes sense, should mainly be done from the caller's perspective.

Ask yourself if the code sections calling your functions are readable, and if another function name or other arguments could improve this. Also ask yourself if the way you call and combine your functions appears intuitive and straightforward.

The technique of test-driven development, introduced later, supports this practice.

A function should do one thing only

Let's start with an example of a function not doing one thing: the function check_prime below asks for a number, checks if it is prime, prints some output and returns a result.

# multifunc.py
def check_prime():
    n = int(input("number to check ? "))
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            print("no prime")
            return False
        divisor += 1
    print("is prime")
    return True


check_prime()

And this is an improved version:

# is_prime.py
def is_prime(n):   (1)
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True


def ask_user_and_check_prime():   (2)
    n = int(input("number to check ? "))
    if is_prime(n):
        print(n, "is prime")
    else:
        print(n, "is not prime")


assert is_prime(7)
assert not is_prime(8)
assert not is_prime(9)

ask_user_and_check_prime()
1 We extracted a function is_prime which returns a logical value indicating if the argument n is prime or not.
2 We renamed check_prime to ask_user_and_check_prime; this function now serves as the main task and delegates the actual check to is_prime.

Some benefits:

  • is_prime is reusable, the user of this function can decide where the number to check is coming from and if a result should be printed or not.

  • is_prime can be tested within an automated testing suite. The former version contained a call to input; such user interaction makes automated testing difficult.

The concept of "one thing" is a bit vague, and related decisions sometimes require some gut feeling or experience. This concept becomes clearer in combination with the following sections.
Functions should be small

Long functions spanning multiple pages on a screen are hard to understand and require scrolling, whereas short functions are spatially localized and can be grasped easier.

Splitting up code into small functions also introduces more names and, provided that the names are well chosen, gives the reader more information about how the code works and what the parts of the code are intended to do.

Another problem introduced by longer functions is that control statements like if, continue and break result in different execution paths depending on the input arguments, as demonstrated in the following example:

def analyze(n):
    if n % 2 == 0:
        if n > 0:
            ...
        else:
            ...
    else:
        ...

Here we see different such execution paths for even and odd numbers, and for even numbers we recognize different paths for positive and non-positive numbers.

Implementing the ... code parts in separate functions analyze_for_odd_n, analyze_for_positive_even_n and analyze_for_non_positive_even_n shortens the function analyze to a few lines and also introduces separate functions which can be tested independently.
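A minimal sketch of how the refactored analyze could look; the bodies of the extracted functions are just placeholders here:

def analyze_for_odd_n(n):
    ...   # placeholder for the analysis of odd numbers


def analyze_for_positive_even_n(n):
    ...   # placeholder


def analyze_for_non_positive_even_n(n):
    ...   # placeholder


def analyze(n):
    if n % 2 != 0:
        return analyze_for_odd_n(n)
    if n > 0:
        return analyze_for_positive_even_n(n)
    return analyze_for_non_positive_even_n(n)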

Implementing the ... code parts within analyze directly would result in one larger function, and testing this single function would require analyzing all possible paths and how to trigger them.

This was a simple example; in real-life applications, understanding all such possible paths can be challenging, and sometimes not all of them are understood and tested.

One level of abstraction

The code of a function should operate on the same level of abstraction. For example:

def workflow():
    configuration = read_configuration()
    data = read_data(configuration)
    results = process(data, configuration)
    write_result(results, configuration)

The called functions should then also follow this principle, e.g.:

def read_data(configuration):
    input_path = configuration['input_file']
    if input_path.endswith(".csv"):
        return read_csv(input_path)
    elif input_path.endswith(".xlsx"):
        return read_xlsx(input_path)
    else:
        raise NotImplementedError('do not know how to read {}'.format(input_path))

A function having multiple sections, maybe including comments to separate and describe the different sections, like # read data and # prepare data, is a good candidate for splitting into different functions.
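As a contrived sketch (the file format and computation are made up for illustration), such section comments often mark the seams along which a function can be split:

def run_analysis(path):
    # read data
    with open(path) as fh:
        lines = fh.read().splitlines()
    # prepare data
    values = [float(line) for line in lines if line.strip()]
    # compute result
    return sum(values) / len(values)

Extracting functions named after the comments, e.g. read_data and prepare_data, makes the comments superfluous and yields smaller functions which can be tested separately.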

Avoid flag arguments

Another common pattern to avoid are so called flag arguments. Such function arguments are usually integers, sometimes strings, which modify the overall behaviour of a function.

This is an example using such a flag argument action:

# flag_argument.py
def update(data, item, action):
    assert action in (0, 1, 2)
    if action == 0:
        data['first_name'] = item
    elif action == 1:
        data['second_name'] = item
    elif action == 2:
        data['birthday'] = item

Instead we write different functions with appropriate names:

# without_flag_argument.py
def update_first_name(data, item):
    data['first_name'] = item


def update_second_name(data, item):
    data['second_name'] = item


def update_birthday(data, item):
    data['birthday'] = item
The fewer arguments the better

The more arguments a function has, the harder it is to memorize the arguments, their order and their intention. This also increases the risk of introducing bugs. Function calls with many arguments are also hard to read and understand. A technique to reduce the number of arguments is to group arguments using dictionaries, as attributes of classes, or using namedtuple from the standard library's collections module.

The argument list of the following function is very long:

# distance_3d_6_args.py
def distance_3d(x0, y0, z0, x1, y1, z1):
    dx = x1 - x0
    dy = y1 - y0
    dz = z1 - z0
    return (dx ** 2 + dy ** 2 + dz ** 2) ** .5


distance = distance_3d(1, 2, 3, 3, 2, 1)   (1)
1 The function call is hard to understand, especially if function definition and function call are not as close as here.

If we introduce a new data type Point3D we obtain a function with only two arguments:

# distance_3d_2_args.py
from collections import namedtuple

Point3D = namedtuple("Point3D", "x y z")


def distance_3d(p1, p2):
    dx = p2.x - p1.x
    dy = p2.y - p1.y
    dz = p2.z - p1.z
    return (dx ** 2 + dy ** 2 + dz ** 2) ** .5


distance = distance_3d(Point3D(1, 2, 3), Point3D(3, 2, 1))   (1)
1 The function call now reveals the intention of the arguments.

Another technique is to use dictionaries to group configuration settings or algorithmic parameters. The drawback is that your code only documents the available or required keys within the function's implementation, whereas namedtuples or your own classes make this more explicit.
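A small sketch of the dictionary approach (the parameter names are made up); note that the required keys only become visible inside the function body:

settings = {"eps": 1e-3, "max_iter": 1000}


def optimize(data, settings):
    eps = settings["eps"]            # required keys are hidden in the implementation
    max_iter = settings["max_iter"]
    ...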

The (at the time of writing) upcoming Python 3.7 release introduces data classes, which extend the idea of namedtuples; fields are declared with type annotations (note that these annotations are not enforced at runtime).
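A minimal sketch using a data class (requires Python 3.7 or newer; the class mirrors the Point3D example from above):

from dataclasses import dataclass


@dataclass
class Point3D:
    x: float
    y: float
    z: float


p = Point3D(1, 2, 3)
print(p)   # prints: Point3D(x=1, y=2, z=3)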

If grouping arguments does not work for a function, I recommend using named (keyword) arguments when calling it. This also applies if you use functions from external libraries.

This example from scipy’s documentation demonstrates this:

from scipy.interpolate import interp1d
import numpy as np

x = np.linspace(0, 10, num=11, endpoint=True)   (1)
y = np.cos(-x**2 / 9.0)

f = interp1d(x, y)
f2 = interp1d(x, y, kind='cubic')
1 In numpy 1.14 the function linspace has six arguments: two are mandatory, the remaining four are optional and override default values. So passing the named arguments num=11 and endpoint=True here reveals their intention and is also independent of the argument order in the definition of linspace.
Avoid or minimize side effects

A function without side effects always produces the same result for the same arguments, is independent of the environment and also does not change it.

This is the case for purely mathematical functions, but as soon as a function reads from or writes to a file, this criterion is violated. A file which exists on one computer might not exist on another computer, file permissions might be different, etc.

If a function depends on the environment, it might work on one day but not on another day, and if a function changes the environment, other functions which worked before might now fail. So side effects can introduce invisible coupling and thus increase complexity, which we should keep as low as possible.

Thus functions without side effects are easier to test with unit-tests, easier to understand and debug, and also easier to re-use.

To reduce such side effects:

  • Isolate functions with side effects and keep them small. E.g. avoid functions which interact with the file system, web services or external programs and at the same time perform other operations like checking data for validity or doing some calculations. This overlaps with the concept that a function should only do one thing.

  • Don't interact with variables outside the function's body. Python has a global statement, but avoid it (see the sketch below). While only reading from a global variable is already problematic, this gets even worse if two or more functions communicate over global variables: they are tightly coupled, hard to re-use in other programs, and also difficult to understand. If you still feel you need something like global data, look into object-oriented programming; the idea of objects is to group such data and the operations on it within classes.
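A contrived sketch contrasting both styles (the function names are made up for illustration):

counter = 0   # global state, avoid this


def count_hits_global(items, target):
    global counter
    counter = sum(1 for item in items if item == target)


def count_hits(items, target):
    # no side effects: data comes in as arguments, the result is returned
    return sum(1 for item in items if item == target)


assert count_hits(["a", "b", "a"], "a") == 2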

No mutable default values

Whereas the former ideas also apply for other programming languages, this section is about an issue specific to Python.

Can you guess what the following program computes?

# mutable_default.py
def foo(a=[]):   (1)
    a.append(1)
    return len(a)


print(foo())
print(foo())

Were you right?

$ python mutable_default.py
1
2
1 One would assume that a=[] is evaluated for every call of foo, but the default list is only created once, when Python executes the def statement and constructs the function object. So from call to call this is the same list object. When you call foo the first time, the list is extended to [1], after the second call it is [1, 1] and so on.

Thus foo has a non-obvious side effect: the default value for a changes from function call to function call.

Choosing mutable default values for function arguments can introduce nasty and hard-to-detect bugs. Never use mutable default values like lists, sets or dictionaries. Immutable default values of type int, float, bool, str or tuple don't cause this problem. The same holds for None.

To avoid this issue, replace mutable default values with None and handle this default value within the function:

# mutable_default_fixed.py
def foo(a=None):
    if a is None:
        a = []
    a.append(1)
    return len(a)


print(foo())
print(foo())

Now you get a reproducible result:

$ python mutable_default_fixed.py
1
1

4.7. Consistency

Consistency relates to consistent code style, consistent naming and also to consistent error handling.

Consistent code style

Choose a code style for your project and stick to it.

This is important when you develop code in a team but also for one-man projects. Mixing code styles produces a confusing result, and sticking to the same style makes you feel more comfortable reading or extending parts developed by others.

We recommend PEP 8. This is the most widely followed style guide for Python code nowadays, and if you get used to it, you will find most existing Python code from the internet or other sources pleasant to read.

Beyond that, the more people follow PEP 8, the more consistent Python code gets overall, not only in distinct projects.

If you work on an existing project which does not follow PEP 8, resist mixing the existing style with PEP 8. Code with mixed styles looks very inconsistent and confusing.
Consistent naming

For example, choose either idx or index as a name or part of names throughout your code.

Consistent naming makes life for programmers much easier by reducing searches for details of given names in the source code.

In a former project I had to read and process text input files with different content. I chose consistent function names like read_site_from_file, read_source_from_file, check_site, check_source, insert_site_into_db, insert_source_into_db etc.

Another example: You develop a program requesting data from different web services. Then consistency would be increased if you use the term fetch throughout in variable and function names, but also in comments. Inconsistent code would mix fetch and similar terms like retrieve or get.

Consistent error handling

Functions can fail for different reasons. Be consistent and choose a fixed strategy for handling each specific kind of failure.

E.g. assume that within our current project some functions take an identifier as argument. All such functions should either return None if this identifier is invalid or raise an exception (e.g. a ValueError), but the two strategies should not be mixed.

This makes it much easier for you to implement error handling when you call such functions.
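A small sketch (the data and function names are made up) where both functions follow the same strategy and raise a ValueError for an invalid identifier:

SITES = {1: "zurich", 2: "basel"}
SOURCES = {1: "sensor-a"}


def load_site(site_id):
    if site_id not in SITES:
        raise ValueError("unknown site id {}".format(site_id))
    return SITES[site_id]


def load_source(source_id):
    if source_id not in SOURCES:
        raise ValueError("unknown source id {}".format(source_id))
    return SOURCES[source_id]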

4.8. Minimize indentation

Deep indentation should be avoided, especially if code spans more than a single page on your screen.

Personal anecdote: I once was asked to optimize some slow code. Printed on paper, the code spanned 20 pages without declaring a single function. Loops and branches were deeply nested and often spanned multiple pages. After rewriting the code using functions and reducing indentation, we could spot and fix the slow parts within half an hour.
Use functions to reduce indentation

The following example computes the number of real dividers of natural numbers in a given range. "real dividers" means dividers which are not one or the number itself.

# count_dividers.py
def main(n):
    number_dividers = {}
    for number in range(2, n):
        dividers = 0
        for divider in range(2, number):
            if number % divider == 0:
                dividers += 1   (1)
        number_dividers[number] = dividers
    return number_dividers


print(main(10))
1 This line has maximal indentation of 16 spaces.
$ python count_dividers.py
{2: 0, 3: 0, 4: 1, 5: 0, 6: 2, 7: 0, 8: 2, 9: 1}

Now we move the inner two loops into a separate function count_dividers.

# count_dividers_with_function.py
def count_dividers(number):
    dividers = 0
    for divider in range(2, number):
        if number % divider == 0:
            dividers += 1   (1)
    return dividers


def main(n):
    number_dividers = {}
    for number in range(2, n):
        number_dividers[number] = count_dividers(number)
    return number_dividers


print(main(10))
1 This line has maximal indentation of 12 spaces now.
Sidenote: Breaking out of multiple nested loops

Python's break statement only breaks out of the current loop. To break out of multiple nested loops, usually an extra variable tracking the state of the iterations is required.

The following slightly unrealistic example demonstrates this: for n taking values 5, 10 and 15 it searches for the first pair of natural numbers which multiplied yield n:

# multiple_breaks.py
def main():
    for n in (5, 10, 15):
        found = False
        for i in range(2, n):
            for j in range(2, n):
                if i * j == n:
                    print(i, j)
                    found = True   (1)
                    break
            if found:   (2)
                break


main()
1 We need a variable to inform the loop over i that the inner loop over j found a match
2 And now we need an extra break here.

If we move the two nested loops into a separate function, return allows us to break out much more easily:

# multiple_breaks_with_return.py
def check(n):
    for i in range(2, n):
        for j in range(2, n):
            if i * j == n:
                print(i, j)
                return


def main():
    for n in (5, 10, 15):
        check(n)


main()

So the next time you see the pattern from the first example, consider moving the affected loops into a separate function.

Use early checks

The following function checks if a given number is prime. It raises an exception when the number is smaller than 1.

# no_early_exit.py
def is_prime(n):
    if n >= 1:
        divider = 2
        while divider * divider <= n:
            if n % divider == 0:
                return False
            divider += 1
        return True
    else:
        raise ValueError("need number >=1")


assert is_prime(17) is True
assert is_prime(14) is False

Checking first and raising an exception early allows us to implement the actual prime check with less indentation:

# early_exit.py
def is_prime(n):
    if n < 1:
        raise ValueError("need number >=1")
    divider = 2
    while divider * divider <= n:
        if n % divider == 0:
            return False
        divider += 1
    return True


assert is_prime(17) is True
assert is_prime(14) is False

The same technique applies for if within loops. Instead of

for n in numbers:
    if valid(n):
        compute(n)

you can check the opposite and use continue to skip the computation:

for n in numbers:
    if not valid(n):
        continue
    compute(n)

The visual effect of this technique is more striking if you replace compute by a longer code section, maybe also containing more loops and branches.

4.9. How to achieve a maximum line length of 79 or 99 characters in Python?

There are multiple ways to split a long line into multiple shorter lines. To fix the following line, which is longer than the 79 characters recommended by PEP 8 ...

this_is_a_long_variable = computation_result_long_abc * computation_result_long_def

we can use \ to indicate that the current line will continue with the next line:

this_is_a_long_variable = computation_result_long_abc \
                          * computation_result_long_def

The Python interpreter also ignores line breaks if opening brackets are not closed yet:

this_is_a_long_variable = (computation_result_long_abc
                           * computation_result_long_def)

Or

addresses = [("peter", "muller", "8001", "zurich"),
             ("tim", "schmid", "8048", "altstetten"),
             ]

settings = {"numerics": {"eps": 1e-3,
                         "max_iter": 1000},
            "output_folder": "/tmp/results",
            }

This also works for import:

from math import (cos,
                  sin,
                  tan)

Another not widely known feature of Python is automatic string concatenation:

text = ("abc"
        "def"
        "ghi")

assert text == "abcdefghi"

which allows us to break longer strings over multiple lines. The alternative approach using Python's multi-line strings

text = """abcdef
          ghi"""

introduces unwanted spaces and a \n.

In case these methods result in expressions spanning multiple lines, it can also be helpful to use temporary or explanatory variables to increase readability. This applies especially to longer expressions, e.g. mathematical formulas.

5. Git introduction

git is a so-called distributed version control system. It supports a team of developers working on the same code base, but can also be used within one-man projects. The term version control system, also abbreviated as VCS, means that you can track past changes to your files and that you can also restore older versions or undo previous changes.

git is a kind of time machine for your code. git also allows managing different versions of your code in the same folder structure. So with git you don’t need to encode versions into file or folder names anymore. Say good bye to script_24-12-2017.py and analysis_01_tim_special.py and look forward to a clean and well organized code base.

Often people have heard of git in the context of the code hosting platforms github.com or gitlab.com, but git can be used on a single machine without any connection to these platforms.

git was released in 2005 by Linus Torvalds and the Linux kernel developer community. They had to replace their existing VCS and found no suitable alternative for managing contributions from hundreds of developers to a huge code base spanning millions of lines.

git's command line interface is a complex beast. For daily work you need only a small fraction of the available commands and options. For rare use cases and special solutions to git-related problems, search stackoverflow.com.
git was developed to manage source code and other text files, such as the asciidoc files for this book, but is not suited to manage binary and/or large files of several MB or more. Look for git annex and git lfs if your code base contains such files.

5.1. Installing git

All of the following instructions are executed on the command line within a terminal. After successful installation, Windows users should find a program called git bash which offers such a terminal.

To check if git works:

$ git --version
git version 2.11.0

The version number shown on your computer might be different, an exact match is not required for our instructions.

We introduce git here on the command line. Most IDEs nowadays implement their own user interface to interact with git, but they differ. As most solutions on stackoverflow.com are instructions at the command line, you should know how to use git in the terminal anyway.

5.2. Initial Configuration

The following commands should be issued once after installing git. They set required and reasonable default values.

Setting your personal data.

Adapt the values for user.name and user.email to your personal data:

git config --global user.name "Uwe Schmitt"
git config --global user.email "uwe.schmitt@id.ethz.ch"
Windows

On Windows you should find a file named .gitconfig in your home folder. Open this file with a text editor and add the lines:

[core]
editor = 'C:\\Program Files\\Notepad++\\notepad++.exe'

You may have to adapt this if Notepad++ is installed in a different folder; the single quotes and double backslashes are important!

Now run

git config --global core.autocrlf true

to configure line handling, more about this below.

Linux and Mac

You might adapt the core.editor setting to your preferred editor; the second setting, core.autocrlf, must not be changed:

git config --global core.editor /usr/bin/nano
git config --global core.autocrlf input
Line ending handling

Windows has an encoding for line endings different from the \n used on Mac OS X and Linux. This can cause problems if you use git within a team where developers use different operating systems. Without the specific settings for core.autocrlf above, files which contain the same text will be considered different by git because of the mentioned discrepancy in line endings.

Even if you intend to use git locally only, don’t skip this setting. You probably will forget this when you start to use git for distributed development in the future.

Check your settings

When you run

git config --global -e

your configured editor should show up. Don’t modify the shown file, close the editor immediately.

5.3. Fundamental concepts

We introduce a few underlying concepts of git here.

Not all our explanations below are 100% exact. We omit some deeply technical details and instead prefer abstractions and concepts which help you to learn and understand git for your daily work.

5.3.1. Patches

A patch is a text snippet which describes the transformation of a given file to a modified version. Usually such patches are created by a computer.

Let’s start with a file before modification

def rectangle_area(width, height):
    return width + height

which we fix and improve to:

def rectangle_area(width, height):
    """computes area of a rectangle"""
    return width * height

Then this is the patch describing those changes:

@@ -1,2 +1,3 @@
 def rectangle_area(width, height):
-    return width + height
+    """computes area of a rectangle"""
+    return width * height

Lines 3-5 describe the modification: if we remove the lines starting with - from the first version and add the lines starting with +, we get the modified file.

  • git is essentially a patch management system.

  • git chains up patches over time.

  • If we know the initial state of a folder and this chain of patches, we can reconstruct the state of the files at any point in time.

5.3.2. Commits

A commit consists of a single or multiple related patches. The history is an ordered list of commits.

When saying that git is a kind of time machine for your code, this does not work without your contribution: commits are not created automatically by git; you have to decide which changes contribute to a commit and instruct git to create it.

This is an example of what such a history looks like. The most recent entry is at the top.

$ git log   (1)
commit 3e6696e23eb41b19f4119f8327cd2b9fd5e462c9   (2)
Author: Uwe <uwe.schmitt@id.ethz.ch>   (3)
Date:   Fri Apr 27 23:56:42 2018 +0200   (4)

    fixed upper limit + added docstring   (5)

commit 450b10eb8405130a7e9d7753998b0475d68cd1c8   (6)
Author: Uwe <uwe.schmitt@id.ethz.ch>
Date:   Fri Apr 27 23:50:22 2018 +0200

    first version of print_squares.py
1 git log shows the history of our git repository; more about this command later.
2 This is the latest commit. The long string after commit is the unique commit id.
3 The commit's author.
4 The commit's time stamp.
5 Every commit has a commit message describing the commit.
6 This is the commit before the entry we've seen above; compare the time stamps!
Commit ids

A commit id is a seemingly random string consisting of forty symbols chosen from 0 .. 9 and a .. f. If we use commit ids in git commands, it is usually sufficient to use the first eight to ten characters to identify the related commit.
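For example, the latest commit from the history above could be inspected with an abbreviated id (assuming this commit exists in your repository):

$ git show 3e6696e23e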

And this is the same history, this time including patches:

$ git log -p   (1)
commit 3e6696e23eb41b19f4119f8327cd2b9fd5e462c9
Author: Uwe <uwe.schmitt@id.ethz.ch>
Date:   Fri Apr 27 23:56:42 2018 +0200

    fixed upper limit + added docstring

diff --git a/print_squares.py b/print_squares.py   (2)
index 140c0a6..18fce63 100644
--- a/print_squares.py
+++ b/print_squares.py
@@ -1,3 +1,4 @@
 def print_squares(n):
-    for i in range(1, n):
+    """prints squares of numbers 1 .. n"""
+    for i in range(1, n + 1):
         print(i, "squared is", i ** 2)   (3)

commit 450b10eb8405130a7e9d7753998b0475d68cd1c8
Author: Uwe <uwe.schmitt@id.ethz.ch>
Date:   Fri Apr 27 23:50:22 2018 +0200

    first version of print_squares.py

diff --git a/print_squares.py b/print_squares.py
new file mode 100644
index 0000000..140c0a6
--- /dev/null   (4)
+++ b/print_squares.py
@@ -0,0 +1,3 @@
+def print_squares(n):
+    for i in range(1, n):
+        print(i, "squared is", i ** 2)
1 git log -p additionally shows the patches related to every commit.
2 This is where the patch for the latest commit starts.
3 And this is where the patch for the latest commit ends.
4 This is a special patch. The /dev/null indicates that the file did not exist before this commit. We see only added lines in the patch, thus this commit describes the very first version of print_squares.py.
Good practice

You might have several changes in your files, but done for different reasons. Maybe one file was changed to fix a bug and changes in two other files implement a new feature of your software. Good practice is to create two commits for this. The first commit should contain the patches for fixing the bug, the other commit implements the new feature.

Commits should also be small. So in case you program daily, better create commits several times a day rather than once a week.

One reason for this is that git e.g. allows undoing a previous commit as a whole, so undoing changes gets difficult if you only want to undo a part of a commit.

Try to separate commits which clean or restructure code, commits which fix bugs and commits adding or changing functionality.

Tasks like "I thought I fixed this bug, let's find and inspect the related commit", or "What code changes actually implemented the new database scheme" will be much easier if you follow these recommendations.

5.3.3. Staging Area / Index

To generate a commit we first must create patches and collect them in the so-called staging area, also named index. Later we will learn about the git add command, which does exactly this.

5.4. Basic Git Hands-On

We introduce now a few git commands in a hands-on session. Please reproduce all steps on your computer. Typing helps to memorize the various git commands.

Setup
$ cd   (1)
$ mkdir -p tmp/git_intro   (2)
$ cd tmp/git_intro   (3)
1 cd without an argument changes the working directory to your home folder.
2 We create a folder for the hands-on, the -p also creates the tmp folder if needed.
3 We cd into this folder. This is now our working directory for the following hands-on.
git init

To put the current folder under version control, we create a local git repository using git init.

$ git init .   (1)
Initialized empty Git repository in /Users/uweschmitt/tmp/git_intro/.git/

$ ls -al   (2)
total 0
drwxr-xr-x   3 uweschmitt  staff    96 Jun  2 23:12 .
drwxr-xr-x  37 uweschmitt  staff  1184 Jun  2 23:12 ..
drwxr-xr-x  10 uweschmitt  staff   320 Jun  2 23:12 .git

$ git status   (3)
On branch master

No commits yet

nothing to commit (create/copy files and use "git add" to track)
1 The . refers to the current working directory. The message you see in the following line will be different on your computer.
2 The .git folder indicates that the current folder (and all future sub-folders) are now under version control. This folder contains all data managed by git, e.g. all commits. Never remove the .git folder!
3 git status tells us that the repository is still empty.
From now on the folder containing the .git directory and all files and subfolders are under version control.
Never create a git repository within an existing repository.
Our first commit

We create a file prime_number.py in the current directory with the following content:

def is_prime(n):
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True

Then we create our first commit:

$ git add prime_number.py   (1)

$ git status
On branch master

No commits yet

Changes to be committed:   (2)
  (use "git rm --cached <file>..." to unstage)

	new file:   prime_number.py

$ git commit -m "first version of prime_number.py"   (3)
[master (root-commit) b5f618d] first version of prime_number.py
 1 file changed, 7 insertions(+)
 create mode 100644 prime_number.py
1 We add the new file to the staging area.
2 git status tells us that there is a new file in the staging area.
3 We create a commit; after -m you see the commit message we chose for this commit. Just issuing git commit would open the configured editor where we can enter and save the commit message.
$ git log
fatal: your current branch 'master' does not have any commits yet
Second commit

Now we add a new function to prime_number.py:

def is_prime(n):
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True


def primes_up_to(n):
    return [i for i in range(2, n + 1) if is_prime(i)]

And a new file test_prime_number.py in the same folder:

from prime_number import *

up_to_11 = primes_up_to(11)
assert up_to_11 == [2, 3, 5, 7, 11], up_to_11

git diff tells us differences relative to the last commit:

$ git diff
diff --git a/prime_number.py b/prime_number.py
index 75e70ec..4b54780 100644
--- a/prime_number.py
+++ b/prime_number.py
@@ -5,3 +5,6 @@ def is_prime(n):
             return False
         divisor += 1
     return True
+
+def primes_up_to(n):
+    return [i for i in range(2, n + 1) if is_prime(i)]

and git status shows us that there is a new file not under version control yet:

$ git status
On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

	new file:   prime_number.py

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   prime_number.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	test_prime_number.py
$ git add prime_number.py   (1)
$ git add test_prime_number.py   (2)

$ git status   (3)
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	modified:   prime_number.py
	new file:   test_prime_number.py

$ git commit -m "added new function and test script"   (4)
[master cb29a6b] added new function and test script
 2 files changed, 7 insertions(+)
 create mode 100644 test_prime_number.py
1 We add the patch for the changes in prime_number.py to the staging area.
2 We add the new file test_prime_number.py to the staging area.
3 git status now confirms the state of the staging area holding patches related to the two files.
4 We create a new commit.
$ git show   (1)
commit cb29a6b5974ba6a571e12209e18753d480761849
Author: Uwe <uwe.schmitt@id.ethz.ch>
Date:   Sat Jun 2 23:13:33 2018 +0200

    added new function and test script

diff --git a/prime_number.py b/prime_number.py   (2)
index 75e70ec..4b54780 100644
--- a/prime_number.py
+++ b/prime_number.py
@@ -5,3 +5,6 @@ def is_prime(n):
             return False
         divisor += 1
     return True
+
+def primes_up_to(n):
+    return [i for i in range(2, n + 1) if is_prime(i)]

diff --git a/test_prime_number.py b/test_prime_number.py   (3)
new file mode 100644
index 0000000..1cac7da
--- /dev/null
+++ b/test_prime_number.py
@@ -0,0 +1,4 @@
+from prime_number import *
+
+up_to_11 = primes_up_to(11)
+assert up_to_11 == [2, 3, 5, 7, 11], up_to_11
1 git show shows the latest commit. We see that this commit contains two patches.
2 This is the first patch.
3 And this is the second patch.

5.4.1. Summary

These are the commands we used up to now:

  • git add to add new or modified files to the staging area

  • git commit -m '...' to create the commit.

  • git diff shows current changes relative to the last commit.

  • git status shows the current status of the repository.

  • git log shows the history of the repository.

  • git show to show the latest commit.

5.5. What's next?

5.5.1. Other helpful commands

  • git log --oneline to show a compressed history with one line per commit.

  • git diff --cached to inspect the staging area.

  • git show COMMIT_ID to show a commit having id COMMIT_ID.

  • git blame FILE shows each line in the given FILE with information about the commit which last modified the line.

  • git reset to empty the staging area in case you added a file unintentionally.

  • git commit --amend to add the current staging area to the latest commit. Useful if you forgot to add some files to the latest commit.

  • git checkout FILE resets the named FILE to the latest committed version.

  • git checkout . resets all files and folders in the current working directory to the latest committed version.

  • git rm to remove files under version control.

  • git mv to rename or move files under version control.
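A small illustration of a few of these commands in a working copy (a sketch; the exact output depends on your repository and git version):

$ git log --oneline              # one line per commit
$ git diff --cached              # inspect what is currently staged
$ git reset                      # unstage everything again
$ git checkout prime_number.py   # discard local changes to this file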

5.5.2. The .gitignore file

git status will also show backup files from your editor, .pyc files or the __pycache__ folder. To ignore such files and folders you can create a file named .gitignore, usually in the main folder of the repository. My repositories usually have a .gitignore file including the following lines:

**/*.pyc (1)
**/.*.sw? (2)
**/__pycache__ (3)
1 Ignores all files ending with .pyc in all subfolders (**).
2 Ignores all files like .primes.py.swp or .primes.py.swo which are backup files from vim.
3 Ignores files and folders named __pycache__ in all subfolders.
Good practices
  • Also commit the .gitignore file to your repository.

  • Add all huge or binary files to .gitignore.

5.6. How to bring existing source code under version control

git init . also works if the current folder contains existing code. So to put an existing project under version control, follow these steps:

  • cd to your project’s root folder.

  • git init .

  • git status and look for files, folders and name patterns you don’t want to add.

  • Create a .gitignore file to exclude these.

  • git add …​ for all files or folders you want to bring under version control.

  • git reset resets the staging area if you added unwanted files or folders.

  • git add .gitignore

  • git commit -m "initial commit"

5.7. About branches

Branches support maintaining different versions of source code in the same git repository. Below you see an example showing two branches master and feature. In the example below, the master branch is active, and git log will show you commits 3, 2 and then 1. The files and folders in your git repository will also correspond to the commits from this branch.

branch example 0

If you checkout (activate) the branch feature your files will correspond to the commits from the history 4, 2 and 1:

branch example 1

So branches allow you to switch the files and folders in your repository between different versions of your code.

Branches can also be merged. Here we merged feature into master, which created a new commit 5 combining the changes introduced in commits 3 and 4:

branch example 2

The master branch is the default branch, which we used all the time, and which you see in the output of git status from the previous hands-on session.

How to use branches
  • Try to keep the master branch clean and working.

  • If you implement a new feature, create a new branch for it. When development is complete and your changes are tested, you can merge this branch back into master (see the sketch after this list). This also allows bug fixes on master, or on a dedicated bug-fix branch, while the work on the feature is not yet completed.

  • Use branches for different versions of your software. This way you can fix bugs in older versions even if development has proceeded in the meantime. This helps when users don’t want the most current version (it could break their applications), but still need a bug-fix.

  • We will see later how to use branches to step back in time and inspect your repository as it was some commits ago.
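A minimal sketch of such a feature branch workflow (the branch name new-feature and the file feature.py are made up for illustration):

$ git checkout -b new-feature          # create and activate a feature branch
$ git add feature.py                   # ... after editing and testing ...
$ git commit -m "implement new feature"
$ git checkout master                  # switch back to the clean master branch
$ git merge new-feature                # merge the tested feature into master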

5.7.1. Branching and Merging Hands-On

The following examples are based on the repository we used during the first hands-on session teaching basic git commands.

git branch shows available branches, and the active branch is marked with *:

1 2 3 4 5 6
$ git branch (1) * master $ git log --oneline (2) 897d002 added new function and test script 3c6205c first version of prime_number.py
1 We have only one branch in the repository up to now.
2 This is the current history of the master branch.

And the following sketch depicts the state of the master repository:

branch ex 01

Now we create a new branch:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
$ git branch improve_tests (1) $ git branch (2) improve_tests * master $ git checkout improve_tests (3) Switched to branch 'improve_tests' $ git branch (4) * improve_tests master $ git log --oneline (5) 897d002 added new function and test script 3c6205c first version of prime_number.py
1 git branch improve_tests creates a new branch with the given name.
2 We see now both branches, master is still the active branch.
3 git checkout improve_tests activates the new branch.
4 We see both branches again, but the active branch changed.
5 The new branch still has the same history as the master branch.

This image reflects the state of our repository: two branches with the same history:

branch ex 02

Now we extend test_prime_number.py by adding two new lines at the end of the file:

from prime_number import *

up_to_11 = primes_up_to(11)
assert up_to_11 == [2, 3, 5, 7, 11], up_to_11
assert primes_up_to(0) == []
assert primes_up_to(2) == [2]

Let’s inspect the changes and create a new commit in our new branch:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
$ git diff diff --git a/test_prime_number.py b/test_prime_number.py index aaf9064..b258c6d 100644 --- a/test_prime_number.py +++ b/test_prime_number.py @@ -2,3 +2,5 @@ from prime_number import * up_to_11 = primes_up_to(11) assert up_to_11 == [2, 3, 5, 7, 11], up_to_11 +assert primes_up_to(0) == [] (1) +assert primes_up_to(2) == [2] $ git add test_prime_number.py $ git commit -m "improved tests" [improve_tests 2925e4d] improved tests 1 file changed, 2 insertions(+) $ git log --oneline 2925e4d improved tests 1d864cd added new function and test script 064a280 first version of prime_number.py
1 These are the two new lines.

The branch improve_tests now has a third commit and deviates from the master branch:

branch ex 03

Let’s switch back to master and check test_prime_number.py:

1 2 3 4 5 6 7 8
$ git checkout master (1) Switched to branch 'master' $ cat test_prime_number.py (2) from prime_number import * up_to_11 = primes_up_to(11) assert up_to_11 == [2, 3, 5, 7, 11], up_to_11
1 We activate the master branch again.
2 And we see that the test_prime_number.py script does not contain the recent changes we made in the improve_tests branch.

Next we improve our is_prime function: we handle even values of n first and then only check the odd divisors 3, 5, 7, …​:

def is_prime(n):
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True


def primes_up_to(n):
    return [i for i in range(2, n + 1) if is_prime(i)]

We commit these changes:

$ git add prime_number.py
$ git commit -m "improved is_prime function"
[master cef78db] improved is_prime function
$ git log --oneline
cef78db improved is_prime function
6f98631 added new function and test script
a52a49e first version of prime_number.py

And this is the current state of the repository:

branch ex 04

Now we merge the improve_tests branch into master:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
$ git branch (1) improve_tests * master $ git merge improve_tests (2) Merge made by the 'recursive' strategy. test_prime_number.py | 2 ++ 1 file changed, 2 insertions(+) $ git log --oneline --graph (3) * da3a460 Merge branch 'improve_tests' |\ | * 55c9744 improved tests * | cef78db improved is_prime function |/ * 6f98631 added new function and test script * a52a49e first version of prime_number.py
1 We make sure that master is the active branch.
2 We merge improve_tests into the active branch.
3 This shows the history with extra symbols sketching the commits from the merged branch. We also see that git merge created a new commit with an automatically generated message.

And this is the current state of our repository:

branch ex 05
About merge conflicts

A so called merge conflict arises when we try to merge two branches with overlapping changes. In this case git merge fails with a corresponding message, and git status shows an Unmerged paths section listing the affected files.

If you open an affected file you will find sections like

<<<<<<< HEAD
while i < 11:
   j += i
=======
while i < 12:
   j += i ** 2
>>>>>>> branch-a

The two sections separated by the marker lines <<<<<<< HEAD, ======= and >>>>>>> branch-a show the conflicting code lines. Here branch-a is the name of the branch we want to merge, so it will differ from merge to merge.

The computer is not able to determine the intended merge of the shown sections; human intervention is needed. The user has to edit the file and decide how the code should look after merging. The marker lines also have to be removed.

We could e.g. rewrite the example fragment as:

while i < 11:
   j += i ** 2

The full procedure to resolve the merge conflict(s) is as follows:

  1. Open a file with an indicated merge conflict in your editor.

  2. Find the conflicting sections.

  3. Bring the lines between <<<<<<< and >>>>>>> to the intended result and remove the three marker lines.

  4. Once this is done for all conflicting sections, save the file and quit the editor.

  5. git add FILE_NAME for the affected file.

  6. Repeat from step 1 until you have resolved all merge conflicts.

  7. git commit (without any message) completes the merge process.

Between the failing git merge and the final git commit your repository is in an intermediate state. Always complete resolving a merge conflict, or use git merge --abort to cancel the overall merging process.
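A hedged sketch of the whole cycle, assuming a merge of a branch named branch-a leads to a single conflict in test_prime_number.py:

$ git merge branch-a
Auto-merging test_prime_number.py
CONFLICT (content): Merge conflict in test_prime_number.py
Automatic merge failed; fix conflicts and then commit the result.
$ # edit test_prime_number.py and resolve the marked sections, then:
$ git add test_prime_number.py
$ git commit                         # completes the merge
$ # alternatively, to give up and restore the state before the merge:
$ git merge --abort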

5.8. Remote repositories

Up to now we have worked with a single local git repository. Beyond that, git allows synchronization of repositories located on the same or on remote computers.

Well-known hosting services such as github.com and gitlab.com host git repositories on their servers and offer a convenient web interface with additional functionality supporting software development in a team.

The open-source software gitlab-ce (gitlab community edition) is the base for the free services at gitlab.com and can be installed with little effort. This enables teams or organizations to run their own gitlab instance.

Why remote repositories ?

There are different use-cases for remote repositories:

  1. You want to share your code. For general availability of your code you can use so called public repositories; for a limited set of collaborators you can use private repositories. Public repositories allow others to implement changes or bug-fixes and offer them as so called merge requests (gitlab.com) or pull requests (github.com).

  2. You want to work on the same software project in a team.

  3. You develop and run code on different machines. A remote repository can be used as an intermediate storage.

  4. Last but not least, a remote repository can be used as a backup service.

1. Set up an account

To reproduce the following examples you need an account on gitlab.com or github.com. To create an account you have to provide your email address and your name; after that you should receive an email containing a link to complete your registration and to set your password.

2. Create a key pair

To set up a passwordless connection of the command line git client to your remote repositories, we need a key pair.

Run this in your terminal:

$ test -f ~/.ssh/id_rsa.pub || ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa (1)
$ cat ~/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCeBGRBsNYSD5RBSDFdC (2)
....
....
+w== uweschmitt (3)
1 You might already have created a key pair (e.g. for passwordless ssh access to other computers). Here we run ssh-keygen only if no such key pair exists.
2 This is where the public key starts.
3 And this is where it ends; we omitted some lines.
3. Add the ssh key to your gitlab / github account

Copy the public key, then

  • if you use gitlab.com, start with step 2 here.

  • on github.com, start with step 2 here.

4. Create a repository on github / gitlab

Choose a name and create a project:

Next we have to determine the URL for ssh access to the repository. This URL has the format git@gitlab.com:USERNAME/PROJECT_NAME.git or git@github.com:USERNAME/PROJECT_NAME.git.

On gitlab.com you find the exact URL at the top of the project’s page; on github.com it is shown when you press the Clone or download button.

Using this URL we can now connect our existing git repository to the remote:

1 2 3 4 5 6 7 8 9 10 11 12 13
$ git remote add origin git@gitlab.com:book_eth/book_demo.git (1) $ git remote -v (2) origin git@gitlab.com:book_eth/book_demo.git (fetch) origin git@gitlab.com:book_eth/book_demo.git (push) $ git push origin --all (3) remote: remote: To create a merge request for improve_tests, visit: remote: https://gitlab.com/book_eth/book_demo/merge_requests/new?merge_request%5Bsource_branch%5D=improve_tests remote: To gitlab.com:book_eth/book_demo.git * [new branch] improve_tests -> improve_tests * [new branch] master -> master
1 git remote add NAME URL configures the remote repository. origin is the name git users commonly use for the default remote repository; the last argument is the URL of the remote repository, which you have to adapt.
2 This shows that the previous command worked.
3 If the setup of the ssh keys worked, this command uploaded the current state of our repository to origin. Check this on your project’s web page.
5. We commit some changes locally

We extend our test script with an extra check:

from prime_number import *

up_to_11 = primes_up_to(11)
assert up_to_11 == [2, 3, 5, 7, 11], up_to_11
assert primes_up_to(0) == []
assert primes_up_to(2) == [2]
assert primes_up_to(3) == [2, 3]

Then we create a local commit:

$ git diff
diff --git a/test_prime_number.py b/test_prime_number.py
index d31d918..1328d26 100644
--- a/test_prime_number.py
+++ b/test_prime_number.py
@@ -4,3 +4,4 @@ up_to_11 = primes_up_to(11)
 assert up_to_11 == [2, 3, 5, 7, 11], up_to_11
 assert primes_up_to(0) == []
 assert primes_up_to(2) == [2]
+assert primes_up_to(3) == [2, 3]

$ git add test_prime_number.py
$ git commit -m "extended test_prime_number.py"
[master b45ded9] extended test_prime_number.py
 1 file changed, 1 insertion(+)

Which we now push to our remote repository:

$ git push origin master (1)
To gitlab.com:book_eth/book_demo.git
   da3a460..b45ded9  master -> master
1 git push origin master uploads the latest changes of branch master to origin.
6. Checking out an existing project
$ cd ~/tmp
$ git clone git@gitlab.com:book_eth/book_demo.git (1)
Cloning into 'book_demo'...
...
$ cd book_demo (2)
$ git log --oneline --graph (3)
* b45ded9 extended test_prime_number.py
*   da3a460 Merge branch 'improve_tests'
|\
| * 55c9744 improved tests
* | cef78db improved is_prime function
|/
* 6f98631 added new function and test script
* a52a49e first version of prime_number.py
1 git clone URL creates a clone of a remote repository. It creates the clone here in the book_demo subfolder which must not exist before.
2 We cd into the newly created folder.
3 We see the full history of the master branch

Other branches are not checked out automatically. git clone fetched them, but keeps them under a different name:

$ git branch (1)
* master
$ git branch -a (2)
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/improve_tests
  remotes/origin/master
$ git checkout improve_tests (3)
Branch 'improve_tests' set up to track remote branch 'improve_tests' from 'origin'.
Switched to a new branch 'improve_tests'
$ git branch (4)
* improve_tests
  master
1 Only master is shown as a local branch.
2 git branch -a shows all branches including so called remote tracking branches; here we find improve_tests under another name.
3 We can checkout improve_tests.
4 Which is now also available as a local branch.
7. git pull

git pull is used to fetch and merge changes from a remote repository. This is required when git push is used from different local git repositories, usually when you use git in a team.

git pull origin BRANCH_NAME fetches the given branch from the remote and then merges it into the currently active branch. This merge can also cause merge conflicts if your local changes overlap with changes in the remote repository.
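As a small sketch (using the remote name origin from above): before you continue working on a project which team members might have changed, you would run:

$ git checkout master
$ git pull origin master     # fetch new commits from origin and merge them into the local master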

6. Automated code testing

Every programmer knows that making mistakes is a very human trait, and the more complex a program gets, the more opportunities to break working code arise. Automated code testing is a technique to keep this under control.

To appreciate the practice of automated code testing, most programmers must first experience the pain of maintaining a code base without such tests:

I became scared of making changes to my code. I was no longer sure what depended on what, and what might happen if I changed this code over here…​ Soon I had a hideous, ugly mess of code. New development became painful.
— Harry J. W. Percival
Test-Driven Development with Python

6.1. About automated code testing

Automated testing is no magic tool which you install and which then uses a crystal ball to check if your software is correct. Instead it means that you have to write extra code (the testing code) which runs your productive code and checks if it behaves as expected. This testing code is collected in a test suite.

Benefits

Automated code tests help to circumvent the problems addressed in the previous quote, and as such learning automated code testing is a big game-changer for many programmers. Automated tests contribute every day to keeping programmers’ blood pressure and heart rates in a healthy range. With a test suite you can develop or change code incrementally and run the tests after every change. Deploying a new version of software after a successful run of the test suite also lets developers sleep well.

Implementing such automated tests introduces extra work, but with the benefit of keeping your development speed constant and of catching unintended bugs before your customers find them.

The alternative to automated testing is that you test your program manually (hopefully following a specified testing protocol, so that the tests are always the same). Compared to this manual procedure, automated testing soon reaches a break-even point after which it is more efficient. It also frees the programmer from boring and error-prone manual testing tasks.

Another benefit of automated testing compared to manual testing is that you will tend to postpone tedious manual testing, whereas you can trigger your automated tests as often as you want just by issuing one single command:

  • You renamed a function ? Run the test suite !

  • You changed your code to fix a bug ? Run the test suite !

  • You improved run time of a function ? Run the test suite !

Testable code is good code

Another, non-obvious, benefit of automated code testing is that not all code is equally well suited to being tested: writing code with testability in mind improves code clarity and the overall code structure. For example, small functions are easier to test than larger functions; we will see more examples below.

Different kinds of tests

Software testing is a discipline on its own, and we will focus on the most important testing practices you should know and use: unit testing, regression testing and continuous integration testing.

  • A unit test checks if one bit of functionality is correct. A test suite of unit tests should run fast.

  • Regression tests check if your code produces the same results as before a change.

  • Continuous integration (CI) testing is usually coupled with a remote code repository: for every update you push, a defined pipeline of tasks runs on a separate server. This usually includes building your software (e.g. as a Python package), installing it in a clean and fresh environment, and finally running a test suite on this installed code. Such a CI run can take minutes to hours.

6.2. A minimal unit test

Let us assume we want to implement unit tests for the following productive code:

# computations.py
def area_rectangle(width, height):
    if width < 0 or height < 0:
        raise ValueError("negative values for width or height are not allowed.")
    return width * height

The tests could be in a different file and look like this:

# test_script.py
from computations import area_rectangle

assert area_rectangle(2, 3) == 6 (1)
assert area_rectangle(3, 2) == 6
assert area_rectangle(3, 0) == 0
1 assert checks if the following logical expression holds. If this is not the case, assert raises an AssertionError exception with a message explaining why the expression evaluated to False. Otherwise the Python interpreter simply executes the next line of code.

6.3. Testing frameworks

The previous example was for illustration only and is not the recommended way to implement unit tests. Instead you should use a testing framework, which includes:

  • A standardized way to organize your testing code.

  • A test runner which detects your testing code, runs it and reports the results.

  • Utility functions, e.g. to generate readable messages explaining why tests fail.

  • Support for so called fixtures and mocks. More about these later.

Usually one creates one test script per module to test. Such a script consists of test functions.

Some early testing frameworks followed the xUnit naming scheme (e.g. JUnit for Java, PyUnit for Python). For Python the testing frameworks nose and pytest followed. My preferred modern testing framework for Python is pytest.

To install pytest just run pip install pytest.

6.4. pytest example

We can invoke pytest on the command line with pytest OPTIONS FILES/FOLDERS.

OPTIONS are command line options like -v for verbosity or -x to stop testing after the first failing test function.

Test discovery

FILES/FOLDERS refers to a space separated list of files or folders. If you specify folders, pytest recursively iterates over the Python files in every folder and runs files whose name starts with test_.

This is now test code to be run with pytest:

# test_computations.py (1)
from computations import area_rectangle

import pytest


def test_area_rectangle(): (2)
    assert area_rectangle(2, 3) == 6 (3)
    assert area_rectangle(3, 2) == 6
    assert area_rectangle(3, 0) == 0

    with pytest.raises(ValueError): (4)
        area_rectangle(-1, 1)
1 The name of the file with the test functions starts with test_.
2 Test function names start with test_.
3 This is the same strategy as we had before
4 Here we also test if the exception is raised as expected. So if area_rectangle raises a ValueError the test is successful. If no or a different exception is raised the test will fail.

After installing pytest we can run a test file as follows:

1 2 3 4 5 6 7 8 9 10
$ pytest -v test_computations.py (1) =================== test session starts =================== platform darwin -- Python 3.6.5, pytest-3.6.1, py-1.5.3, pluggy-0.6.0 -- /Users/uweschmitt/Projects/book/venv/bin/python cachedir: .pytest_cache rootdir: /Users/uweschmitt/Projects/book/content/chapter_07_testing_introduction, inifile: collecting ... collected 1 item test_computations.py::test_area_rectangle PASSED [100%] (2) ================ 1 passed in 0.01 seconds =================
1 This is how we run pytest.
2 Because we used the -v flag to increase verbosity, we see the status of every executed test function on its own line. Otherwise we would only see a summary per test script.

Now we break the productive code:

# computations.py
def area_rectangle(width, height):
    if width < 0 or height < 0:
        raise ValueError("negative values for width or height not allowed.")
    return width + height (1)
1 This is the broken computation.

And now we run pytest again:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
$ pytest -v test_computations.py =================== test session starts =================== platform darwin -- Python 3.6.5, pytest-3.6.1, py-1.5.3, pluggy-0.6.0 -- /Users/uweschmitt/Projects/book/venv/bin/python cachedir: .pytest_cache rootdir: /Users/uweschmitt/Projects/book/content/chapter_07_testing_introduction, inifile: collecting ... collected 1 item test_computations.py::test_area_rectangle FAILED [100%] (1) ======================== FAILURES ========================= ___________________ test_area_rectangle ___________________ def test_area_rectangle(): > assert area_rectangle(2, 3) == 6 (2) E assert 5 == 6 E + where 5 = area_rectangle(2, 3) test_computations.py:7: AssertionError ================ 1 failed in 0.07 seconds =================
1 Instead of PASSED we see now FAILED.
2 And this is the diagnostic output why the test function fails.

6.5. What to test

Every function we write should fulfill a contract which includes:

  1. what the function computes for reasonable arguments.

  2. how the function handles corner cases of arguments, e.g. empty lists of file names to process or an empty configuration file.

  3. how the function handles invalid arguments, e.g. negative tolerance parameters or non-existing files.

Cases 2 and 3 are important, and become crucial at the latest when you or others reuse your code. In the worst case such a function returns a nonsense result which is passed on to subsequent computation steps. Eventually a later step produces a hardly understandable error message, or the program delivers a nonsense result without any indication that something went wrong.

Developers tend to test expected behaviour and then forget to test error and corner cases. So don’t forget to test error and corner cases.

6.6. Unit tests should be fast

To benefit from automated code testing you should run your test suite after every small change. A long running test suite will interrupt your work, and thus you will tend to run it infrequently.

The quicker the test suite runs the better; a good test suite should not run longer than a few minutes.

6.7. The basic principle of unit tests

The fundamental idea of unit tests is that they should be independent, which means independent from other tests and independent from the environment.

So the order in which test scripts and functions are executed should not matter. A test function must not depend on the outcome of another test function.

Some examples for factors which should not impact the result of a test include:

  • Your user name.

  • The current date or time.

  • The existence of specific files or folders outside your project.

  • The availability of web-services or databases.

6.8. How to achieve this ?

To achieve such independence, so called fixtures and mocks are used.

From wikipedia.org: A software test fixture sets up the system for the testing process by providing it with all the necessary code to initialize it, thereby satisfying whatever preconditions there may be.

So to be independent of files on your system a fixture based test could:

  1. Copy example files included in your test suite to a temporary folder

  2. Let your productive code process them

  3. Check the results.

Fixtures also include cleaning up after the test has finished, as sketched below.
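As a sketch of how a fixture looks with pytest: the built-in tmp_path fixture provides a fresh temporary folder for every test function and removes it afterwards. The helper save_primes_to_file is made up for this illustration and is not part of the examples above:

# test_with_fixture.py
from prime_number import primes_up_to


def save_primes_to_file(n, path):
    # made-up productive code used only for this illustration
    with open(path, "w") as fh:
        for prime in primes_up_to(n):
            print(prime, file=fh)


def test_save_primes_to_file(tmp_path):    # pytest injects a temporary folder here
    target = tmp_path / "primes.txt"
    save_primes_to_file(11, target)
    assert target.read_text().split() == ["2", "3", "5", "7", "11"]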

A mock replaces a real dependency, such as a web service, a database or the system clock, with a controllable stand-in during the test.
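A minimal sketch using pytest’s monkeypatch fixture: we replace time.time during the test, so that a function depending on the current time becomes deterministic. The function seconds_since is made up for this illustration:

# test_with_mock.py
import time


def seconds_since(start):
    # made-up productive code which depends on the current time
    return time.time() - start


def test_seconds_since(monkeypatch):
    # freeze the clock: every call to time.time now returns 1000.0
    monkeypatch.setattr(time, "time", lambda: 1000.0)
    assert seconds_since(990.0) == 10.0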

6.8.1. Independence from environment

Functions without side-effects are easier to test than functions which modify global state or depend on the environment.

6.9. Test coverage

Test coverage tools measure which lines of your code are executed while the test suite runs and report the fraction of covered lines. Lines which are never executed by any test are candidates for additional tests.
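One option (a sketch, assuming you use pytest) is the pytest-cov plugin:

$ pip install pytest-cov
$ pytest --cov=computations test_computations.py
$ pytest --cov=computations --cov-report=term-missing test_computations.py

The first invocation reports the fraction of executed lines in computations.py, the second one additionally lists the line numbers which no test executed.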

6.10. Testable code is good code

small units

small functions benefits of small functions, execution paths, …​

what is hard to test

GUI, …​

maximize testability

refactoring example

6.11. Regression tests

A regression test records the output of your code for a fixed input and checks that later versions of the code still produce the same output.
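A minimal sketch for the primes_up_to function from the git chapter: we compare the current result against a reference result stored in a file. The file name primes_up_to_100_reference.json is made up for this illustration; it would be created once with a version of the code we trust and then committed together with the tests:

# test_regression.py
import json

from prime_number import primes_up_to


def test_primes_up_to_regression():
    result = primes_up_to(100)
    with open("primes_up_to_100_reference.json") as fh:
        expected = json.load(fh)
    assert result == expected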

6.12. Testing strategies

  • How to fix bugs with the help of tests.

  • How to bring existing code under test.

7. The Python standard library

8. The scientific Python stack

9. Practical setup of Python Projects

10. Object oriented Programming

10.1. What is it about ?

Every variable we use in Python refers to an object, be it a number, a string, a list or a numpy array. Modules and functions are objects, too.

A class is a recipe or template for creating an object. For example, Python offers one class str, but you can create as many objects of this class as you need.

An object is also called an instance of a class.

Object oriented programming refers to a programming style in which you solve a programming problem by implementing your own classes.

In the 90s object oriented programming was considered the silver bullet for developing reusable code. Thus some programming languages such as Java require that you implement everything, even a program which only prints your name, as a class. Python and other programming languages are more flexible, and the use of classes may not always be the best solution for a programming problem.

10.1.1. Inspection of the attributes of a class.

Methods of Python objects are also attributes. The only difference is that methods can be called like functions. To inspect the available attributes, Python offers the dir function:

# dir_example.py
def add(x, y):
    return x + y


data = {1: 2}

print("methods of add", dir(add))
print("methods of data", dir(data))

The output shows

1 2 3
$ python dir_example.py methods of add ['__annotations__', '__call__', '__class__', '__closure__', '__code__', '__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__get__', '__getattribute__', '__globals__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__kwdefaults__', '__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__', '__qualname__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__'] methods of data ['__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'clear', 'copy', 'fromkeys', 'get', 'items', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']

All listed attributes of the function add start and end with double underscores __; these are internal methods not intended for general use. But if you look at the methods of data you will recognize some familiar methods.

10.1.2. Encapsulation

Usually an object holds some data and offers methods to work with this data. The internal data is also called the internal state of the object.

Objects allow grouping data and operations on this data in one place. Another advantage is that you can use classes and objects without caring about their internal details. Instead, a well implemented class offers a simple interface to the user and thus supports reusability. Such an interface is also called an API, which is an abbreviation for Application Programming Interface. A class API is a specification of the methods of a class, including their functionality and a description of arguments and return values. The code using such an API is also called client code.

An API offers an abstraction layer for the user of a class. Client code does not need to be changed if internal details of the class change.

Example

A Python dictionary allows handling of a mapping data structure without caring about the internal details of how this data is organized. Skilled computer scientists implemented the details and took care that lookup time is very fast, without bothering you with such details. If computer scientists develop better algorithms for handling dictionaries, these could be implemented in future Python versions without requiring changes to existing Python code.

Use cases for classes

Your code benefits from using classes in the following cases:

  • You have a set of functions which pass complex data structures around, or whose argument lists would only become simpler by using global variables (see also Why global variables are bad). Classes offer a good solution in this situation.

  • You want to implement your own data types.

  • You work with data records, e.g. postal addresses and address books. An address could be implemented as a class, and the address book would be a class to manage the addresses. In this situation a simple solution would be to resort to tuples or namedtuples for the records and a list for the address book, but a class based implementation would hide the implementation details.

  • You want to encapsulate complex operations and offer an easy to use API.

Example Address Book

Your address record object could offer attributes like first_name, second_name, street, street_number, zip_code, city and country. The address book object could implement methods like add, find_by_name, delete and save_to_file.
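A minimal sketch of how such classes could look (only one of many possible implementations; the method save_to_file is omitted here):

# address_book.py
class Address:

    def __init__(self, first_name, second_name, street, street_number,
                 zip_code, city, country):
        self.first_name = first_name
        self.second_name = second_name
        self.street = street
        self.street_number = street_number
        self.zip_code = zip_code
        self.city = city
        self.country = country


class AddressBook:

    def __init__(self):
        self._addresses = []    # the internal state, hidden from client code

    def add(self, address):
        self._addresses.append(address)

    def find_by_name(self, second_name):
        return [address for address in self._addresses
                if address.second_name == second_name]

    def delete(self, address):
        self._addresses.remove(address)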

Example Web Service

Communication with a web service, for example one providing weather data, involves many technical details. A class offering methods like current_temperature_in(city_name) can be used to build tools for global weather analysis which focus on the details of the analysis. Switching to another web service with faster response times or more accurate data would only affect the internals of the class, leaving the client code untouched.

10.2. Introduction to Python classes

The next sections will teach you the basics of how to implement classes in Python.

10.2.1. Our first class

# greeter.py
class Greeter: (1)

    def say_hello(self): (2)
        print("hi you")


g = Greeter() (3)
g.say_hello() (4)
1 The class statement followed by the name of the class starts the class definition. We indent every declaration related to the class, for example the declaration of methods.
2 Methods are declared similarly to functions; ignore the self argument for now, we explain it later.
3 We create an instance of the class Greeter.
4 We call the method say_hello of g. We do not pass a value for the self argument from the definition; Python inserts it for us.

And this is the output:

1 2
$ python greeter.py hi you

10.2.2. What is self ?

Next we investigate what self is:

# what_is_self.py
class WhatIsSelf:

    def print_self(self):
        print("self is", self)


u = WhatIsSelf()
print("u is", u)
u.print_self()

v = WhatIsSelf()
print("v is", v)
v.print_self()

And this is the generated output:

1 2 3 4 5
$ python what_is_self.py u is <__main__.WhatIsSelf object at 0x7f847343b2b0> self is <__main__.WhatIsSelf object at 0x7f847343b2b0> v is <__main__.WhatIsSelf object at 0x7f847343b7b8> self is <__main__.WhatIsSelf object at 0x7f847343b7b8>

Lines 2 to 5: the values after at are the hexadecimal representations of the memory location of the corresponding object.

Comparing the code and the created output we conclude:

Insights
  1. self is the current instance of the class

  2. We always must declare self as the first argument of a method.

  3. We don’t specify the value for self when we call a method. Python inserts the current object for us.

10.2.3. Initializing an object

Python special methods start and end with two underscores __ and have a specific purpose. Such methods are also called dunder methods (abbreviation for double underscore methods).

# point_2d.py
class Point2D:

    def __init__(self, x, y): (1)
        self.x = x
        self.y = y

    def move(self, delta_x, delta_y):
        self.x += delta_x
        self.y += delta_y


if __name__ == "__main__":

    p = Point2D(1, 2) (2)
    print(p.x, p.y) (3)

    p.move(41, 21)
    print(p.x, p.y)
1 The __init__ method initializes an object. Here we set x and y as attributes of the current instance of the class.
2 We pass values 1 and 2 as arguments to __init__
3 Based on the implementation of __init__ both values are accessible as attributes of the object.

And this is the output of the program:

1 2 3
$ python point_2d.py 1 2 42 23

10.3. Inheritance

We can create new classes based on existing classes. This is called inheritance. If we inherit from a given class, we inherit its methods and can add new methods or overwrite existing ones.

The class we inherit from is called the base class; classes inheriting from a base class are called sub classes.

# vector_2d.py
import math

from point_2d import Point2D


class Vector2D(Point2D): (1)

    def length(self):
        return math.hypot(self.x, self.y)

    def add(self, other):
        result = Vector2D(self.x, self.y)
        result.move(other.x, other.y) (2)
        return result


if __name__ == "__main__":

    v1 = Vector2D(3, 4) (3)
    v2 = v1.add(v1)

    print(v1.length())
    print(v2.x, v2.y)
1 The declaration class Vector2D(Point2D): defines a new class Vector2D starting with the methods of the existing class Point2D. Here we add two extra methods length and add.
2 Since we inherit the method move from the base class Point2D, we can use it here.
3 This works because we also inherit the dunder method __init__ from Point2D.
1 2 3
$ python vector_2d.py 5.0 6 8
Exercise

Extend Vector2D with a method scale(self, factor) which returns a new Vector2D scaled by the given value. Then implement a method dot(self, other) which returns the dot product of self and other assuming that other is also an instance of Vector2D.

10.4. Special methods *

Python offers more dunder methods than __init__. These support syntactic sugar to make client code look more pythonic. This means such methods are not required for object oriented programming, but can help to offer a more convenient API.

Here we show a few examples:

# vector_2d_fancy.py
import math

from point_2d import Point2D


class Vector2D(Point2D):

    def length(self):
        return math.hypot(self.x, self.y)

    def __str__(self): (1)
        return "Vector2D({}, {})".format(self.x, self.y)

    def __add__(self, other): (2)
        result = Vector2D(self.x, self.y)
        result.move(other.x, other.y)
        return result

    def __getitem__(self, index): (3)
        if index not in (0, 1):
            raise IndexError("index {} must be 0 or 1".format(index))
        return (self.x, self.y)[index]


if __name__ == "__main__":

    v1 = Vector2D(3, 4)
    v2 = Vector2D(4, 5)
    print(v1, v2) (4)

    v3 = v1 + v2 (5)
    print(v3[0], v3[1]) (6)
1 __str__ implements string conversion for instances of the class. This happens if we call str(v1) or if we print an object. See also the explanation for (4) below.
2 __add__ implements + for objects of the class. See also the explanation for (5) below. Python lists and tuples also implement this method.
3 __getitem__ implements indexed access using square brackets. See also the explanation for (6) below. Python strings, lists, tuples and dictionaries also implement this method.
4 print tries to convert its arguments to strings if possible, and thus calls __str__ in this case. You see the result in the output below.
5 Is the same as v3 = v1.__add__(v2).
6 v3[0] is the same as v3.__getitem__(0).

Compare the output to the code above and check the previous explanations again:

1 2 3
$ python vector_2d_fancy.py Vector2D(3, 4) Vector2D(4, 5) 7 9
Exercise

Implement methods __mul__(self, other) and __rmul__(self, other) which return a scaled Vector2D if other is an int or float, and the dot product if other is an instance of Vector2D. Also implement __imul__ for multiplication with an int or float. You must first look up the purpose of these dunder methods.

10.5. Private methods

Sometimes we implement methods which are used from within other methods of the class but are not intended to be used by client code. Such methods are called private methods. Programming languages such as Java or C++ provide syntax to mark methods as private, and the respective compiler prohibits calling those methods from client code.

Python lacks such a mechanism. Instead most Python programmers mark private methods by preceding the name with a single underscore _. This serves as a signal to client code: "don’t call me, and if you do, don’t expect that this works as expected".

Methods with names starting but not ending with double underscores have a different purpose (so called name mangling) and should be avoided unless you understand exactly what you are doing. Nevertheless you will find bad examples using such names on the web.
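A small sketch (the class and method names are made up for illustration): _only_positive is a private helper which is only called from within the class, whereas mean belongs to the public API:

# private_method_example.py
class Statistics:

    def __init__(self, values):
        self._values = self._only_positive(values)

    def _only_positive(self, values):
        # private helper, not intended to be called by client code
        return [value for value in values if value > 0]

    def mean(self):
        return sum(self._values) / len(self._values)


stats = Statistics([1, -2, 3, 5])
print(stats.mean())    # client code only uses the public method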

10.6. About Software Design Patterns

A software design pattern is a general, reusable solution to a commonly occurring problem in software design. Such design patterns formalize best practices to help programmers to solve common problems.

The field was pioneered by the book [gof] titled Design Patterns: Elements of Reusable Object-Oriented Software by the so called Gang of Four: Erich Gamma, Richard Helm, Ralph Johnson and John Vlissides.

Such design patterns also depend on the used programming language, e.g. design patterns can help to circumvent problems caused by a static type system or the lack of first class functions.

We already introduced design patterns such as MVC and the Three Tier Application, and will introduce the Template Method Pattern below. Beyond these examples, discussing software design patterns in depth is beyond the scope of this book; we recommend THIS BOOK to the interested reader.

TODO: Good book ! TODO: a few examples as Three Tier Application, MVC

10.7. Open/Closed Principle

In object-oriented programming, the open/closed principle states "software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification"; that is, such an entity can allow its behaviour to be extended without modifying its source code.
— https://en.wikipedia.org/wiki/Open/closed_principle

This might sound theoretical and slightly vague, so we better go on with some examples.

The so called template method pattern is a strategy to support the open/closed principle. The basic idea is to implement modifications of a base "operation" by means of sub classes.

The following base class Adder adds numbers of a given list or tuple of numbers, the sub class Multiplier implements a variant:

# template_method_pattern.py
class Adder:

    def accumulate(self, values): (1)
        result = values[0]
        for value in values[1:]:
            result = self.operation(result, value) (2)
        return result

    def operation(self, value_1, value_2): (3)
        return value_1 + value_2


class Multiplier(Adder):

    def operation(self, value_1, value_2): (4)
        return value_1 * value_2


data = [1, 2, 3, 4]

adder = Adder()
print("sum of", data, "is", adder.accumulate(data))

multiplier = Multiplier()
print("product of", data, "is", multiplier.accumulate(data))
1 accumulate adds up given numbers based on the method operation.
2 accumulate uses the operation method.
3 operation just adds two numbers.
4 The Multiplier class reuses the accumulate method but operation implements multiplication of two numbers.

And this is the output:

1 2 3
$ python template_method_pattern.py sum of [1, 2, 3, 4] is 10 product of [1, 2, 3, 4] is 24

This is a very simple example for an algorithm which is open for extension while the core code stays closed for modification.

Benefits
  • No "monster classes" with if-elif-else constructs to switch between algorithm variants.

  • Instead of scattering implementation details of a variant over multiple methods in a single class, these details can be grouped within sub classes. This supports the idea of avoiding divergent changes.

The code above would fail if we pass data other than lists or tuples of numbers, or if this collection is empty (the access values[0] then raises an IndexError). A better implementation would check data first. See also defensive programming.
Abstract base classes

An abstract base class is a class which does not implement all methods required to make it work. Here we could implement a base class AbstractAccumulator with the method accumulate but without the method operation. Instead we would implement sub classes Adder and Multiplier offering this missing method.

# abstract_base_class.py
class AbstractAccumulator:

    def accumulate(self, values):
        result = values[0]
        for value in values[1:]:
            result = self.operation(result, value)
        return result

    def operation(self, value_1, value_2): (1)
        raise NotImplementedError()


class Adder(AbstractAccumulator):

    def operation(self, value_1, value_2):
        return value_1 + value_2


class Multiplier(AbstractAccumulator):

    def operation(self, value_1, value_2):
        return value_1 * value_2


data = [1, 2, 3, 4]

adder = Adder()
print("sum of", data, "is", adder.accumulate(data))

multiplier = Multiplier()
print("product of", data, "is", multiplier.accumulate(data))
1 operation raises NotImplementedError to prevent using the abstract base class directly. A better, but longer, example would use the abc module from the standard library.

10.7.1. Strategy pattern

In contrast to the template method pattern we configure the behaviour of a class by passing objects to the initializer:

# strategy_pattern.py
class Accumulator:

    def __init__(self, operation):
        self._operation = operation (2)

    def accumulate(self, values):
        result = values[0]
        for value in values[1:]:
            result = self._operation(result, value) (3)
        return result


data = [1, 2, 3, 4]


def add(a, b):
    return a + b


adder = Accumulator(add) (1)
print("sum of", data, "is", adder.accumulate(data))


def multiply(a, b):
    return a * b


multiplier = Accumulator(multiply) (4)
print("product of", data, "is", multiplier.accumulate(data))
1 To configure the object we pass the function add to the initializer.
2 We store this function as a private attribute.
3 Within accumulate we use this attribute, in this example to sum the given values.
4 We can reuse the same class with a different behaviour by providing a different function.

This is a simple example using functions; in more complex situations we would pass one or more objects with specified methods to __init__.

Strategy pattern outside object oriented programming

The basic idea of the strategy pattern can be found in numerical algorithms without requiring object oriented programming.

The following function approx_zero implements the bisection method to find a zero of a given function:

# bisect.py
def approx_zero(f, a, b, eps=1e-8):
    assert f(a) * f(b) < 0 (2)
    while abs(a - b) > eps:
        middle = (a + b) / 2
        if f(middle) * f(a) > 0: (3)
            a = middle
        else:
            b = middle
    return (a + b) / 2.0


def polynomial(x):
    return x ** 2 - 2


sqrt2 = approx_zero(polynomial, 1, 2) (1)
print("zero is", sqrt2, "with residual", polynomial(sqrt2))
1 We want to approximate the square root of 2 by approximating the zero of the given function polynomial. We know that the sought value lies between 1 and 2.
2 To ensure that the algorithm works, f must have different signs at a and b.
3 This checks if f(middle) and f(a) have the same sign.
1 2
$ python bisect.py zero is 1.414213564246893 with residual 5.29990096254096e-09

Passing functions around like this was much more cumbersome in older programming languages such as Fortran 77, where the function name was often hard coded in the bisection implementation and reusing the algorithm in different places was error prone.

11. Basics of Code optimization

11.1. An example

The following function count_common takes two lists reference and data and counts how many elements of reference occur also in data. This is a simple and slightly unrealistic example, but good enough to explain a few fundamental principles of code performance and optimization.

# time_measurement.py
import random
import time


def count_common(reference, data):
    """counts how many items of reference are present in data"""
    matches = 0
    for value in reference:
        if value in data:
            matches += 1
    return matches


for n in (2000, 4000, 8000, 16000):
    reference = [random.randint(0, n) for _ in range(n)] (1)
    data = [random.randint(0, 2 * n) for _ in range(n)]

    started = time.time() (2)
    count_common(reference, data)
    needed = time.time() - started (3)

    print("n = {:5d} time: {:5.2f} seconds".format(n, needed))
1 random.randint(0, n) creates a pseudo random number in the range 0 to n (limits included).
2 time.time() returns the number of seconds (including figures after the decimal point) since 1st of January 1970 (this date is also called the Unix epoch).
3 Calling time.time() again and subtracting the value of started gives the observed runtime for executing count_common.

Timing measurements in this chapter may vary, so don’t expect to get exactly the same output here and in following examples:

1 2 3 4 5
$ python time_measurement.py n = 2000 time: 0.05 seconds n = 4000 time: 0.19 seconds n = 8000 time: 0.75 seconds n = 16000 time: 3.02 seconds

We observe that every time we double the size, the overall runtime increases approximately by a factor of four. This is the same as saying that the runtime is proportional to \$n^2\$.

A simple calculation shows that n = 250000 would result in about 10 minutes processing time, and running the code with n = 610000 would need about one hour to finish ! This also means that a program which works well on small data sets during development can have an unacceptable runtime in real world situations.
Profiling

Tools called profilers provide means to understand how the overall runtime of a program is distributed over functions and individual lines of code. At the time of writing this book, the Python package line_profiler is our preferred profiling tool.

To analyze code with line_profiler we first have to install it using pip install line_profiler. In case you are still using Python 2.7, first run pip install 'ipython<6', else pip install line_profiler will fail.

Next we decorate the function we want to inspect with @profile:

# profiling.py
import random
import time


@profile
def count_common(reference, data):
    matches = 0
    for value in reference:
        if value in data:
            matches += 1
    return matches


n = 5000
reference = [random.randint(0, n) for _ in range(n)]
data = [random.randint(0, 2 * n) for _ in range(n)]

count_common(reference, data)

line_profiler offers a command line tool kernprof which we use as follows:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
$ kernprof -vl profiling.py Wrote profile results to profiling.py.lprof Timer unit: 1e-06 s Total time: 0.300982 s File: profiling.py Function: count_common at line 5 Line # Hits Time Per Hit % Time Line Contents ============================================================== 5 @profile 6 def count_common(reference, data): 7 1 6.0 6.0 0.0 matches = 0 8 5001 1788.0 0.4 0.6 for value in reference: 9 5000 298343.0 59.7 99.1 if value in data: 10 1965 845.0 0.4 0.3 matches += 1 11 1 0.0 0.0 0.0 return matches

The output shows us:

  • Starting from line 13 we see line-wise time measurements for the decorated function.

  • Column Hits counts how often the actual line was executed.

  • Column Time is the overall time spent executing this line in µs.

  • Column Per Hit is the average execution time of this line in µs.

  • Column % Time is the fraction of the overall time running the function.

So we see that about 99% of the runtime is spent in if value in data !

Analysis

Checking if an item is an element of a list can be pretty slow: the Python interpreter (in the worst case) has to iterate through the full list to do this check. So on average we can expect that the runtime of this line is proportional to the length n of the list. We further see that this line is executed n times, which supports our observation that the overall runtime is proportional to \$n^2\$.

How to fix this ?

The appropriate and optimized data type for checking membership is set, not list. For a set, the runtime of x in y is on average constant, independent of the size of y.

As our program is independent of the order of the values in data, we do no harm when we convert data to a set:

# profiling_optimized.py
import random
import time


@profile
def count_common(reference, data):
    matches = 0
    # sets are faster for membership test (2)
    data = set(data) (1)
    for value in reference:
        if value in data:
            matches += 1
    return matches


n = 5000
reference = [random.randint(0, n) for _ in range(n)]
data = [random.randint(0, 2 * n) for _ in range(n)]

count_common(reference, data)
1 This is the modification.
2 We comment the conversion, because its purpose is not obvious.

And this is the result running the profiler again:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
$ kernprof -vl profiling_optimized.py Wrote profile results to profiling_optimized.py.lprof Timer unit: 1e-06 s Total time: 0.004054 s File: profiling_optimized.py Function: count_common at line 5 Line # Hits Time Per Hit % Time Line Contents ============================================================== 5 @profile 6 def count_common(reference, data): 7 1 6.0 6.0 0.1 matches = 0 8 # sets are faster for membership test 9 1 229.0 229.0 5.6 data = set(data) 10 5001 1469.0 0.3 36.2 for value in reference: 11 5000 1720.0 0.3 42.4 if value in data: 12 1987 629.0 0.3 15.5 matches += 1 13 1 1.0 1.0 0.0 return matches

Overall runtime (the Total time line) dropped from about 0.30 to 0.004 seconds just by choosing the right data structure ! You can also see that the conversion data = set(data) only adds an insignificant extra runtime. Further, we see no "hot spots" of slow code anymore; the Per Hit values of the lines inside the loop are about the same.

Final version

We optimize our code further: To count the number of common elements we convert reference and data to sets and compute the size of the intersection of both sets.

# profiling_further_optimized.py
import random
import time


@profile
def count_common(reference, data):
    reference = set(reference)
    data = set(data)
    return len(reference & data)


n = 5000
reference = [random.randint(0, n) for _ in range(n)]
data = [random.randint(0, 2 * n) for _ in range(n)]

count_common(reference, data)

The result is impressive. We further reduced the runtime, this time by a factor of about 6:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
$ kernprof -vl profiling_further_optimized.py Wrote profile results to profiling_further_optimized.py.lprof Timer unit: 1e-06 s Total time: 0.00063 s File: profiling_further_optimized.py Function: count_common at line 5 Line # Hits Time Per Hit % Time Line Contents ============================================================== 5 @profile 6 def count_common(reference, data): 7 1 272.0 272.0 43.2 reference = set(reference) 8 1 240.0 240.0 38.1 data = set(data) 9 1 118.0 118.0 18.7 return len(reference & data)
Exercise

Measure runtimes of both optimized versions for different n as we did in the timing example. Confirm that for both optimizations runtime is approximately proportional to n but with different proportionality factors.

Insight

Assume we can describe the runtime of a program as \$T(n) = \alpha n^k\$. Then we can either focus on lowering the value of \$\alpha\$, or we can try to reduce \$k\$, which makes our program future proof for larger n.

In the examples above, the first optimization reduced \$k\$ from 2 to 1; the last code change lowered \$\alpha\$.

11.2. Some theory

The previous section demonstrated some basic principles, now we introduce and explain some general concepts.

11.2.1. Big O Notation

Big O notation is about the rate of growth of runtime depending on a given problem size n. Usually n refers to the amount of data to process, e.g. the number of elements of a list we want to sort.

O(f(n))

The so called Big O notation describes the worst case runtime of a program depending on an input parameter n. When we say the runtime is O(f(n)), we mean that for large n the worst case runtime is proportional to f(n), plus some slower growing terms depending on n which we ignore.

Using this notation we can say that we reduced the runtime above from \$O(n^2)\$ to \$O(n)\$.

Example

Assume we can describe the runtime of a given program as \$T(n) = 300 n + 0.01 n^2 + 50 log(n)\$. The dominating term for large n in this case is \$n^2\$ and as we do not care about the constant proportionality factors, we can say "runtime is \$O(n^2)\$".

The Big O notation is useful to classify computing problems and algorithms. It also tells us how a program’s runtime behaves for larger inputs.

The most important classes and examples, in increasing order, are:

O(1)

Constant runtime independent of the problem size. Examples are: algebraic operations such as adding two numbers, calling implementations of analytical functions such as math.sin, dictionary and set insertion and lookup.

O(log n)

Runtime grows with the logarithm of n. Example: binary search in a sorted list.

O(n)

Runtime grows proportionally to n. Examples are: iterating over a list, the membership test x in y for a list y, building a dictionary or set from n elements.

O(n log n)

Example: good sorting algorithms, such as the one built into Python.

O(n^2)

Runtime grows with the square of n. Example: the first version of count_common above, where the membership test on a list hides a second loop.

11.2.2. A Runtime Analysis Example

We want to find the positions of k different elements in a list li of n arbitrary numbers. We compare the following three solutions for this problem:

  1. We call the li.index method k times. One call of .index is O(n) thus total runtime will be O(kn).

  2. We sort the list first, then use the bisect method from the standard library to use the binary search algorithm for the lookups. Sorting in Python is O(n log n), binary search is O(log n). Thus overall runtime is O(n log n) + O(k log n).

  3. We create a dictionary mapping elements to their positions first, then use k times dictionary lookup. Building the dictionary is O(n), according to the Big-O examples above k times lookup is O(k). Thus overall runtime is O(n) + O(k).

We can assume that the hidden proportionality factors are of the same magnitude and that n is not too small.

Exercise

Plot runtimes of all three approaches for different proportionality factors and different k and n.
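As a starting point for the exercise, a minimal sketch of the three approaches could look as follows. The function names and the distinct random test numbers are illustrative; timing for varying k and n and plotting the results is left to the reader.

# lookup_comparison.py -- a sketch for the exercise
import bisect
import random
import time

def with_index(li, queries):
    # approach 1: repeated list.index, O(kn)
    return [li.index(q) for q in queries]

def with_bisect(li, queries):
    # approach 2: sort (value, position) pairs once, O(n log n),
    # then binary search per query, O(log n)
    pairs = sorted((value, pos) for pos, value in enumerate(li))
    values = [value for value, _ in pairs]
    return [pairs[bisect.bisect_left(values, q)][1] for q in queries]

def with_dict(li, queries):
    # approach 3: build a value -> position dictionary once, O(n),
    # then each lookup is O(1) on average
    positions = {value: pos for pos, value in enumerate(li)}
    return [positions[q] for q in queries]

n, k = 100000, 100
li = random.sample(range(10 * n), n)     # n distinct numbers
queries = random.sample(li, k)           # k elements contained in li

for function in (with_index, with_bisect, with_dict):
    started = time.time()
    function(li, queries)
    print("{:12s} needed {:.4f} seconds".format(function.__name__, time.time() - started))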

Analysis
  1. For very small k the extra cost introduced by sorting would not pay off, and the first approach should be preferred.

  2. But even for moderate k, the O(k log n) lookups of binary search will by far outperform the O(kn) of repeated .index calls for larger n.

  3. Except for k = 1, this is the fastest of the three approaches, but it also incurs a memory overhead for the dictionary. This dictionary will consume about the same amount of memory as the given list, or even a bit more.

11.3. Strategies to reduce runtime

11.3.1. Avoid duplicate computations

Avoid computing the same result over and over again. Move such computations, if possible, out of loops or functions. Try to precompute data and use Python containers such as lists, dictionaries or sets to store intermediate results.

Especially slow operations, such as reading the same file over and over again, should not appear inside a loop or a frequently called function.
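A minimal sketch of moving such a computation out of a loop; the file name and the stop-word filtering task are purely illustrative:

def remove_stop_words_slow(lines):
    result = []
    for line in lines:
        # the same file is read again for every single line
        stop_words = set(open("stop_words.txt").read().split())
        result.append([w for w in line.split() if w not in stop_words])
    return result

def remove_stop_words_fast(lines):
    # read the file once, outside the loop
    stop_words = set(open("stop_words.txt").read().split())
    return [[w for w in line.split() if w not in stop_words] for line in lines]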

Caching as explained below is a generic strategy to avoid such duplicate computations.

11.3.2. Avoid loops

Loops in Python are slow. This is caused by the dynamic implementation of loops, which allows us to write loops like for c in "abcd": or for line in open('names.txt'):. C loops are much more restricted, but can be compiled directly to very fast low level machine code instructions.

There are two aspects to consider:

Reduce number of nested loops

The number of nested loops contributes to the Big-O class as described above. So try to reduce the nesting level if possible. Sometimes this can be achieved by precomputing intermediate results outside the loop. Spend some time to come up with a different strategy with fewer nested loops. Otherwise, ask computer scientists for known better solutions.

Replace for loops and lists

Replace loops and lists by better suited data structures; the last optimization example demonstrated this. Other data structures such as numpy arrays offer highly optimized alternatives for frequently used numerical operations. For example, adding a number to a numpy vector or matrix is much faster than an equivalent Python implementation based on lists. Libraries such as numpy or pandas implement many operations in low level C routines. This is possible because all elements of a numpy array, or all data in a column of a pandas dataframe, have the same data type, whereas Python lists have no such restriction.
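A minimal sketch comparing a plain Python list operation with the vectorized numpy equivalent; the array size is illustrative:

import time
import numpy as np

data = list(range(1000000))
arr = np.array(data)

started = time.time()
incremented = [x + 1 for x in data]      # Python loop over a list
print("list : {:.4f} seconds".format(time.time() - started))

started = time.time()
incremented = arr + 1                    # vectorized numpy operation
print("numpy: {:.4f} seconds".format(time.time() - started))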

Learn about vectorization and broadcasting to use numpy and pandas efficiently!
Example

The so-called accumulated maxima of a list [a0, a1, a2, …] are [a0, max(a0, a1), max(a0, a1, a2), …]. Here we use itertools from the standard library to compute the accumulated maxima of a given list of numbers and compare the runtime to a naive Python implementation.

# itertools_example.py
from itertools import accumulate

@profile
def naive_implementation(a):
    latest = a[0]
    result = [latest]
    for ai in a[1:]:
        new = max(latest, ai)
        latest = new
        result.append(new)
    return result

@profile
def with_itertools(a):
    return list(accumulate(a, max))

a = list(range(10000))

assert naive_implementation(a) == with_itertools(a)
$ kernprof -vl itertools_example.py
Wrote profile results to itertools_example.py.lprof
Timer unit: 1e-06 s

Total time: 0.015882 s
File: itertools_example.py
Function: naive_implementation at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           @profile
     5                                           def naive_implementation(a):
     6         1          6.0      6.0      0.0      latest = a[0]
     7         1          1.0      1.0      0.0      result = [latest]
     8     10000       3355.0      0.3     21.1      for ai in a[1:]:
     9      9999       5071.0      0.5     31.9          new = max(latest, ai)
    10      9999       3442.0      0.3     21.7          latest = new
    11      9999       4007.0      0.4     25.2          result.append(new)
    12         1          0.0      0.0      0.0      return result

Total time: 0.001482 s
File: itertools_example.py
Function: with_itertools at line 14

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    14                                           @profile
    15                                           def with_itertools(a):
    16         1       1482.0   1482.0    100.0      return list(accumulate(a, max))

The itertools version is about 10 times faster than our naive Python implementation.

11.3.3. Caching

Caching is the technique of storing results of a long running computation in a cache (usually a dictionary using the input data of the computation as key). Computing the same result again can be avoided by first looking up the result in the cache. Python's standard library offers an easy-to-use LRU cache.

# lru_example.py
from functools import lru_cache
import time

@lru_cache()                              (1)
def pseudo_slow_function(seconds):
    time.sleep(seconds)                   (2)
    return seconds * 2

@profile
def main():
    assert pseudo_slow_function(0.2) == 0.4
    assert pseudo_slow_function(0.1) == 0.2
    assert pseudo_slow_function(0.2) == 0.4
    assert pseudo_slow_function(0.1) == 0.2

main()
1 This is how we annotate a function to be cached.
2 time.sleep(seconds) pauses program execution for seconds seconds.
$ kernprof -vl lru_example.py
Wrote profile results to lru_example.py.lprof
Timer unit: 1e-06 s

Total time: 0.300732 s
File: lru_example.py
Function: main at line 10

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    10                                           @profile
    11                                           def main():
    12         1   200429.0 200429.0     66.6        assert pseudo_slow_function(0.2) == 0.4
    13         1   100289.0 100289.0     33.3        assert pseudo_slow_function(0.1) == 0.2
    14         1       11.0     11.0      0.0        assert pseudo_slow_function(0.2) == 0.4
    15         1        3.0      3.0      0.0        assert pseudo_slow_function(0.1) == 0.2

You can see above that the µs values in the Time column match the function arguments for the first two invocations (lines 12 and 13 of the profiler output): sleeping 0.2 resp. 0.1 seconds. Lines 14 and 15 show that calling the function again with the same arguments returns the correct values almost instantly.

Another option is to use joblib.Memory, which persists the cache on disk, so that the cache content is preserved across consecutive program executions, at the cost of a small and usually negligible runtime overhead for reading data from and writing data to disk.
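A minimal sketch of how joblib.Memory can be used; the cache directory and the decorated function are illustrative:

from joblib import Memory

memory = Memory("./joblib_cache", verbose=0)   # cached results are persisted in this folder

@memory.cache
def expensive_computation(x):
    # placeholder for a long running computation
    return x * 2

expensive_computation(21)   # computed and written to disk
expensive_computation(21)   # read back from the cache, even in a later program run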

11.3.4. Data structures and Algorithms

As we have seen before, dictionaries and sets are pretty fast in Python, even for large data sets. Choosing the right data structure or algorithm may reduce the Big-O complexity significantly.
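A minimal sketch illustrating the difference between membership tests on a list and on a set; the sizes and random data are illustrative:

import random
import time

n = 10000
haystack_list = [random.randint(0, 10 * n) for _ in range(n)]
haystack_set = set(haystack_list)
needles = [random.randint(0, 10 * n) for _ in range(n)]

started = time.time()
hits = sum(1 for x in needles if x in haystack_list)   # O(n) per lookup
print("list lookups: {:.4f} seconds".format(time.time() - started))

started = time.time()
hits = sum(1 for x in needles if x in haystack_set)    # O(1) per lookup on average
print("set lookups : {:.4f} seconds".format(time.time() - started))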

This is where computer science comes into play. This field of research tries to develop optimal algorithms in terms of Big O; the hidden proportionality factor usually depends on the chosen programming language and implementation details.

So try to learn some basics of algorithms and data structures. Don't hesitate to ask computer science experts; you can expect good answers on Stack Overflow if you ask politely. Try to provide a minimal and short example that allows others to reproduce your problem.

11.3.5. Avoiding memory exhaustion

Using more than your computer's available RAM causes swapping, which slows down computations tremendously. A typical effect of swapping is that your computer becomes unresponsive whilst processing larger data sets. So choosing a suitable data structure or algorithm can also depend on memory consumption besides runtime efficiency. See Best Practices below.
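As a minimal sketch of keeping memory consumption low, the following processes a large file line by line instead of reading it into one big list first; the file path and the filtering criterion are illustrative:

def count_long_lines(path):
    count = 0
    with open(path) as fh:
        for line in fh:          # the file is streamed, never loaded as a whole
            if len(line) > 80:
                count += 1
    return count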

11.3.6. Running computations on multiple cores or machines

Using multiple cores or computers will not reduce the Big O class of your computation, but it may reduce the associated proportionality factor. In the very best case a computation running on n cores / computers will be faster by a factor of n than the same computation on a single core / computer. But this requires that your computation can be split into n independent sub-problems and that handling these sub-problems does not introduce a significant runtime overhead.

Map Reduce

Map reduce is a programming model which is only applicable if we can process parts of our data independently. The concept of map reduce is to split the data into parts that are processed in parallel (map step), followed by a final reduce step that combines the partial results into the final result.

A map reduce implementation to sum up a huge set of numbers would split the set into n approximately equal-sized parts, compute the sum of each part in parallel, and finally add up the partial sums to the overall sum.

In Python the multiprocessing module from the standard library supports map reduce style computations on multiple cores.

The idea is that multiprocessing launches multiple Python interpreters on different cores, also called workers. The main interpreter running the actual program then distributes computations to these workers depending on their availability.

Example using multiprocessing
# multiprocessing_example.py
import time
import multiprocessing

args = (0.5, 0.3, 0.2)

def pseudo_computation(n):
    time.sleep(n)
    return n + 1

def sequential():                                  (1)
    for arg in args:
        pseudo_computation(arg)

def parallel():                                    (2)
    n_cores = multiprocessing.cpu_count()
    p = multiprocessing.Pool(n_cores - 1)          (3)
    result = p.map(pseudo_computation, args)       (4)
    p.close()                                      (5)
    p.join()
    assert result == [1.5, 1.3, 1.2]

def time_function(function):                       (6)
    started = time.time()
    result = function()
    print("{:10s} needed {:.3f} seconds".format(function.__name__, time.time() - started))

time_function(sequential)
time_function(parallel)
1 We call pseudo_computation sequentially for all values of args. As time.sleep dominates the runtime of pseudo_computation, the overall execution time of sequential should be about 1 second.
2 This function applies pseudo_computation to the values from args on different cores.
3 We create a process pool here, using "number of cores minus one" workers. Using all available cores can make your computer unresponsive until the parallel computations finish. Every worker is a separate Python interpreter running on a different core.
4 This is how we distribute calling pseudo_computation with all values from args to the created pool of workers. As soon as all values from args are processed, Pool.map returns a list of the return values of pseudo_computation in the same order as the arguments.
5 Never forget to shut down the process pool as done in this and the following line. Otherwise your main program might hang when it finishes.
6 Regrettably, line_profiler and multiprocessing don't work together; this is why we implement our own function for simple time measurements.
$ python multiprocessing_example.py
sequential needed 1.001 seconds
parallel   needed 0.632 seconds

The output shows that the sequential function needs about 1 second, as expected. The parallel version is dominated by the slowest call of pseudo_computation, which takes about 0.5 seconds; the measured 0.63 seconds confirm this and also indicate the extra time needed for starting and shutting down the workers.

Creating a process pool, transferring data to the workers and transferring results back comes with a runtime overhead that also depends on your computer's operating system. This overhead will dominate the overall runtime if the executed function returns quickly, e.g. within fractions of a second. Therefore, try to reuse such a process pool within a program if you call methods of Pool such as .map in different places of your code.

11.3.7. Running computations on a computing cluster

Computing clusters for high performance computing (HPC) allow running computations on hundreds or thousands of cores in parallel. If you have access to such an infrastructure, contact the administrators to learn how to use it.

MPI is one of the most versatile libraries for large scale parallel computations but also has a steep learning curve. mpi4py allows using MPI from Python.
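A minimal mpi4py sketch; the file name and the number of processes passed to mpirun are illustrative:

# mpi_hello.py -- run e.g. with: mpirun -n 4 python mpi_hello.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()    # id of this process within the communicator
size = comm.Get_size()    # total number of processes

print("process {} of {}".format(rank, size))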

Another widely used framework for distributed computing is Apache Spark which also supports Python.

11.3.8. Cython

Python can also load and run compiled C (and C++) code. Cython supports developing such extension modules for non-C or C++ programmers. The Cython language consists of Python plus a few extra statements such as type declarations. Cython's tool chain translates this code to C or C++, then compiles and links it to a module suitable for import from Python code. As this language is close to Python, Cython is more accessible for most Python programmers than writing C or C++ code directly, although maximizing Cython's effect may need some deeper understanding of C or C++.

Especially code containing nested for .. in range(..) loops benefits from using Cython. Speedups by factors of 40 up to 100 are no exception here.

The process of using Cython is as follows:

  1. Organize your existing Python code as a Python Package.

  2. Modify your setup.py to use Cython (a minimal sketch follows after this list).

  3. Move one or more slow Python functions without any code modification to a new file with .pyx extension.

  4. python setup.py develop will then compile the .pyx file to an "importable" Python module.

  5. Expect a speedup by a factor of 2 to 4.

  6. Add type annotations and apply other Cython features to further optimize your code.
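A minimal sketch of step 2, assuming a package named my_package; the name and layout are illustrative:

# setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="my_package",
    ext_modules=cythonize("my_package/*.pyx"),
)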

For "playing around" pyximport avoids the effort to setup a Python Package.

Cython Code Example
# cython_example.pyx
def factorial(int n):              (1)
    cdef long long result = 1      (2)
    cdef int i                     (3)
    for i in range(2, n + 1):
        result *= i
    return result
1 When we declare a Cython function we can also specify argument types.
2 We also declare the type of the result. A long long C type can hold large integers.
3 To enable Cython's loop optimizations we also declare the type of our looping variable i.

This Cython implementation is the same as the Python implementation except for the type declarations:

# cython_benchmark.py
import pyximport
pyximport.install()                               (1)

from cython_example import factorial as f_c       (2)

def f(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

@profile
def main():
    for _ in range(100): f(25)                    (3)
    for _ in range(100): f_c(25)

main()
1 This installs a so-called import hook.
2 After installing this import hook, the import here first checks if the file cython_example.pyx exists and then triggers the Cython tool chain, including C code generation from cython_example.pyx plus calling the C compiler and linker. Finally the function factorial is imported.
3 We call the Cython and the Python implementation 100 times each to get comparable numbers, as line_profiler is not very exact for short execution times.

And here is what line_profiler tells us:

$ kernprof -vl cython_benchmark.py
Wrote profile results to cython_benchmark.py.lprof
Timer unit: 1e-06 s

Total time: 0.000777 s
File: cython_benchmark.py
Function: main at line 12

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    12                                           @profile
    13                                           def main():
    14       101        740.0      7.3     95.2      for _ in range(100): f(25)
    15       101         37.0      0.4      4.8      for _ in range(100): f_c(25)

So the Cython implementation is faster by a factor of ~20.

More about Cython

The official (and brief) Cython tutorial is available here. This Cython book is more detailed, with many programming examples.

Cython can also be used to implement Python bindings to existing C or C++ code.

Use the -a flag when invoking Cython. This creates an HTML file showing the slow spots in your code which would benefit from further type annotations (also demonstrated here). Switching off boundary checks and wraparound can also reduce runtime for array-centric code.

11.3.9. Numba

Numba may speed up numerical and scientific computations to C speed using a so-called Just In Time Compiler (JIT). Simply import numba, then annotate your function with @numba.jit to enable numba's JIT compilation. If you benchmark your code, call such functions at least twice: the first call is slower than subsequent runs, because the JIT analyses and compiles the function's code during this first execution.

Numba is thus easier to introduce than Cython, but it is also more focused on optimizing numpy array based code.
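A minimal sketch of using numba; the function and the array size are illustrative:

import numba
import numpy as np

@numba.jit(nopython=True)      # ask numba to compile the whole function
def array_sum(a):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i]
    return total

a = np.random.random(1000000)
array_sum(a)    # first call triggers JIT compilation and is slower
array_sum(a)    # subsequent calls run the compiled machine code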

11.3.10. I/O bound computations

The considerations up to now were related to so-called CPU bound computations, where the work of your computer's processor dominates the runtime.

In case your program loads large amounts of data from disk or fetches data from a remote web server, the data transfer can dominate the runtime, whilst the CPU is doing nothing but waiting for the data to arrive. In this case we say that the computation is I/O bound. I/O is an abbreviation for input/output.

Minimize file sizes

Switching from text based files to binary file formats, e.g. from csv to hdf5, can reduce file sizes and reading times by a significant factor.

pandas offers readers and writers for different text and binary file formats for heterogeneous tabular data. For data of a fixed numerical type and precision, numpy can be used.
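A minimal sketch of converting a csv file to hdf5 with pandas; the file names are illustrative, and writing hdf5 requires the pytables package:

import pandas as pd

df = pd.read_csv("measurements.csv")
df.to_hdf("measurements.h5", key="data")      # binary, smaller and faster to read

df = pd.read_hdf("measurements.h5", key="data")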

The sqlite3 module from the standard library, h5py and pytables offer fast access to specific elements or sections of data. This can be useful if we do not need the full data set for processing, or if the data set does not fit into the available memory.

Multithreading and asyncio

The standard library's threading module allows concurrent execution of Python code. This means parts of the code appear to run simultaneously. A so-called scheduler switches quickly between these parts, such that from the user's perspective this looks like parallel execution; but contrary to multi-core execution, the overall runtime of CPU bound code is not reduced.

Multithreaded programming is full of pitfalls and can cause severe debugging nightmares. The fact that threads share variables (contrary to using multiprocessing) can cause nasty problems such as race conditions.

Multi-threading can be used to download multiple files concurrently and thus improve I/O bound file transfer over the internet. The Queue tutorial shows a clean example of how to implement such a strategy without risking race conditions.
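As an alternative to the threading plus Queue approach mentioned above, the following minimal sketch uses ThreadPoolExecutor from the standard library's concurrent.futures module; the URLs are illustrative:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

urls = [
    "https://example.com/file_1.txt",
    "https://example.com/file_2.txt",
    "https://example.com/file_3.txt",
]

def download(url):
    return urlopen(url).read()

# up to three downloads run concurrently; the CPU mostly waits for I/O
with ThreadPoolExecutor(max_workers=3) as pool:
    contents = list(pool.map(download, urls))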

Python 3 also offers means for so-called asynchronous I/O, but this is beyond the scope of this book.

11.4. Best practices

11.4.1. Slow or too slow?

Due to its interpreted and highly dynamic nature Python is not the fastest language, especially compared to compiled languages such as C. This does not automatically make Python programs unusable. A Python implementation running 100 ms instead of 1 ms in C would not be perceived as slow when run a single time. Executed a million times, the difference becomes notable: roughly 28 hours vs. 17 minutes.

Implementing the same program in C requires much more programming effort; thus one has to balance Python's fast development speed against its slower execution speed.

11.4.2. Measure first

Never guess which parts of your program are slow! Experience shows that your assumptions about the slow parts will be wrong in most cases.

Always use a profiler or similar tools first when you consider optimizing runtime.

Be careful

A profiler's measurements are not 100% exact, and the numbers should not be taken as absolute truth. Also consider that a profiler slows down program execution due to the measurements it takes.

Practical advice

The data set size or input parameters you choose can have a huge impact on which parts of your code dominate the overall runtime. Assume

def foo(n):
    ...

def bar(n):
    ...

with runtimes

\$T_{foo}(n) = \alpha n^2\$

and

\$T_{bar}(n) = \beta n\$

but with \$\alpha\$ much smaller than \$\beta\$.

A profiler could indicate that bar is slower than foo for small values of n, but would show a different result for larger values of n.

Choosing a small n during iterative optimization of your code can lead to an efficient programming process due to short feedback times, but comes with the risk of optimizing the wrong parts.

So profile for a small n and a reasonably large n first, so that you see the differences in how the runtime is distributed over your code. Then try to find a reasonable n that reduces your development time whilst still focusing on the important code parts.

Using time.time()

If you use a "home made" time measurement, e.g. using the time.time() function, measure the same code with the same configuration multiple times. To compensate for the bias introduced by background processes on your computer, take the minimum of the measured times as the reference value for comparing different versions of your code. An alternative approach is to use the timeit module from the standard library.
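A minimal sketch of such a home made measurement; the number of repetitions and the measured function are illustrative:

import time

def measure(function, repetitions=5):
    # repeat the measurement and take the minimum to reduce the bias
    # introduced by other processes running on the machine
    timings = []
    for _ in range(repetitions):
        started = time.time()
        function()
        timings.append(time.time() - started)
    return min(timings)

print("{:.4f} seconds".format(measure(lambda: sum(range(1000000)))))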

Measure or observe memory consumption

As mentioned above, swapping can slow down computations tremendously. Either monitor memory consumption in your system's task manager (which should indicate swapping) or use tools like memory_profiler to check memory usage. Line-wise memory profiling can be very inaccurate, so profile multiple times and don't expect exact numbers.

11.4.3. Premature optimization is the root of all evil

Premature optimization is the root of all evil.
— Donald Knuth
1974 Turing Award Lecture

This statement by Donald Knuth, an influential computer scientist, does not say that optimizing code is evil in general, but that premature optimization should be avoided.

The reason for this is that optimized code often requires programming tricks that make the code less clear and harder to understand. Early optimization could also focus on parts that do not dominate the runtime of the final version, and thus sacrifice code quality unnecessarily.

So this is the recommended order:

  1. Get your code working, focus on readability and good programming practices.

  2. Optimize your code with the help of a profiler.

  3. Document your optimizations.

12. Misc

Glossary

select is not broken

…​..

bike shedding

…​..

pythonic

…​.

syntactic sugar

…​.

Profiler

..

Pseudo Random

..

Just In Time Compiler(JIT)

…​

RAM

…​

IDE: …​.

PEP: …​.

SSH: …​.

URL: ..

References

  • [gof] Erich Gamma, Richard Helm, Ralph Johnson & John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley. 1994.

  • [pp] Andrew Hunt & David Thomas. The Pragmatic Programmer: From Journeyman to Master. Addison-Wesley. 1999.

13. Some Asciidoctor Examples

13.1. Example 1

Rubies are red, Topazes are blue. My eye is black abc

13.2. Example 2

this is a sidebar

this is another sidebar

This is a note block with complex content

A list
  • item

  • item

  • item

This is anote asdfl ja ljd ldakfsj dalfjdlkjfaljdflöja adsflkja dfl kjads öljdf ölajdf öldfjöldaf jadlf jadlfj dlfj adsflkja dfl kjads öljdf ölajdf öldfjöldaf jadlf jadlfj dlfj adsflkja dfl kjads öljdf ölajdf öldfjöldaf jadlf jadlfj dlfj
Single line tip
Don’t forget…​
Watch out for…​
Ensure that…​

13.3. Quote

Four score and seven years ago our fathers brought forth on this continent a new nation…​

13.4. TODO

  • asciidoc command line + file observer + live reload in browser, themes? pygments

    • subitem

  • set up rough structure

  • check into gitlab

13.5. With numbers

  1. asciidoc command line + file observer + live reload in browser, themes? pygments

    1. subitem

  2. set up rough structure

  3. check into gitlab

13.6. Source code listing

Code listings look cool with Asciidoctor and pygments.

#!/usr/bin/env python
import antigravity

try:
    antigravity.fly()              (1)
except FlytimeError as e:          (2)
    # um...not sure what to do now.
    pass
1 does not work
2 this is a fake exception

13.7. In-document references

A statement with a footnote.[1]

bla bla bla
bla bla bla
bla bla bla
bla bla bla

13.8. Other formatting

indented by one line in adoc file
indented by one line in adoc file

bold italic

This word is in backticks.

“Well the H2O formula written on their whiteboard could be part of a shopping list, but I don’t think the local bodega sells E=mc2,” Lazarus replied.

Werewolves are allergic to cinnamon.

13.9. Tables

Table 1. An example table

Col 1 | Col 2       | Col 3
------+-------------+------
6     | Three items | d
1     | Item 1      | a
2     | Item 2      | b
3     | Item 3      | c

13.10. Images

This should be some floating text.
This should be some floating text.
This should be some floating text.
This should be some floating text.
This should be some floating text.
This should be some floating text.

13.11. Quotes

  • Quotes:

    It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness…​
    — Charles Dickens
    A Tale of Two Cities

End of quote

13.12. Chapter with some references

The Pragmatic Programmer [pp] should be required reading for all developers. To learn all about design patterns, refer to the book by the “Gang of Four” [gof].


1. Clarification about this statement.