class: inverse, center, middle, remark-frontslide-content
---
class: middle

## About me

- Dr. Uwe Schmitt
- Work for Scientific IT Services (SIS)
- Scientific programmer
- I also work as a tutor and consultant.

---
class: center, inverse

# **Our Goal:** always produce the same results from the same data

---
class: center, inverse

# **Our Goal:** always produce the same results from the same data

## At any time

---
class: center, inverse

# **Our Goal:** always produce the same results from the same data

## At any time

## At any place

---
class: center, inverse

# **Our Goal:** always produce the same results from the same data

## At any time

## At any place

## By any person

---
class: middle

## What can go wrong?

1. Software / tools are not available (anymore).
2. The software used is fragile.
3. Processing steps are not documented.
4. Humans make mistakes during processing.

---
class: middle

## 1. Software / tools not available

- Use open source software / programming languages.
- Publish your code using an open source license.

---
class: middle

## 2. Software is fragile

- Google for "excel hell"!
---
class: middle

## 2. Software is fragile

- Excel: incorrect leap year calculation, it accepts 1900-02-29 as a valid date.
- [7 Worst Excel Mistakes of All Time](https://www.linkedin.com/pulse/7-worst-excel-mistakes-all-time-nate-coughran-cpa)
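
A minimal sketch, not part of the original slides, of why 1900-02-29 is an error: 1900 was not a leap year in the Gregorian calendar, and Python's standard `calendar` and `datetime` modules reject the date. The variable names are chosen for illustration only.

```python
import calendar
from datetime import date

# 1900 is divisible by 100 but not by 400, so it is not a leap year
is_leap_1900 = calendar.isleap(1900)
print("1900 is a leap year:", is_leap_1900)

# the date 1900-02-29 cannot even be constructed in Python
try:
    date(1900, 2, 29)
    date_exists = True
except ValueError:
    date_exists = False
print("1900-02-29 exists:", date_exists)
```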
---
class: middle, center, inverse

## 3. Processing steps are not documented.

## 4. How to avoid human mistakes?
---
class: middle

## Recipes / lab protocols:

- List of simple steps
- More or less exact instructions
- Executed by humans
---
class: middle

## Programs

```python
# read one number per line (assumed file layout)
with open("numbers.txt") as fh:
    numbers = [float(line) for line in fh if line.strip()]
average = sum(numbers) / len(numbers)
print("average is", average)
```

```bash
average is 12.34
```

- List of simple steps
- Exact instructions
- Executed by unforgiving computers

---
class: middle

## Why program?

- Reduce or eliminate manual steps in your analysis
- Automate as much as possible
- **Good code** is implicit documentation of how you produced your results
- Others can trace your steps
---
class: middle

*... the findings suggest that the outcomes of learning a computer language go beyond the content of that specific computer language.*
---
class: middle, center

## Learn to talk to the IT people.
---
class: middle

## How do I learn to program?

- Choose an easy-to-learn, open source language such as Python or R.
- R is preferable for advanced statistics and elaborate plotting.
- Python is preferable for data science and machine learning.
- I consider Python the clearer and more versatile programming language.
- There are many books and online courses!

---
class: center, middle

## Typical learning curve
---
class: center, middle, inverse

# Now I know programming, what can go wrong?
---
class: center, middle, inverse

# Now I know programming, what can go wrong?

## Actually a lot!

---
class: middle

## What can go wrong?

1. Programs change over time.
2. Programs can break.
3. Code can be complex.
4. Programs will run on other computers.

---
class: center, middle, inverse

## 1. Managing changes
---
class: middle

## Version control systems (VCS)

- **Time machines** for your source code and textual data.
- `git` is the most common tool for tracking changes over time.
- `git` ≠ `github`!
- `github`, `gitlab`: web frontends for managing git repositories.
- ETH has its own instance `gitlab.ethz.ch` for hosting code.

---
class: middle

## git benefits

- No more version numbers in file names!
- No need to keep old and outdated code in comments.
- Undo changes.
- Supports collaborative development.

---
class: middle

## Version your software

- Learn to write "packages" instead of emailing code.
- Use semantic versioning `x.y.z`:
  - `x` for major, incompatible updates (e.g. Python 2 vs. Python 3).
  - `y` for new features that do not break existing results.
  - `z` for bug fixes.
- "Freeze" dependencies: document the versions of external code you use.

---
class: center, middle, inverse

## 2. Programs can be incorrect
---
class: middle

## Why?

- You make mistakes during development.
- Software complexity grows during development.
- Others use your software in ways you did not intend.

---
class: middle

## Techniques

- Defensive programming.

```python
def average(data):
    assert len(data) > 0, "data must not be empty"
    ...
```

- Automated code tests: unit tests vs. regression tests.

```python
def test_average():
    assert average([1]) == 1
    assert average([1, 2]) == 1.5
    assert average([1, 2, 3]) == 2
```

- A collection of unit tests is a *test suite*.

---
class: middle, center, inverse

## 3. Code can be complex
---
class: middle

## Clean code ("you read code more often than you write it")

- Choose good names for variables and functions.
- Write many functions.
- **DRY** (don't repeat yourself): Avoid duplications.
- Write generic code: e.g. don't hard code file names.
- Document your program including the underlying concepts.
- Unit tests enforce better code structure.
- Read about "clean code".

---
class: middle

## Other best practices

- **KISS**: Keep it simple, stupid: Keep your solutions as simple as possible.
- **YAGNI**: You ain't gonna need it: Don't overdesign your programs.
- *In the face of ambiguity, refuse the temptation to guess*:
  - Don't try to fix invalid input.
  - Complain instead!
- Understand your programs instead of *programming by coincidence*.
- Be brave enough to trash your code and start again.

---
class: middle, center, inverse

## 4. Programs will run in different environments

---
class: middle, center, inverse

## Problem:

## Your program depends on other software

## Like: Python 3.6 or libraries

---
class: middle

## How to check?

- CI tests = *continuous integration tests*
- They automate installing your software on a pristine computer and running the tests.
- Can be integrated in `github.com`, `gitlab.com` or `gitlab.ethz.ch`.

---
class: center, middle, remark-frontslide-content

## CI Pipeline in `gitlab`.
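---
class: middle

A minimal sketch, not part of the original slides, of a script that a CI job (or you, before archiving results) could run to record which Python and library versions were actually used. The function name `describe_environment` and the package name `pip` are hypothetical choices for illustration; `importlib.metadata` assumes Python 3.8 or newer.

```python
import sys
from importlib import metadata


def describe_environment(packages):
    """Return the Python version and the installed versions of the given packages."""
    versions = {"python": "%d.%d.%d" % sys.version_info[:3]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # report missing packages instead of guessing a version
            versions[name] = "not installed"
    return versions


# "pip" is only an example package name
for name, version in describe_environment(["pip"]).items():
    print(name, version)
```

Storing this output next to your results documents the environment they were produced in.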
---
class: middle, center, inverse

## Sledgehammers for complex scenarios
---
class: middle

## Concepts

- Idea: bundle your software and all its dependencies.
- Virtual Machine (VM): the bundle contains a full operating system.
- Container: does not bundle a full operating system, it shares the kernel of the host.
- `docker`: one way to manage and run containers.
---
class: middle, center

## Comparison VM vs Container

|                 | Advantages | Disadvantages |
|-----------------|------------|---------------|
| Virtual Machine | Easy to set up | 10s of GB at least to ship<br>startup time: minutes<br>reduced performance |
| Container       | lightweight<br>startup time: milliseconds<br>native performance | Some learning involved,<br>Linux guest only |

---
class: middle, center, inverse

# All problems solved?
---
class: middle

## Computer arithmetic is not exact!

```python
>>> from math import sin, pi
>>> sin(pi)
1.2246467991473532e-16
>>> 0.1 + 0.2 + 0.3
0.6000000000000001
>>> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
False
```

- Such behaviour for `+`, `*`, `-` and `/` is standardized by the *IEEE Standard for Floating-Point Arithmetic (IEEE 754)*.
- But `exp` and other analytical functions are not!

---
class: middle

#### I ran this on two computers with different CPUs

```python
>>> import math
>>> "%.14e" % math.exp(-math.sin(431))
'1.76144146064997e+00'
```
```python
>>> "%.14e" % math.exp(-math.sin(431))
'1.76144146064998e+00'
```
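
The two results agree in all but the last printed digit. A minimal sketch, not part of the original slides, of comparing such values with a relative tolerance instead of `==`, using `math.isclose` from the standard library. The tolerance `1e-12` is an assumption and has to be chosen for each application.

```python
import math

# the two values reported above, from two different CPUs
result_a = 1.76144146064997e+00
result_b = 1.76144146064998e+00

# exact comparison fails although the values agree to 14 digits
exactly_equal = result_a == result_b
print("exactly equal:", exactly_equal)

# compare with a relative tolerance instead (1e-12 is an assumed choice)
close_enough = math.isclose(result_a, result_b, rel_tol=1e-12)
print("close enough:", close_enough)
```
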
- Such a difference in the last digit is very rare, and its actual effect (error propagation) needs mathematical analysis.
- CI testing can help to detect such issues!

---
class: middle

## Randomized algorithms

- Most random numbers are *pseudo random numbers*.
- Starting from a given "seed", the computer will always create the same random number sequence.
- Freeze the seed when archiving / publishing your code, and also when unit testing.

```python
>>> import random
>>> random.seed(42)
>>> random.random()
0.6394267984578837
```

---
class: middle, inverse, center

# But this is so much to learn
---
class: middle, inverse, center

# But this is so much to learn

## Learn incrementally

---
class: middle, inverse, center

# But this costs so much time
---
class: middle, inverse, center

# But this costs so much time

## Think about actual costs and risks.

---
class: inverse, center

# Summary

---
class: inverse, center

# Summary

### Learn programming!

---
class: inverse, center

# Summary

### Learn programming!

### Use `git`!

---
class: inverse, center

# Summary

### Learn programming!

### Use `git`!

### Write robust and clean code!

---
class: inverse, center

# Summary

### Learn programming!

### Use `git`!

### Write robust and clean code!

### Implement automated code tests!

---
class: inverse, center

# Summary

### Learn programming!

### Use `git`!

### Write robust and clean code!

### Implement automated code tests!

### Use VMs or containers!

---
class: inverse, center, middle

# Thanks for your attention!