from IPython.core.display import HTML

HTML(open("custom.html", "r").read())

Script 7

Recap from script 6

  • Opening files with open. open returns a file handle to work with the corresponding file.
  • File modes w and r.
  • Writing to files with write method
  • Writing to files with print
  • Reading from files with read
  • Iterating over lines of files with for
  • Using with to work with files.
  • Lists are defined with delimiters [ and ].
  • Indexed access to list elements with [..].
  • Reading and writing csv files with Pythons csv module.

More about lists

Up to now we learned that we can use for to iterate over range(..), over a string and over the lines of a file. This is not all, we can iterate over the elements of a list as well:

for x in [1, 4, 9]:
    print(x, x ** 2)
1 1
4 16
9 81

Lists have a method named append which allows to append a new element to an existing list:

li = [1, 2, 3]
print(li)
li.append(0)
print(li)
[1, 2, 3]
[1, 2, 3, 0]

This can be used to create lists starting with an empty list:

squares = []
for i in range(20):
    squares.append(i ** 2)
print(squares)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]

This can be used to create a new list from a given one:

odd_squares = []
for value in squares:
    if value % 2 == 1:
        odd_squares.append(value)
print(odd_squares)
[1, 9, 25, 49, 81, 121, 169, 225, 289, 361]

Some list functions

Python offers functions min, max, sum and sorted:

print(min([2, 3, 1]))
1
print(max([2, 3, 1]))
3
print(sum([2, 3, 1]))
6
print(sorted([2, 3, 1]))
[1, 2, 3]

sorted returns a sorted list for the given values.

Exercise block 1

  1. Repeat the examples above.
  2. Create a list containing powers of $2$ starting with $2^0$ up to $2^{16}$ with for. What is the sum of these numbers ?

  3. Use for to transform the list from Exercise 1.2 to a new list only containing those numbers having four digits.

  4. Write a program which asks the user for numbers until he enters x. Store those numbers in a list, and finally print the minimum, maximum and average of the given numbers. Make sure that the program works if the user enters x at first.
  5. Repetition exercise: how to check if a given number is prime.
  6. Write a program which creates a list of prime numbers in the range 2...1000. Start with an empty list, iterate over numbers 2...1000 and extend the list of prime numbers as needed.
  7. Write a program which extracts the status lines of a given FASTA file into a list and finally writes them sorted to a text file.
  8. (Optional) Write a program which extracts the longest sequence(s) from a FASTA file and writes them to another FASTA file. Consider the case the multiple sequences have the same maximal length. Hint: Use three lists, one with the identifiers, one with the according sequence, and a third one with the length of the sequences.

Minimal plotting example using lists

If you followed the instructions in the script 02_introduction_pycharm you also installed an external library names matplotlib, which is the mostly used Python package for plotting.

If the following code fails with an ImportError please read the 02 script again.

This is an introductory example how to use matplotlib.

import matplotlib.pyplot as plt
import math

x_values = []
y_values = []
z_values = []

n_points = 100

# we create data to plot first:

for i in range(n_points + 1):
    
    # xi is in the range 0 ... 3 pi:
    xi = i * 3 * math.pi / n_points 
    
    # fist function to plot
    yi = math.sin(xi)
    
    # second function to plot:
    zi = math.cos(xi * 4) * math.exp(-0.4 * xi)  # dampened oscilation
   
    x_values.append(xi)
    y_values.append(yi)
    z_values.append(zi)
    
# now we plot.

# new plot of given size:
plt.figure(figsize=(12, 3))

plt.plot(x_values, y_values, label="sin")
plt.plot(x_values, z_values, "r.", label="damped")  # red dots
plt.plot(x_values, z_values, "k", linewidth=.5)     # thin black line

plt.title("plot demo")

# only vertical grid lines:
plt.grid(axis="x")

# plot legend for those curves with given label=... argument:
plt.legend()

# save it to file
plt.savefig("demo_plot.png")

# show it on your desktop
plt.show()

Exercise block 2

  1. Type and run the previous example and play with it.
  2. Repetition exercise: count the number of collatz iterations needed for a given start value n.
  3. Create a list which contains the number of collatz iterations for start values 2 ... 1000. Plot the these numbers as dots.

Comment

matplotlib is not complicated, but complex. matplotlib offers many kinds of plots which again offer many styling options. The previous example was just a quick introduction to show how to use this library. For more examples have a look at the gallery at the website of matplotlib. For complicated plots I usually copy one of the examples from the gallery and adopt it to my needs.

Splitting strings

Often you want to analyze strings by decomposing them into parts. If you look back to the status lines from the FASTA file we used in the previous script, eg >gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA you might be interested in the proteind id 2765658. This is where the .split method of strings comes into play. The following example decomposes a sentence to a list of words:

print("my name is monty".split(" "))
['my', 'name', 'is', 'monty']

So X.split(Y) creates a list of strings being parts of X separted by Y.

description = ">gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA"
fields = description.split("|")
print(fields)
print(fields[1])
print(fields[2])
['>gi', '2765658', 'emb', 'Z78533.1', 'CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA']
2765658
emb

split() without argument splits for spaces and \n:

# do you remember multi line strings ? here is one:

txt = """this is some  text  some 
senseless text             indeed
"""

index = 0
for word in txt.split():
    print("word", index, "is", word)
    index = index + 1
word 0 is this
word 1 is is
word 2 is some
word 3 is text
word 4 is some
word 5 is senseless
word 6 is text
word 7 is indeed

Exercise block 3

  1. Repeat the examples before.
  2. Download https://siscourses.ethz.ch/python_dbiol/codons.txt and save it into your PyCharm project folder. Then write a program which extracts all codons (for non biologists: the words of length 3) from this file and prints them alphabetically sorted.
  3. Modify the previous program to write a csv file with two columns: the codons and the associated one letter amino-acid symbols following the codons in the text file. It is not required that the codons are sorted.

Accessing elements of a list

Similar to access elements of a string, square brackets can be used to access elements of parts of a list:

li = ["a", "b", 3, 4]
print(li[0])
print(li[-1])
a
4

Slicing (was introduced in the script about strings !) works as well:

print(li[1:-1])
['b', 3]

Finding a value in a list

Lists have a method named index to find the position of a value in a list:

print(odd_squares.index(25))
2

Regrettably this method results in an error message if the element you are looking for is not present:

print(odd_squares.index(4))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-4035a17a631a> in <module>()
----> 1 print(odd_squares.index(4))

ValueError: 4 is not in list

To avoid that you can use the in operator, which computes True or False:

print(4 in odd_squares)
False

The negation is not in:

print(3 not in odd_squares)
True

So you can implement a robust lookup like this:

data = [2, 3, 5, 7, 11, 13]

number = int(input("number to lookup: "))
if number in data:
    print("found number at position", data.index(number))
else:
    print("number not found")
---------------------------------------------------------------------------
StdinNotImplementedError                  Traceback (most recent call last)
<ipython-input-20-c5744dc8b20b> in <module>()
      1 data = [2, 3, 5, 7, 11, 13]
      2 
----> 3 number = int(input("number to lookup: "))
      4 if number in data:
      5     print("found number at position", data.index(number))

/Users/uweschmitt/Projects/python3-course/venv/lib/python3.6/site-packages/ipykernel/kernelbase.py in raw_input(self, prompt)
    687         if not self._allow_stdin:
    688             raise StdinNotImplementedError(
--> 689                 "raw_input was called, but this frontend does not support input requests."
    690             )
    691         return self._input_request(str(prompt),

StdinNotImplementedError: raw_input was called, but this frontend does not support input requests.

Exercise block 4

  1. Type and run the examples.
  2. The Fibonacci sequence starts with numbers 1, 1, then every number after the first two is the sum of the two preceding ones. So the sequence is 1, 1, 2, 3, 5, 8, 13, ....

    Compute the first 100 numbers of the Fibonacci sequence using a list + for. Start with the list [1, 1]. Using negative indexes will make your life easier !

Exercise 4.3

Extend the amino acid csv file reading exercise from the last script to create two lists: the first list contains the one letter codes, the second list the corresponding average masses as float(!) values.

You have to start with two empty lists which you extend when iterating over the lines of the csv file. For every row in the csv file you have to pick the required values from the row and append them to the two lists for symbols and masses.

Exercise 4.4

Now extend this solution to ask the user for a one letter code and then print the average mass of the amino acid.

Do this calculation using the two lists. (Hint: first find the position of the one letter code in the first list, then look up the average mass at the same position).

Exercise 4.5

Final extension: Ask the user for a amino acid sequence and compute the overall weight of the sequence. Do not forget to subtract the water loss which is n - 1 times the mass of water for a peptide having n amino acids.

(Hint: you need an extra "outer" for symbol in sequence: to iterate over the symbols of the given sequence. The body of this loop then looks up the particular average masses as done in 6. and sums them up)

Comment: The strategy in exercises 4.3 to 4.4 should be preferred to the solution from the last script. Reading large files might be slow and if you have to lookup data multiple times it is more efficient to first load the interesting data into a container (like a list) and the do the lookups on this extracted data.

Introduction to dictionaries

A dictionary represents a mapping which you can imagine as a look-up-table. The elements of a dictionary are key-value pairs. So for example the mapping

key value
duke ellington
charles mingus
john coltrane

allows you to lookup the second name ("value") if you know the first name ("key") of a famous jazz player. This is what dictionaries are about (not jazz, but "lookup").

The same table is written in Python as follows:

first_to_second_name = {
    "duke" : "ellington",
    "charles" : "mingus",
    "john" : "coltrane"
}

Exercise 5

  • type the dictionary above before you go on in the script

As you can see a dictionary is delimited by curly braces { and } and the keys and values are separated by :. The rows of the lookup table are separated by ,.

If you now have a dictionary you use [ and ] to lookup a value for a given key:

print(first_to_second_name["john"])
coltrane

If you try to lookup a value for a non-existing key you will get an error message:

print(first_to_second_name["thomas"])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-23-b049c6dc9b8c> in <module>()
----> 1 print(first_to_second_name["thomas"])

KeyError: 'thomas'

We have seen square brackets in different situations:

  • fetching characters or parts of a string
  • fetching elements from a list
  • dictionary lookup

Here comes another:

If you see a dictionary followed by [...] on the left side of an assignment = this will not lookup a value but write a new entry into the dictionary:

first_to_second_name["thomas"] = "peterson"

After this assignment the lookup table will be:

key value
duke ellington
charles mingus
john coltrane
thomas peterson

You can inspect a dictionary with print:

print(first_to_second_name)
{'duke': 'ellington', 'charles': 'mingus', 'john': 'coltrane', 'thomas': 'peterson'}

If you look carefully you will see that the order of the lines of the table is not the same as in the displayed dictionary !

But:

  • the one and only use case for a dictionary is to look up a value for a given key and for such a lookup the ordering of the rows in the table does not matter, and
  • Python internally holds a data structure which messes up the row ordering and but is optimized for lookup-speed, you can have dictionaries with billions of entries and looking up values will still be very fast.

After inserting the new entry the previously failing lookup works:

print(first_to_second_name["thomas"])
peterson

Dictionary keys and values may have an arbitrary type and are not limited to strings as we demonstrated in the examples above:

weird_dict = {1 : "13", "a" : 7, "" : True}
print(weird_dict[""])
True

You can see that the types are mixed and that the notation to declare the dictionary is a bit different. In general indentation does not matter for declaring dictionaries. But having every key/value pair is more readable.

The empty dictionary is denoted as {} and we write key value pairs into this:

squares = {}
for i in range(1, 5):
    squares[i] = i * i
print(squares)

print(squares[3])
{1: 1, 2: 4, 3: 9, 4: 16}
9

Exercise block 6

  1. Write a program which creates a dictionary mapping number 1 .. 10 to their doubled value.
  2. Modify your solution of exercise 3.3 to create a dictionary mapping codon to amino acid symbol.
  3. Modify your solution of exercise 6.7 from the last script to create dictionary mapping symbols to masses.

Dictionaries have methods .keys and .values:

print(first_to_second_name.keys())
dict_keys(['duke', 'charles', 'john', 'thomas'])
print(first_to_second_name.values())
dict_values(['ellington', 'mingus', 'coltrane', 'peterson'])

Although the output indicates that .keys and .values do not return Python lists, both behave in many cases like list. So we can:

  • iterate over them
  • use in for checking membership
for key in first_to_second_name.keys():
    print(key)
duke
charles
john
thomas
print("thomas" in first_to_second_name.values())
False

Again the negation of in is not in which tests for "non-membership":

print("uwe" not in first_to_second_name.values())
True

Counting with dictionaries

In order to create a dictionary which maps words in a given text to the word counts ("word histogram") we combine:

  • splitting a text into words using split(" ")
  • using a dictionary to store word occurrences
  • testing if a given key is already present in a dictionary

The strategy is to build a dictionary which maps every word in the text to its count. To do this we:

  1. we start with an empty dictionary
  2. then we iterate over all words from the text. For every word we increment this counter.
  3. if we see a word the first time we write the count 1 to the dictionary
  4. else we increment the existing count by 1.

(This strategy is needed because we do not know in advance which words we will have to count)

And this is the implementation:

txt = "this is some text some senseless text indeed"

counts = {}
for word in txt.split(" "):
    if word not in counts.keys():
        counts[word] = 1
    else:
        counts[word] = counts[word] + 1
        
print(counts)
{'this': 1, 'is': 1, 'some': 2, 'text': 2, 'senseless': 1, 'indeed': 1}

Above we see in counts[word] = counts[word] + 1 that dictionary access with square brackets [...] has two meanings:

  • reading if this occurs on the right side of =
  • writing if this occurs on the left side of =.

So this statement fetches the current count of word, increases this count by 1 and then writes the new value back into the dictionary.

Exercise block 7

  1. Reproduce the examples above. As always: do no use copy/paste !!
  2. Modify the script to count all symbols in a given string (You remember that you can use for to iterate over the characters of a string?)

  3. (optional) First implement code which splits a given RNA sequence into a list of codons (Hint: range(a, b, c) counts from a to b with step-size c) and then extend this to translate a given RNA sequence to the corresponding amino acid sequence. (You may use 'slicing' as explained in the script about strings)

(Optional) Dictionaries for grouping

The values in a dictionary can be arbitrary (for keys there are some type restrictions), so they can be lists or other dictionaries as you can see in the following example:

dd = {3 : [5, 6], "4" : 2, False: {"a": 4}}
print(dd)
print(dd[3][0])
print(dd[False]["a"])
{3: [5, 6], '4': 2, False: {'a': 4}}
5
4

Task

Suppose we have two lists: one list of values and a list of same length indicating the group id of the corresponding value.

Example: The group identifier could correspond to a certain experimental condition and we want to implement a Python script to compute the average value per group.

values = [1, 2, 3, 4, 5]
groups = [1, 0, 0, 1, 1]

This means that values 2 and 3 belong to group 0 and 1, 4 and 5 to group 1.

For this kind of data analysis the first task is to split the measured values and collect them group-wise.

To do this with Python we can use a dictionary. Every key in this dictionary will be a group identifier, every dictionary value is a list of numbers in the corresponding group. This is what we want to compute:

key value
0 [2, 3]
1 [1, 4, 5]

The following script computes the wanted dictionary, the strategy is similar to the histogram computations we did above:

groups = [1, 0, 0, 1, 1]
values = [1, 2, 3, 2, 7]

assignments = {}
for i in range(len(values)):
    group = groups[i]
    value = values[i]
    if group not in assignments.keys():
        assignments[group] = []
    assignments[group].append(value)   # read comment below

print(assignments)
{1: [1, 2, 7], 0: [2, 3]}

The line assignments[group].append(value) says: fetch the existing list for key group and append a new value to this list.

(Optional) Exercise block 8

  1. Type the last script and try to understand it ! (It might help to add print statements to display interesting values during every iteration).
  2. Append own code to the script to compute group averages based on the assignments dictionary. Hint: write a for loop over the keys of assignments, then sum(...) which computes the sum of values in a given list.

  3. Extend it to work on a given csv file having two columns for group and value.

  4. Extend it to create a new csv file which extends the given two columns by a new column with (maybe repeated) average values. So the result could look like:
group value group average
0 1 3
0 3 3
3 2 3
3 4 3
0 5 3
4 6 6