The previous script introduced `open`, which returns a file handle to work with the corresponding file, the modes `w` and `r`, the `.write` method, `print`, `read`, `for` and `with` to work with files, square brackets `[` and `]` and `[..]`, and `.csv` files with Python's `csv` module.
Up to now we learned that we can use `for` to iterate over `range(..)`, over a string and over the lines of a file. This is not all: we can iterate over the elements of a list as well:
for x in [1, 4, 9]:
    print(x, x ** 2)
Lists have a method named `append` which appends a new element to an existing list:
li = [1, 2, 3]
print(li)
li.append(0)
print(li)
This can be used to create lists starting with an empty list:
squares = []
for i in range(20):
    squares.append(i ** 2)
print(squares)
This can be used to create a new list from a given one:
odd_squares = []
for value in squares:
    if value % 2 == 1:
        odd_squares.append(value)
print(odd_squares)
Python offers the functions `min`, `max`, `sum` and `sorted`:
print(min([2, 3, 1]))
print(max([2, 3, 1]))
print(sum([2, 3, 1]))
print(sorted([2, 3, 1]))
`sorted` returns a sorted list for the given values.
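As a small sketch of how these functions combine (the list `measurements` is made up for illustration and is not part of the script): `sorted` returns a new list and leaves the original untouched, and `sum` together with `len` gives the average.

```python
# hypothetical example data, chosen only for this illustration
measurements = [2, 3, 1, 6]

ordered = sorted(measurements)                    # a new list, the original is unchanged
average = sum(measurements) / len(measurements)   # sum divided by number of elements

print(measurements)   # [2, 3, 1, 6]
print(ordered)        # [1, 2, 3, 6]
print(min(measurements), max(measurements), average)
```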
Create a list containing powers of $2$ starting with $2^0$ up to $2^{16}$ with `for`. What is the sum of these numbers?
Use `for` to transform the list from Exercise 1.2 to a new list only containing those numbers having four digits.
Ask the user for numbers until the user enters `x`. Store those numbers in a list, and finally print the minimum, maximum and average of the given numbers. Make sure that the program works if the user enters `x` at first.
If you followed the instructions in the script `02_introduction_pycharm` you also installed an external library named `matplotlib`, which is the most widely used Python package for plotting.
If the following code fails with an `ImportError` please read the `02` script again.
This is an introductory example of how to use `matplotlib`.
import matplotlib.pyplot as plt
import math
x_values = []
y_values = []
z_values = []
n_points = 100
# we create data to plot first:
for i in range(n_points + 1):
    # xi is in the range 0 ... 3 pi:
    xi = i * 3 * math.pi / n_points
    # first function to plot:
    yi = math.sin(xi)
    # second function to plot:
    zi = math.cos(xi * 4) * math.exp(-0.4 * xi)  # damped oscillation
    x_values.append(xi)
    y_values.append(yi)
    z_values.append(zi)
# now we plot.
# new plot of given size:
plt.figure(figsize=(12, 3))
plt.plot(x_values, y_values, label="sin")
plt.plot(x_values, z_values, "r.", label="damped") # red dots
plt.plot(x_values, z_values, "k", linewidth=.5) # thin black line
plt.title("plot demo")
# only vertical grid lines:
plt.grid(axis="x")
# plot legend for those curves with given label=... argument:
plt.legend()
# save it to file
plt.savefig("demo_plot.png")
# show it on your desktop
plt.show()
`matplotlib` is not complicated, but complex. It offers many kinds of plots, which in turn offer many styling options. The previous example was just a quick introduction to show how to use this library. For more examples have a look at the gallery on the `matplotlib` website. For complicated plots I usually copy one of the examples from the gallery and adapt it to my needs.
Often you want to analyze strings by decomposing them into parts. If you look back at the status lines from the FASTA file we used in the previous script, e.g. `>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA`, you might be interested in the protein id `2765658`. This is where the `.split` method of strings comes into play. The following example decomposes a sentence into a list of words:
print("my name is monty".split(" "))
So `X.split(Y)` creates a list of strings which are the parts of `X` separated by `Y`.
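A short sketch with other separators may help here. The strings `date` and `row` are invented for this illustration. Note that the parts are always strings, so they have to be converted if numbers are needed.

```python
# illustrative strings, not taken from the FASTA file
date = "2024-01-31"
parts = date.split("-")
print(parts)               # ['2024', '01', '31']

# the separator may consist of more than one character
row = "a, b, c"
letters = row.split(", ")
print(letters)             # ['a', 'b', 'c']

# the parts are strings, convert them if you need numbers
year = int(parts[0])
print(year + 1)            # 2025
```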
description = ">gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA"
fields = description.split("|")
print(fields)
print(fields[1])
print(fields[2])
`split()` without an argument splits at any run of whitespace, such as spaces and `\n`:
# do you remember multi line strings ? here is one:
txt = """this is some text some
senseless text indeed
"""
index = 0
for word in txt.split():
    print("word", index, "is", word)
    index = index + 1
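The difference between `split()` and `split(" ")` shows up as soon as a text contains repeated spaces or line breaks. The following sketch uses a made-up string `sample` to compare both calls:

```python
# made-up text with two spaces, one line break and three spaces
sample = "some  text\nwith   gaps"

by_space = sample.split(" ")   # splits at every single space
by_whitespace = sample.split() # splits at any run of whitespace

print(by_space)        # ['some', '', 'text\nwith', '', '', 'gaps']
print(by_whitespace)   # ['some', 'text', 'with', 'gaps']
```

With `split(" ")` every additional space produces an empty string in the result, and the line break stays inside a word.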
Similar to accessing the characters of a string, square brackets can be used to access elements of a list:
li = ["a", "b", 3, 4]
print(li[0])
print(li[-1])
Slicing (introduced in the script about strings!) works as well:
print(li[1:-1])
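A few more index and slice variants, sketched on the same list. The name `middle` is introduced only for this example. A slice is a new list, so changing it leaves the original list as it was.

```python
li = ["a", "b", 3, 4]

print(li[:2])     # the first two elements: ['a', 'b']
print(li[-2:])    # the last two elements: [3, 4]
print(li[-2])     # the second to last element: 3

# a slice is a new list, changing it does not change li
middle = li[1:-1]
middle.append("x")
print(middle)     # ['b', 3, 'x']
print(li)         # ['a', 'b', 3, 4]
```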
Lists have a method named `index` to find the position of a value in a list:
print(odd_squares.index(25))
Regrettably this method raises an error (a `ValueError`) if the element you are looking for is not present:
print(odd_squares.index(4))
To avoid that you can use the `in` operator, which computes `True` or `False`:
print(4 in odd_squares)
The negation is `not in`:
print(3 not in odd_squares)
So you can implement a robust lookup like this:
data = [2, 3, 5, 7, 11, 13]
number = int(input("number to lookup: "))
if number in data:
    print("found number at position", data.index(number))
else:
    print("number not found")
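The same check can be tried without typing input by looping over a few example queries. This is only a sketch: the queries `7`, `4` and `13` and the use of `None` as a "not found" marker are choices made for this illustration.

```python
data = [2, 3, 5, 7, 11, 13]

positions = []
for number in [7, 4, 13]:          # example queries, chosen for illustration
    if number in data:
        # safe: index() is only called for values which are present
        positions.append(data.index(number))
    else:
        # None marks "not found" (our own convention)
        positions.append(None)

print(positions)   # [3, None, 5]
```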
The Fibonacci sequence starts with the numbers `1`, `1`, then every number after the first two is the sum of the two preceding ones. So the sequence is `1, 1, 2, 3, 5, 8, 13, ...`.
Compute the first `100` numbers of the Fibonacci sequence using a list and `for`. Start with the list `[1, 1]`. Using negative indexes will make your life easier!
Extend the amino acid csv file reading exercise from the last script to create two lists: the first list contains the one letter codes, the second list the corresponding average masses as `float` (!) values.
You have to start with two empty lists which you extend when iterating over the lines of the csv file. For every row in the csv file you have to pick the required values from the row and append them to the two lists for symbols and masses.
Now extend this solution to ask the user for a one letter code and then print the average mass of the amino acid.
Do this calculation using the two lists. (Hint: first find the position of the one letter code in the first list, then look up the average mass at the same position).
Final extension: Ask the user for an amino acid sequence and compute the overall weight of the sequence. Do not forget to subtract the water loss, which is `n - 1` times the mass of water for a peptide having `n` amino acids.
(Hint: you need an extra "outer" `for symbol in sequence:` to iterate over the symbols of the given sequence. The body of this loop then looks up the particular average masses as done in 6. and sums them up)
Comment: The strategy in exercises 4.3 to 4.4 should be preferred to the solution from the last script. Reading large files might be slow, and if you have to look up data multiple times it is more efficient to first load the interesting data into a container (like a list) and then do the lookups on this extracted data.
A dictionary represents a mapping which you can imagine as a look-up-table. The elements of a dictionary are key-value pairs. So for example the mapping
key | value |
---|---|
duke | ellington |
charles | mingus |
john | coltrane |
allows you to look up the second name ("value") if you know the first name ("key") of a famous jazz player. This is what dictionaries are about (not jazz, but "lookup").
The same table is written in Python as follows:
first_to_second_name = {
"duke" : "ellington",
"charles" : "mingus",
"john" : "coltrane"
}
As you can see a dictionary is delimited by curly braces `{` and `}`, and the keys and values are separated by `:`. The rows of the lookup table are separated by `,`.
Once you have a dictionary you use `[` and `]` to look up a value for a given key:
print(first_to_second_name["john"])
If you try to look up a value for a non-existing key you will get an error message (a `KeyError`):
print(first_to_second_name["thomas"])
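To avoid this error the `in` operator known from lists can be used: applied to a dictionary it checks whether a key is present. A minimal sketch, where the list of names to try and the list `found` are chosen only for this illustration:

```python
first_to_second_name = {
    "duke": "ellington",
    "charles": "mingus",
    "john": "coltrane",
}

found = []
for name in ["john", "thomas"]:       # example queries
    if name in first_to_second_name:  # checks the keys of the dictionary
        print(name, "->", first_to_second_name[name])
        found.append(name)
    else:
        print(name, "is not a key of the dictionary")
```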
We have seen square brackets in different situations, for example to write down lists and to access elements of strings, lists and dictionaries. Here comes another:
If you see a dictionary followed by `[...]` on the left side of an assignment `=`, this will not look up a value but write a new entry into the dictionary:
first_to_second_name["thomas"] = "peterson"
After this assignment the lookup table will be:
key | value |
---|---|
duke | ellington |
charles | mingus |
john | coltrane |
thomas | peterson |
You can inspect a dictionary with `print`:
print(first_to_second_name)
Note: in Python versions before 3.7 the order of the entries in the displayed dictionary could differ from the order of the rows in the table. Since Python 3.7 dictionaries keep the order in which the entries were inserted.
After inserting the new entry the previously failing lookup works:
print(first_to_second_name["thomas"])
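Writing to a key that already exists does not add a second row to the lookup table, it replaces the stored value. The following sketch uses a separate dictionary `names`, and the replacement value is just an illustration:

```python
names = {"duke": "ellington", "john": "coltrane"}

names["charles"] = "mingus"    # new key: a new entry is added
print(len(names))              # 3

names["john"] = "scofield"     # existing key: the old value is replaced
print(len(names))              # still 3
print(names["john"])           # scofield
```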
Dictionary keys and values may have an arbitrary type and are not limited to strings as we demonstrated in the examples above:
weird_dict = {1 : "13", "a" : 7, "" : True}
print(weird_dict[""])
You can see that the types are mixed and that the notation to declare the dictionary is a bit different: here the whole dictionary is written on one line. In general indentation does not matter for declaring dictionaries, but having every key/value pair on its own line is more readable.
The empty dictionary is denoted as `{}` and we can write key-value pairs into it:
squares = {}
for i in range(1, 5):
    squares[i] = i * i
print(squares)
print(squares[3])
Dictionaries have the methods `.keys` and `.values`:
print(first_to_second_name.keys())
print(first_to_second_name.values())
Although the output indicates that `.keys` and `.values` do not return Python lists, both behave like lists in many cases. So we can:
- iterate over them with `for`
- use `in` for checking membership
for key in first_to_second_name.keys():
    print(key)
print("thomas" in first_to_second_name.values())
Again the negation of `in` is `not in`, which tests for "non-membership":
print("uwe" not in first_to_second_name.values())
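Where a real list is needed, for example to access an element by position, the result of `.keys` or `.values` can be converted with `list(...)`. The sketch below also shows that `in` applied to the dictionary itself checks the keys, not the values. The variable names `second_names` and `sorted_keys` are introduced only for this example.

```python
first_to_second_name = {
    "duke": "ellington",
    "charles": "mingus",
    "john": "coltrane",
}

# list(...) converts the result of .values() into a real list
second_names = list(first_to_second_name.values())
print(len(second_names))    # 3

# sorted(...) accepts the keys directly and returns a list
sorted_keys = sorted(first_to_second_name.keys())
print(sorted_keys)          # ['charles', 'duke', 'john']

# "in" on the dictionary itself looks at the keys only
print("john" in first_to_second_name)               # True
print("coltrane" in first_to_second_name)           # False
print("coltrane" in first_to_second_name.values())  # True
```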
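In order to create a dictionary which maps words in a given text to the word counts ("word histogram") we combine a dictionary, a `for` loop and `split(" ")`.
The strategy is to build a dictionary which maps every word in the text to its count. To do this we:
- iterate over the words of the text
- if a word is not yet a key of the dictionary, write the word with count `1` to the dictionary
- else increase the existing count by `1`.
(This strategy is needed because we do not know in advance which words we will have to count)
And this is the implementation: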
txt = "this is some text some senseless text indeed"
counts = {}
for word in txt.split(" "):
    if word not in counts.keys():
        counts[word] = 1
    else:
        counts[word] = counts[word] + 1
print(counts)
Above we see in `counts[word] = counts[word] + 1` that dictionary access with square brackets `[...]` has two meanings:
- reading a value, on the right side of `=`
- writing a value, on the left side of `=`.
So this statement fetches the current count of `word`, increases this count by `1` and then writes the new value back into the dictionary.
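The two meanings become visible when the statement is split into separate steps. This is only a sketch with a tiny hand-made dictionary. The helper variable `current` and the short form `+=` are not used in the script above.

```python
counts = {"text": 1}
word = "text"

current = counts[word]       # right side of "=": read the stored count
counts[word] = current + 1   # left side of "=": write the new count
print(counts)                # {'text': 2}

# the short form "+=" does the same read and write in one statement
counts[word] += 1
print(counts)                # {'text': 3}
```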
Modify the script to count all symbols in a given string (You remember that you can use `for` to iterate over the characters of a string?)
(optional) First implement code which splits a given RNA sequence into a list of codons (Hint: `range(a, b, c)` counts from `a` to `b` with step-size `c`) and then extend this to translate a given RNA sequence to the corresponding amino acid sequence. (You may use 'slicing' as explained in the script about strings)
The values in a dictionary can be arbitrary (for keys there are some type restrictions), so they can be lists or other dictionaries as you can see in the following example:
dd = {3 : [5, 6], "4" : 2, False: {"a": 4}}
print(dd)
print(dd[3][0])
print(dd[False]["a"])
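Nested values can also be modified through the same chain of square brackets. A brief sketch on the dictionary from above, where the appended value `7` and the new key `"b"` are arbitrary choices for illustration:

```python
dd = {3: [5, 6], "4": 2, False: {"a": 4}}

dd[3].append(7)       # dd[3] is a list, so list methods work
dd[False]["b"] = 5    # dd[False] is a dictionary, so we can add a key

print(dd[3])          # [5, 6, 7]
print(dd[False])      # {'a': 4, 'b': 5}
print(len(dd[3]))     # 3
```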
Suppose we have two lists: one list of values and a list of same length indicating the group id of the corresponding value.
Example: The group identifier could correspond to a certain experimental condition and we want to implement a Python script to compute the average value per group.
values = [1, 2, 3, 4, 5]
groups = [1, 0, 0, 1, 1]
This means that values `2` and `3` belong to group `0`, and values `1`, `4` and `5` belong to group `1`.
For this kind of data analysis the first task is to split the measured values and collect them group-wise.
To do this with Python we can use a dictionary. Every key in this dictionary will be a group identifier, every dictionary value is a list of numbers in the corresponding group. This is what we want to compute:
key | value |
---|---|
0 | [2, 3] |
1 | [1, 4, 5] |
The following script computes the wanted dictionary. The strategy is similar to the histogram computation we did above:
groups = [1, 0, 0, 1, 1]
values = [1, 2, 3, 4, 5]
assignments = {}
for i in range(len(values)):
    group = groups[i]
    value = values[i]
    if group not in assignments.keys():
        assignments[group] = []
    assignments[group].append(value)  # read comment below
print(assignments)
The line `assignments[group].append(value)` says: fetch the existing list for key `group` and append a new value to this list.
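Written as two separate steps this looks as follows. The sketch starts from a small hand-made dictionary, and the helper name `existing` is used only here. No write back into the dictionary is needed, because `append` changes the list that is already stored there.

```python
assignments = {0: [2]}
group = 0
value = 3

existing = assignments[group]   # step 1: look up the list stored for this group
existing.append(value)          # step 2: extend that same list in place

print(assignments)              # {0: [2, 3]}
```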
Make sure you understand how the script above works (Hint: insert `print` statements to display interesting values during every iteration).
Append your own code to the script to compute group averages based on the `assignments` dictionary. Hint: write a `for` loop over the keys of `assignments`, then use `sum(...)`, which computes the sum of values in a given list.
Extend it to work on a given csv file having two columns for group and value.
group | value | group average |
---|---|---|
0 | 1 | 3 |
0 | 3 | 3 |
3 | 2 | 3 |
3 | 4 | 3 |
0 | 5 | 3 |
4 | 6 | 6 |