from IPython.core.display import HTML

HTML(open("custom.html", "r").read())


# Script 9: miscellaneous¶

### Basic introduction to regular expressions¶

So called regular expressions describe patterns which can be used to search strings.

Such a pattern is a string, but some characters as .[]?+*()^ have special meanings.

Python has the re module to work with regular expressions, the function re.search looks for the first occurence of the pattern in the given string:

import re

sequence = "ATGCATGC"
pattern = "GCA"         # pattern withouth special character !

match = re.search(pattern, sequence)
if match != None:
print("first match:", match.group(), match.start(), match.end())
else:
print("no match")

first match: GCA 2 5


re.search returns either None or a "match object". As you can see in the example above this object has some methods to give you more information about the match.

The special symbol . encodes "any character". So you can see that GC. matches GCA:

sequence = "ATGCATGC"
pattern = "GC."

match = re.search(pattern, sequence)
if match != None:
print("first match:", match.group(), match.start(), match.end())
else:
print("no match")

first match: GCA 2 5


[...] means "exactly one of the characters between the brackets":

print(re.search("GC[ATG]", "GCAG") != None)
print(re.search("GC[ATG]", "GCTG") != None)
print(re.search("GC[ATG]", "GCCG") != None)

True
True
False


To find all occurences you can use finditer:

for match in re.finditer("GC[AT]", "GCAGAAGCTGCC"):
print(match.group(), match.start(), match.end())

GCA 0 3
GCT 6 9

for match in re.finditer("GC.", "GCAGAAGCTGCC"):
print(match.group(), match.start(), match.end())

GCA 0 3
GCT 6 9
GCC 9 12


The expression | means "either or":

print(re.search("A(bcd|BCD)E", "ABCDE") != None)
print(re.search("A(bcd|BCD)E", "AbcdE") != None)
print(re.search("A(bcd|BCD)E", "ABcdE") != None)

True
True
False


### Exercises¶

• reproduce the previous examples and play with them

## More about range¶

range accepts zero to two arguments:

• range(n) counts from 0 to n - 1.
• range(m, n) counts from m to n - 1
• range(m, n, k) counts from m to n - 1 with step size k
for i in range(2, 11, 3):
print(i)

2
5
8

for i in range(6, 3, -1):
print(i)

6
5
4


We now combine slicing and range with three arguments to split a sequence into codons:

def split_codons(sequence):
codons = []
for start in range(0, len(sequence), 3):
codon = sequence[start:start + 3]
codons.append(codon)
return codons

print(split_codons("ATGCATGCA"))

['ATG', 'CAT', 'GCA']


## Exercises¶

1. Reproduce the examples above

## Functions with multiple return values¶

Functions may compute more than one result. In this case you have to list the values separated by , after return:

def sum_and_diff(a, b):
return a + b, a - b

x, y = sum_and_diff(10, 3)
print(x, y)

13 7


You can see above that we have to use x, y = sum_and_diff(10, 3) to receive both return values in given order. x will be the value of a + b and y will be a - b.

## Algebraic updates¶

Python provides some shortcuts for common algebraic operations:

• x += m is the same as x = x + m
• x -= m is the same as x = x - m
• x *= m is the same as x = x * m
• x /= m is the same as x = x / m

Here is an example which combines algebraic updates and multiple return values:

def statistics(li):
count = 0
acc = 0
for element in li:
count += 1
acc += element
return count, acc, acc / count

count, sum, average = statistics([1, 2, 3, 4, 5, 6])
print("counted", count, "numbers having sum", sum, "and average", average)

counted 6 numbers having sum 21 and average 3.5


## Exercise block 3¶

1. Reproduce the examples above
2. Write a function which returns minimum and maximum of a given list of numbers

## More about print¶

We already introduced the form print(..., file=fh) to write to a file instead displaying the output on the console. Beyond that print has other (so called) named arguments.

Before we introduce other named arguments, we state:

• every print will automatically output a \n after execution, so the next print will display output on a new line
• a plain print() creates an empty line
• print with mutliple arguments will separate them by a single space " ":

You can observe this in the following example:

print(3)
print()
print(1, 2, 3)

3

1 2 3


To modify some of these properities the named parameters sep and end come into play. sep allows other separators than a single blank character:

print(1, 2, 3, sep=", ")

1, 2, 3


And end allows to override the default \n to avoid the line break:

print(1, 2, 3, end=" ")
print(4)
print(1, 2, 3, end="")
print(4)

1 2 3 4
1 2 34


We can use this to print a multiplication table:

def print_table(up_to):
for row in range(1, up_to + 1):
for col in range(1, up_to + 1):
cell_value = row * col
print(cell_value, end=" ")
print()

print_table(9)

1 2 3 4 5 6 7 8 9
2 4 6 8 10 12 14 16 18
3 6 9 12 15 18 21 24 27
4 8 12 16 20 24 28 32 36
5 10 15 20 25 30 35 40 45
6 12 18 24 30 36 42 48 54
7 14 21 28 35 42 49 56 63
8 16 24 32 40 48 56 64 72
9 18 27 36 45 54 63 72 81


The table looks a bit ugly, so we add a leading space for numbers with one digit:

def print_pretty_table(up_to):
for row in range(1, up_to + 1):
for col in range(1, up_to + 1):
cell_value = row * col
if cell_value < 10:
print(" ", end="")
print(cell_value, end=" ")
print()

print_pretty_table(9)

 1  2  3  4  5  6  7  8  9
2  4  6  8 10 12 14 16 18
3  6  9 12 15 18 21 24 27
4  8 12 16 20 24 28 32 36
5 10 15 20 25 30 35 40 45
6 12 18 24 30 36 42 48 54
7 14 21 28 35 42 49 56 63
8 16 24 32 40 48 56 64 72
9 18 27 36 45 54 63 72 81


## String interpolation¶

The expression you see between the brackets of print is called "string interpolation":

def greet(name):
print("hi %s how do you do" % name)

greet("bart simpson")

hi bart simpson how do you do


Explanation: The %s is a place holder, and when Python evaluates "hi %s how do you do" % name the value of name will be inserted at this position.

For multiple place holders in the template you have to use round brackets after %, place holders are filled with the given values in order:

def print_sum(a, b):
print("%s plus %s is %s" % (a, b, a + b))

print_sum(3, 4)

3 plus 4 is 7


%s is one of many possible place holders. Another place holder has the form %.nf where n is a number and f is fixed, formats a number with n digits after the decimal point:

import math

# print only two numbers after decimal point
print("pi with two digits is %.2f" % math.pi)

pi with two digits is 3.14

print("%.1f" % 3)

3.0


The form %m.nf formats with n numbers after the decimal point and pads spaces from the left so that the full result is at least m characters wide:

print("%5.2f" % math.pi)
print("%5.2f" % 11.3)
print("%5.2f" % 121.3)

 3.14
11.30
121.30


For integer numbers you can use %nd for padding up to n characters:

print("%3d" % 4)
print("%3d" % 44)

  4
44


And e and variations print in scientific notation:

print("%e" % math.pi)
print("%.2e" % math.pi)
print("%10.2e" % math.pi)

3.141593e+00
3.14e+00
3.14e+00


## Exercise block 4¶

1. Reproduce the examples above and try to understand how the multiplication table examples work !
2. Can you rewrite print_pretty_table without the if by using string interpolation instead ?
3. Write a function print_triangle(n) which displays a symmetric triangle of height n. Assume that n is an odd number. So print_triangle(5) displays

    *
***
*****
*******
*********

## More about lists¶

For common operations on lists Python provides some convenience functions. They replace some computations we exercised up to now.

To compute the maximum or minimum of a given list max and min functions exist:

print(max([2, 3, 1]))

3

print(min([2, 3, 1]))

1


Beyond that sorted computes a sorted list from a given list:

print(sorted([3, 1, 2]))

[1, 2, 3]


Many tasks with lists as transforming or filtering a list can be expressed by so called list comprehensions. Here are a few examples:

numbers = range(10)
squared_numbers = [n ** 2 for n in numbers]
print(squared_numbers)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


This is a shortcut for:

numbers = range(10)
squared_numbers = []
for n in numbers:
squared_numbers.append(n ** 2)
print(squared_numbers)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


A more complex list comprehension is:

numbers = range(10)
squared_even_numbers = [n ** 2 for n in numbers if n % 2 == 0]
print(squared_even_numbers)

[0, 4, 16, 36, 64]


This is an shortcut for:

numbers = range(10)
squared_even_numbers = []
for n in numbers:
if n % 2 == 0:
squared_even_numbers.append(n ** 2)
print(squared_even_numbers)

[0, 4, 16, 36, 64]


Sometimes we want to iterate over two iterables at the same time. This can be implemented with the zip function:

values = [1, 2, 1, 2, 1, 2]
groups = [0, 0, 0, 1, 1, 1]

for v, g in zip(values, groups):
print("value", v, "is in group", g)

value 1 is in group 0
value 2 is in group 0
value 1 is in group 0
value 2 is in group 1
value 1 is in group 1
value 2 is in group 1


When "zipping" the shorter iterable determines when looping ends:

for i, j in zip(range(4), range(2)):
print(i)

0
1


To retrieve an iteraton index enumerate helps:

with open("amino_acids.csv", "r") as fh:
for i, line in enumerate(fh):
print("line", i, "has length", len(line.rstrip()))
if i >= 10:
break

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-34-2b65d046c1ff> in <module>()
----> 1 with open("amino_acids.csv", "r") as fh:
2     for i, line in enumerate(fh):
3         print("line", i, "has length", len(line.rstrip()))
4         if i >= 10:
5             break

FileNotFoundError: [Errno 2] No such file or directory: 'amino_acids.csv'

## Exercise block 5¶

1. Reproduce the examples above
2. Rewrite the script for grouping with dictionaries using zip to avoid indexed access with [].
3. (optional) Write a script which creates a plain FASTA text file which contains the 3 longest sequences a given input FASTA file. It is ok if you write the sequence as one line in the output file. In case of duplicate lengthts, like 10, 10, 9, 9, 8 you can write the top 4 sequences from this list. Hint: create a list with the lengths of the sequences, sort it and find so a threshold value for filtering the given sequences.

## Basic Plotting¶

The next section demonstrates how to plot diagrams with Python. The used library http://matplotlib.org/ is comprehensive, we just can show a few examples here. Look at the gallery http://matplotlib.org/gallery.html to see what this library offers.

## Before you read on¶

1. Check if import pylab works
2. If you see an ImportError exception read in 02_introduction_pycharm in section "Install extra libraries" how to install the missing library

### Instructions:¶

• reproduce the following examples on your computer
• play with the examples, I included some suggestions as comments in the code snippets
import pylab
import random

x_values = []
for i in range(1000):
random_number = random.normalvariate(0, 100.0) # mean 0.0 std deviation 100.0
x_values.append(random_number)

"""
play with the parameters "bins" and "color" !!!
"""
pylab.hist(x_values, bins=20, color="b")
pylab.show()

import pylab
import random

x_values = []
y_values = []
for i in range(1000):
x = random.normalvariate(0, 100.0) # mean 0.0 std deviation 100.0
y = x + random.normalvariate(0.0, 30.0)

x_values.append(x)
y_values.append(y)

"""
what happens if you use "b*" instead of "r." ??
"""
pylab.plot(x_values, y_values, "r.")   # red dots
pylab.show()


Now we group four plots:

# set the overall size of the next plot in (width, height) inches
pylab.figure(figsize=(18, 4))

pylab.subplot(1, 4, 1)   # one row, four columns, plot number 1
pylab.plot(x_values, 'y')

pylab.subplot(1, 4, 2)
pylab.hist(x_values, bins=20, color="r")

pylab.subplot(1, 4, 3)
pylab.hist(y_values, bins=20, color="g")

pylab.subplot(1, 4, 4)
pylab.plot(x_values, y_values, "b.")

pylab.show()

"""suggestions for playing:
- arange the plots in two rows and two columns
- play with the figsize paramter above
- choose "random_hist.pdf" as a file name below !
"""

# we can save this plot to disk, you should find the file in PyCharms project folder afterwards.
# If you want to save only, you can ommit the "pylab.show()" above !
pylab.savefig("random_hist.png")


## Numpy¶

numpy provides data containers for vectors and matrices:

import numpy

x = numpy.arange(0.0, 6.0, 0.25)  # similar to range: arguments are start, excluive upper limit, stepsize
print(repr(x))

array([0.  , 0.25, 0.5 , 0.75, 1.  , 1.25, 1.5 , 1.75, 2.  , 2.25, 2.5 ,
2.75, 3.  , 3.25, 3.5 , 3.75, 4.  , 4.25, 4.5 , 4.75, 5.  , 5.25,
5.5 , 5.75])


numpy vectors are more handy the usual Python lists, so we can apply a single numpy function elementwise to a vector like this:

# to apply a function to all elements of a vector:
y = numpy.sin(x)


(For large vectors this is much faster than performing the same operation with a for loop over lists)

print(y)  # without repr it prints as a regular list, but is still is an numpy array !

[ 0.          0.24740396  0.47942554  0.68163876  0.84147098  0.94898462
0.99749499  0.98398595  0.90929743  0.7780732   0.59847214  0.38166099
0.14112001 -0.10819513 -0.35078323 -0.57156132 -0.7568025  -0.89498936
-0.97753012 -0.99929279 -0.95892427 -0.85893449 -0.70554033 -0.50827908]

y2 = numpy.sin(.75 * x)
pylab.plot(x, y, label="sine")
pylab.plot(x, y2, label="slower sine")
pylab.title("two sines")
pylab.legend()
pylab.show()


### Heatmap¶

import numpy
import pylab

# we use the next statement to adapt the output format if we print numpy data)
numpy.set_printoptions(precision=4, linewidth=100)

# we create a 10 x 10 matrix with random values in range 0.0 to 1.0:
matrix = numpy.random.random((10,10))
print(matrix)

pylab.figure(figsize=(5, 5))  # so we enforce equal sized axes, remove this line and see the difference !
pylab.pcolor(matrix, cmap=pylab.cm.hot)

# play with other colormaps ! (see http://matplotlib.org/examples/color/colormaps_reference.html)
pylab.show()

[[0.8745 0.2913 0.6812 0.8116 0.5561 0.3211 0.3627 0.7007 0.1928 0.4708]
[0.8983 0.4427 0.4127 0.5456 0.4444 0.0888 0.3685 0.0462 0.9977 0.3371]
[0.037  0.2489 0.7584 0.1951 0.5764 0.8399 0.9789 0.0705 0.7282 0.3371]
[0.7199 0.2629 0.9994 0.3374 0.0374 0.8804 0.3298 0.941  0.0133 0.0496]
[0.1104 0.1201 0.1918 0.3    0.0212 0.7441 0.8856 0.2191 0.6653 0.8743]
[0.9103 0.8107 0.9036 0.5548 0.7513 0.379  0.0141 0.5196 0.3581 0.1641]
[0.0494 0.0132 0.8235 0.1342 0.4745 0.5146 0.6767 0.7864 0.2055 0.726 ]
[0.6917 0.3839 0.0767 0.4964 0.3553 0.8096 0.5948 0.5142 0.8777 0.3139]
[0.1228 0.7978 0.5501 0.8882 0.9072 0.7104 0.7393 0.1844 0.5129 0.8373]
[0.8465 0.7171 0.9909 0.7896 0.4152 0.5469 0.3418 0.4266 0.7667 0.1618]]


Try to identify the colors with cell numbers above, what do you observe ???

## Exercise block 5¶

1. Reproduce all examples above, play with them !

2. Plot two histograms in a row: a histogram of the sequence lengths in a given the FASTA file and another histrogram of the AG content of the sequences.