Regular expressions describe patterns that can be used to search strings. Such a pattern is itself a string, but some characters, such as .[]?+*()^, have special meanings.
Python provides the re module for working with regular expressions. The function re.search looks for the first occurrence of the pattern in the given string:
import re
sequence = "ATGCATGC"
pattern = "GCA"  # pattern without special characters
match = re.search(pattern, sequence)
if match is not None:
    print("first match:", match.group(), match.start(), match.end())
else:
    print("no match")
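re.search returns either None or a "match object". As the example above shows, this object has methods that give you more information about the match: group() returns the matched text, start() the index where the match begins and end() the index just after its last character.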
The special symbol . stands for "any character" (except a line break). So GC. matches GCA:
sequence = "ATGCATGC"
pattern = "GC."
match = re.search(pattern, sequence)
if match is not None:
    print("first match:", match.group(), match.start(), match.end())
else:
    print("no match")
[...] means "exactly one of the characters between the brackets":
print(re.search("GC[ATG]", "GCAG") != None)
print(re.search("GC[ATG]", "GCTG") != None)
print(re.search("GC[ATG]", "GCCG") != None)
To find all occurrences you can use re.finditer:
for match in re.finditer("GC[AT]", "GCAGAAGCTGCC"):
    print(match.group(), match.start(), match.end())
for match in re.finditer("GC.", "GCAGAAGCTGCC"):
    print(match.group(), match.start(), match.end())
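If only the positions or only the matched texts are needed, the loop can be shortened. The following is a minimal sketch that reuses the sequence from the example above; the variable names positions and hits are chosen here for illustration only:

```python
import re

sequence = "GCAGAAGCTGCC"

# collect the start index of every (non-overlapping) match
positions = []
for match in re.finditer("GC[AT]", sequence):
    positions.append(match.start())
print(positions)

# re.findall returns the matched texts directly as a list of strings
hits = re.findall("GC[AT]", sequence)
print(hits)
```

Note that both functions report non-overlapping matches only: after a match, the search continues behind its last character.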
The expression | means "either or":
print(re.search("A(bcd|BCD)E", "ABCDE") != None)
print(re.search("A(bcd|BCD)E", "AbcdE") != None)
print(re.search("A(bcd|BCD)E", "ABcdE") != None)
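The | symbol also works without round brackets; it then separates whole alternatives. As a small sketch, here is a search for the first of the three stop codons in a made-up sequence (the sequence and the variable name found are assumptions for illustration):

```python
import re

sequence = "ATGCCCTAGGG"  # made-up sequence for illustration

found = None
match = re.search("TAA|TAG|TGA", sequence)
if match is not None:
    # remember the matched text and the index where it starts
    found = (match.group(), match.start())
print("first stop codon:", found)
```

Keep in mind that the pattern knows nothing about reading frames: it reports the first position where one of the three alternatives occurs, whether or not that position is a multiple of three.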
More about using regular expressions in biology: http://pythonforbiologists.com/index.php/introduction-to-python-for-biologists/regular-expressions/
range
range accepts one to three arguments:
- range(n) counts from 0 to n - 1.
- range(m, n) counts from m to n - 1.
- range(m, n, k) counts from m in steps of size k and stops before reaching n.
for i in range(2, 11, 3):
    print(i)
for i in range(6, 3, -1):
    print(i)
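To check quickly which numbers a range produces, you can convert it to a list. A short sketch (the variable names are chosen here for illustration):

```python
# list(...) collects all numbers of a range so that we can inspect them
up_to_five = list(range(5))
with_step = list(range(2, 11, 3))
backwards = list(range(6, 3, -1))
empty = list(range(3, 3))  # start equals the exclusive upper limit

print(up_to_five)
print(with_step)
print(backwards)
print(empty)
```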
We now combine slicing and range with three arguments to split a sequence into codons:
def split_codons(sequence):
    codons = []
    for start in range(0, len(sequence), 3):
        codon = sequence[start:start + 3]
        codons.append(codon)
    return codons
print(split_codons("ATGCATGCA"))
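If the length of the sequence is not a multiple of three, the last slice is shorter than a codon, because slicing past the end of a string is not an error. The sketch below repeats the function from above and then drops incomplete codons; the filtering step is an addition for illustration, not part of the original function:

```python
def split_codons(sequence):
    codons = []
    for start in range(0, len(sequence), 3):
        codons.append(sequence[start:start + 3])
    return codons

codons = split_codons("ATGCATGC")  # 8 characters: the last piece has length 2
print(codons)

# keep only pieces which are full codons
complete = []
for codon in codons:
    if len(codon) == 3:
        complete.append(codon)
print(complete)
```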
Functions may compute more than one result. In this case, list the values after return, separated by commas:
def sum_and_diff(a, b):
    return a + b, a - b
x, y = sum_and_diff(10, 3)
print(x, y)
As you can see above, we write x, y = sum_and_diff(10, 3) to receive both return values in the given order: x will be the value of a + b and y will be the value of a - b.
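Behind the scenes such a function returns a single tuple, which the assignment unpacks into the given names. A short sketch, which also uses the built-in function divmod as a second example of a function with two return values:

```python
def sum_and_diff(a, b):
    return a + b, a - b

# without unpacking we receive one tuple holding both values
result = sum_and_diff(10, 3)
print(result)
print(result[0], result[1])

# the built-in divmod returns quotient and remainder of an integer division
quotient, remainder = divmod(17, 5)
print(quotient, remainder)
```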
Python provides some shortcuts for common algebraic operations:
- x += m is the same as x = x + m
- x -= m is the same as x = x - m
- x *= m is the same as x = x * m
- x /= m is the same as x = x / m
Here is an example which combines algebraic updates and multiple return values:
def statistics(li):
    count = 0
    acc = 0
    for element in li:
        count += 1
        acc += element
    return count, acc, acc / count
# we use the name "total" so that the built-in function sum is not overwritten
count, total, average = statistics([1, 2, 3, 4, 5, 6])
print("counted", count, "numbers having sum", total, "and average", average)
print
We already introduced the form print(..., file=fh) to write to a file instead of displaying the output on the console. Beyond that, print has other so-called named arguments.
Before we introduce them, we state:
- print automatically outputs a \n at the end, so the next print starts on a new line.
- print() creates an empty line.
- print with multiple arguments separates them by a single space " ".
You can observe this in the following example:
print(3)
print()
print(1, 2, 3)
To modify some of these properties, the named parameters sep and end come into play. sep allows separators other than a single space:
print(1, 2, 3, sep=", ")
And end overrides the default \n, for example to avoid the line break:
print(1, 2, 3, end=" ")
print(4)
print(1, 2, 3, end="")
print(4)
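sep and end can be combined with each other and with the file argument. The following sketch writes into an io.StringIO object, which behaves like an opened file but keeps the text in memory, so that we can inspect exactly which characters print produced (the names buffer and text are chosen here for illustration):

```python
import io

buffer = io.StringIO()  # in-memory replacement for a file handle

# "-" between the arguments, "!" followed by a line break at the end
print("A", "C", "G", "T", sep="-", end="!\n", file=buffer)

text = buffer.getvalue()
print(repr(text))  # repr makes the line break visible as \n
```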
We can use this to print a multiplication table:
def print_table(up_to):
    for row in range(1, up_to + 1):
        for col in range(1, up_to + 1):
            cell_value = row * col
            print(cell_value, end=" ")
        print()
print_table(9)
The table looks a bit ugly, so we add a leading space for numbers with one digit:
def print_pretty_table(up_to):
    for row in range(1, up_to + 1):
        for col in range(1, up_to + 1):
            cell_value = row * col
            if cell_value < 10:
                print(" ", end="")
            print(cell_value, end=" ")
        print()
print_pretty_table(9)
The expression between the parentheses of print in the next example is called "string interpolation":
def greet(name):
    print("hi %s how do you do" % name)
greet("bart simpson")
Explanation: %s is a placeholder. When Python evaluates "hi %s how do you do" % name, the value of name is inserted at this position.
For multiple placeholders in the template, put the values in round brackets after %. The placeholders are filled with the given values in order:
def print_sum(a, b):
    print("%s plus %s is %s" % (a, b, a + b))
print_sum(3, 4)
%s is one of many possible placeholders.
Another placeholder has the form %.nf, where n is a number and f is fixed. It formats a number with n digits after the decimal point:
import math
# print only two digits after the decimal point
print("pi with two digits is %.2f" % math.pi)
print("%.1f" % 3)
The form %m.nf formats with n digits after the decimal point and pads with spaces from the left so that the full result is at least m characters wide:
print("%5.2f" % math.pi)
print("%5.2f" % 11.3)
print("%5.2f" % 121.3)
For integers you can use %nd to pad with spaces from the left up to a width of n characters:
print("%3d" % 4)
print("%3d" % 44)
%e and its variations print in scientific notation:
print("%e" % math.pi)
print("%.2e" % math.pi)
print("%10.2e" % math.pi)
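Several of these placeholders can be mixed in one template. A small sketch with made-up values (the variable name row is chosen here for illustration):

```python
# integer padded to 3 characters, float with 2 digits padded to 6 characters,
# and a number in scientific notation with one digit after the decimal point
row = "%3d %6.2f %.1e" % (7, 2.5, 12345.678)
print(row)
print(len(row))
```

The result of the interpolation is an ordinary string, so it can be stored in a variable or written to a file instead of being printed directly.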
Can you write print_pretty_table without the if by using string interpolation instead?
Write a function print_triangle(n) which displays a symmetric triangle of height n. Assume that n is an odd number. So print_triangle(5) displays
    *
   ***
  *****
 *******
*********
For common operations on lists Python provides some convenience functions. They replace some computations we exercised up to now.
To compute the maximum or minimum of a given list, the functions max and min exist:
print(max([2, 3, 1]))
print(min([2, 3, 1]))
Beyond that, sorted computes a sorted list from a given list:
print(sorted([3, 1, 2]))
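sorted returns a new list and leaves the given list unchanged. The sketch below also shows the named argument reverse for descending order and the built-in functions len and sum; the list of numbers is made up for illustration:

```python
lengths = [10, 8, 10, 9]  # made-up values

descending = sorted(lengths, reverse=True)  # largest value first
print(descending)
print(lengths)  # the original list keeps its order

# len and sum replace the counting and accumulating loops we wrote before
average = sum(lengths) / len(lengths)
print(average)
```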
Many tasks with lists, such as transforming or filtering a list, can be expressed by so-called list comprehensions. Here are a few examples:
numbers = range(10)
squared_numbers = [n ** 2 for n in numbers]
print(squared_numbers)
This is a shortcut for:
numbers = range(10)
squared_numbers = []
for n in numbers:
    squared_numbers.append(n ** 2)
print(squared_numbers)
A more complex list comprehension is:
numbers = range(10)
squared_even_numbers = [n ** 2 for n in numbers if n % 2 == 0]
print(squared_even_numbers)
This is a shortcut for:
numbers = range(10)
squared_even_numbers = []
for n in numbers:
    if n % 2 == 0:
        squared_even_numbers.append(n ** 2)
print(squared_even_numbers)
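List comprehensions work with any kind of elements, not only with numbers. A short sketch with a made-up list of sequences:

```python
sequences = ["ATG", "ATGCA", "AT"]  # made-up data for illustration

# transform: compute the length of every sequence
lengths = [len(s) for s in sequences]
print(lengths)

# filter: keep only sequences with at least three characters
long_enough = [s for s in sequences if len(s) >= 3]
print(long_enough)
```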
Sometimes we want to iterate over two iterables at the same time. This can be implemented with the zip function:
values = [1, 2, 1, 2, 1, 2]
groups = [0, 0, 0, 1, 1, 1]
for v, g in zip(values, groups):
    print("value", v, "is in group", g)
When "zipping" the shorter iterable determines when looping ends:
for i, j in zip(range(4), range(2)):
    print(i)
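Strings are iterables too, so zip can walk through two sequences character by character. As a sketch, we count the positions at which two made-up sequences of equal length differ:

```python
seq_a = "ATGCA"  # made-up sequences for illustration
seq_b = "ATGGA"

mismatches = 0
for a, b in zip(seq_a, seq_b):
    # a and b are the characters at the same position in both sequences
    if a != b:
        mismatches += 1
print("number of mismatches:", mismatches)
```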
To retrieve an iteration index, enumerate helps:
with open("amino_acids.csv", "r") as fh:
    for i, line in enumerate(fh):
        print("line", i, "has length", len(line.rstrip()))
        if i >= 10:
            break
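enumerate starts counting at 0 and works with any iterable, not only with files. If you prefer to count from 1, you can pass the named argument start. A small sketch with a made-up list:

```python
codons = ["ATG", "CAT", "GCA"]  # made-up data for illustration

numbered = []
for number, codon in enumerate(codons, start=1):
    # number runs from 1 instead of the default 0
    numbered.append("%d: %s" % (number, codon))
print(numbered)
```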
zip to avoid indexed access with [].
10, 10, 9, 9, 8 you can write the top 4 sequences from this list. Hint: create a list with the lengths of the sequences, sort it and use it to find a threshold value for filtering the given sequences.
The next section demonstrates how to plot diagrams with Python. The library used, http://matplotlib.org/, is comprehensive, so we can show only a few examples here. Look at the gallery http://matplotlib.org/gallery.html to see what this library offers.
Check whether import pylab works. If you get an ImportError exception, read in 02_introduction_pycharm, section "Install extra libraries", how to install the missing library.
import pylab
import random
x_values = []
for i in range(1000):
    random_number = random.normalvariate(0, 100.0)  # mean 0.0, std deviation 100.0
    x_values.append(random_number)
# play with the parameters "bins" and "color"!
pylab.hist(x_values, bins=20, color="b")
pylab.show()
import pylab
import random
x_values = []
y_values = []
for i in range(1000):
    x = random.normalvariate(0, 100.0)  # mean 0.0, std deviation 100.0
    y = x + random.normalvariate(0.0, 30.0)
    x_values.append(x)
    y_values.append(y)
# what happens if you use "b*" instead of "r."?
pylab.plot(x_values, y_values, "r.")  # red dots
pylab.show()
Now we group four plots:
# set the overall size of the next plot in (width, height) inches
pylab.figure(figsize=(18, 4))
pylab.subplot(1, 4, 1) # one row, four columns, plot number 1
pylab.plot(x_values, 'y')
pylab.subplot(1, 4, 2)
pylab.hist(x_values, bins=20, color="r")
pylab.subplot(1, 4, 3)
pylab.hist(y_values, bins=20, color="g")
pylab.subplot(1, 4, 4)
pylab.plot(x_values, y_values, "b.")
"""suggestions for playing:
- arrange the plots in two rows and two columns
- play with the figsize parameter above
- choose "random_hist.pdf" as a file name below!
"""
# we can save this plot to disk, you should find the file in PyCharm's project folder afterwards.
# savefig is called before show, because the saved figure may be empty once the plot has been shown.
# If you want to save only, you can omit the "pylab.show()" below!
pylab.savefig("random_hist.png")
pylab.show()
numpy provides data containers for vectors and matrices:
import numpy
x = numpy.arange(0.0, 6.0, 0.25)  # similar to range: arguments are start, exclusive upper limit, step size
print(repr(x))
numpy vectors are handier than the usual Python lists. For example, we can apply a single numpy function elementwise to a vector like this:
# to apply a function to all elements of a vector:
y = numpy.sin(x)
(For large vectors this is much faster than performing the same operation with a for loop over lists.)
print(y)  # without repr the output resembles a list, but y is still a numpy array!
y2 = numpy.sin(.75 * x)
pylab.plot(x, y, label="sine")
pylab.plot(x, y2, label="slower sine")
pylab.title("two sines")
pylab.legend()
pylab.show()
import numpy
import pylab
# we use the next statement to adapt the output format when we print numpy data
numpy.set_printoptions(precision=4, linewidth=100)
# we create a 10 x 10 matrix with random values in range 0.0 to 1.0:
matrix = numpy.random.random((10,10))
print(matrix)
pylab.figure(figsize=(5, 5)) # so we enforce equal sized axes, remove this line and see the difference !
pylab.pcolor(matrix, cmap=pylab.cm.hot)
# play with other colormaps ! (see http://matplotlib.org/examples/color/colormaps_reference.html)
pylab.show()
Try to match the colors with the cell values printed above. What do you observe?
Reproduce all examples above, play with them !
Plot two histograms in a row: a histogram of the sequence lengths in a given FASTA file and another histogram of the AG content of the sequences.