So-called regular expressions describe patterns that can be used to search strings.
Such a pattern is itself a string, but some characters, such as .[]?+*()^, have special meanings.
Python provides the re module for working with regular expressions. The function re.search looks for the first occurrence of the pattern in the given string:
import re
sequence = "ATGCATGC"
pattern = "GCA"  # pattern without special characters!
match = re.search(pattern, sequence)
if match is not None:
    print("first match:", match.group(), match.start(), match.end())
else:
    print("no match")
re.search returns either None or a "match object". As the example above shows, this object has methods such as group(), start() and end() that give you more information about the match.
The special symbol . stands for "any single character" (except a line break, by default), so GC. matches GCA:
sequence = "ATGCATGC"
pattern = "GC."
match = re.search(pattern, sequence)
if match is not None:
    print("first match:", match.group(), match.start(), match.end())
else:
    print("no match")
[...] means "exactly one of the characters between the brackets":
print(re.search("GC[ATG]", "GCAG") is not None)
print(re.search("GC[ATG]", "GCTG") is not None)
print(re.search("GC[ATG]", "GCCG") is not None)
To find all (non-overlapping) occurrences you can use finditer:
for match in re.finditer("GC[AT]", "GCAGAAGCTGCC"):
    print(match.group(), match.start(), match.end())
for match in re.finditer("GC.", "GCAGAAGCTGCC"):
    print(match.group(), match.start(), match.end())
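The finditer loop can be wrapped in a small helper that collects all match positions in a list. The following is only a minimal sketch; the function name find_motif_positions is our own choice for this illustration and is not part of the re module.

```python
import re

def find_motif_positions(pattern, sequence):
    """Return (matched text, start, end) for every non-overlapping match."""
    positions = []
    for match in re.finditer(pattern, sequence):
        positions.append((match.group(), match.start(), match.end()))
    return positions

hits = find_motif_positions("GC[AT]", "GCAGAAGCTGCC")
print(hits)
```

Returning a list instead of printing inside the loop makes it easy to reuse the positions later, for example to count the matches with len(hits).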
The expression | means "either or":
print(re.search("A(bcd|BCD)E", "ABCDE") is not None)
print(re.search("A(bcd|BCD)E", "AbcdE") is not None)
print(re.search("A(bcd|BCD)E", "ABcdE") is not None)
More about using regular expressions in biology: http://pythonforbiologists.com/index.php/introduction-to-python-for-biologists/regular-expressions/
range
range accepts one to three arguments:
- range(n) counts from 0 to n - 1.
- range(m, n) counts from m to n - 1.
- range(m, n, k) counts from m in steps of size k and stops before reaching n.
for i in range(2, 11, 3):
    print(i)
for i in range(6, 3, -1):
    print(i)
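To see all values a range produces at once, you can convert it to a list. The short sketch below does this for each of the argument forms; the variable names are chosen only for this illustration.

```python
one_arg = list(range(5))            # 0 up to 4
two_args = list(range(2, 6))        # 2 up to 5
with_step = list(range(2, 11, 3))   # start at 2, step 3, stop before 11
counting_down = list(range(6, 3, -1))  # a negative step counts down

print(one_arg)
print(two_args)
print(with_step)
print(counting_down)
```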
We now combine slicing and range with three arguments to split a sequence into codons:
def split_codons(sequence):
    codons = []
    for start in range(0, len(sequence), 3):
        codon = sequence[start:start + 3]
        codons.append(codon)
    return codons
print(split_codons("ATGCATGCA"))
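If the length of the sequence is not a multiple of three, the last slice is simply shorter than three characters. One possible variant that skips such an incomplete codon is sketched below; the name split_complete_codons is made up for this illustration.

```python
def split_complete_codons(sequence):
    """Split a sequence into codons and skip an incomplete trailing codon."""
    codons = []
    # stop early enough that every slice has exactly three characters
    for start in range(0, len(sequence) - 2, 3):
        codons.append(sequence[start:start + 3])
    return codons

full = split_complete_codons("ATGCATGCA")     # length 9: three full codons
shortened = split_complete_codons("ATGCATGC")  # length 8: trailing "GC" is skipped
print(full)
print(shortened)
```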
Functions may compute more than one result. In this case you have to list the values separated by , after return:
def sum_and_diff(a, b):
    return a + b, a - b
x, y = sum_and_diff(10, 3)
print(x, y)
You can see above that we have to use x, y = sum_and_diff(10, 3) to receive both return values in given order. x will be the value of a + b and y will be a - b.
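Behind the scenes, a function with several return values returns one single tuple, and the assignment x, y = ... unpacks it. The following sketch makes this visible by storing the result in one variable first.

```python
def sum_and_diff(a, b):
    return a + b, a - b

result = sum_and_diff(10, 3)   # no unpacking: result is one tuple
print(result)
print(result[0], result[1])    # the values can be accessed by index

x, y = result                  # unpacking the tuple into two names
print(x, y)
```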
Python provides some shortcuts for common algebraic operations:
- x += m is the same as x = x + m
- x -= m is the same as x = x - m
- x *= m is the same as x = x * m
- x /= m is the same as x = x / m
Here is an example which combines algebraic updates and multiple return values:
def statistics(li):
    count = 0
    acc = 0
    for element in li:
        count += 1
        acc += element
    return count, acc, acc / count
# we use the name "total" because "sum" would hide Python's built-in sum function
count, total, average = statistics([1, 2, 3, 4, 5, 6])
print("counted", count, "numbers having sum", total, "and average", average)
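The same pattern of updating counters with += is useful for sequences as well. The sketch below counts bases in a DNA string and returns two values; the function name count_bases and the grouping into G/C and A/T are assumptions made for this example.

```python
def count_bases(sequence):
    """Count how often G or C and how often A or T occur in sequence."""
    gc = 0
    at = 0
    for base in sequence:
        if base in "GC":
            gc += 1
        elif base in "AT":
            at += 1
    return gc, at

gc, at = count_bases("ATGCATGC")
print("G/C:", gc, "A/T:", at)
```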
print
We already introduced the form print(..., file=fh) to write to a file instead of displaying the output on the console. Beyond that, print has other so-called named arguments.
Before we introduce other named arguments, we state:
- print automatically outputs a \n after execution, so the next print will display its output on a new line
- print() creates an empty line
- print with multiple arguments separates them by a single space " "
You can observe this in the following example:
print(3)
print()
print(1, 2, 3)
To modify some of these properties, the named parameters sep and end come into play. sep allows separators other than a single blank character:
print(1, 2, 3, sep=", ")
And end overrides the default \n, for example to avoid the line break:
print(1, 2, 3, end=" ")
print(4)
print(1, 2, 3, end="")
print(4)
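sep, end and file can be combined in one call. The sketch below writes a few lines into an io.StringIO object, which behaves like a file opened for writing, so that the exact characters produced by print can be inspected. The column names and values are invented for this illustration.

```python
import io

buffer = io.StringIO()   # behaves like a file opened for writing

print("name", "length", "gc", sep=";", file=buffer)
print("seq1", 9, 0.44, sep=";", file=buffer)
print("a", "b", end="", file=buffer)   # no line break after this print
print("c", file=buffer)

text = buffer.getvalue()
print(repr(text))   # repr makes the \n characters visible
```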
We can use this to print a multiplication table:
def print_table(up_to):
    for row in range(1, up_to + 1):
        for col in range(1, up_to + 1):
            cell_value = row * col
            print(cell_value, end=" ")
        print()
print_table(9)
The table looks a bit ugly, so we add a leading space for numbers with one digit:
def print_pretty_table(up_to):
    for row in range(1, up_to + 1):
        for col in range(1, up_to + 1):
            cell_value = row * col
            if cell_value < 10:
                print(" ", end="")
            print(cell_value, end=" ")
        print()
print_pretty_table(9)
The expression between the brackets of print in the following example is called "string interpolation":
def greet(name):
    print("hi %s how do you do" % name)
greet("bart simpson")
Explanation: The %s is a placeholder. When Python evaluates "hi %s how do you do" % name, the value of name is inserted at this position.
For multiple placeholders in the template you have to put the values in round brackets after %. The placeholders are filled with the given values in order:
def print_sum(a, b):
    print("%s plus %s is %s" % (a, b, a + b))
print_sum(3, 4)
%s is one of many possible placeholders.
Another placeholder has the form %.nf, where n is a number and the letter f is fixed. It formats a number with n digits after the decimal point:
import math
# print only two digits after the decimal point
print("pi with two digits is %.2f" % math.pi)
print("%.1f" % 3)
The form %m.nf formats with n digits after the decimal point and pads with spaces on the left so that the full result is at least m characters wide:
print("%5.2f" % math.pi)
print("%5.2f" % 11.3)
print("%5.2f" % 121.3)
For integer numbers you can use %nd for padding up to n characters:
print("%3d" % 4)
print("%3d" % 44)
%e and its variations print in scientific notation:
print("%e" % math.pi)
print("%.2e" % math.pi)
print("%10.2e" % math.pi)
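Several placeholders with different widths can be mixed in one template, which is handy for aligned table rows. The following sketch assumes a row consisting of a name, an integer length and a fraction; the function name format_row and the chosen column widths are our own for this illustration.

```python
def format_row(name, length, gc_fraction):
    """Format one row: name, length padded to 5 characters,
    fraction padded to 6 characters with 2 digits after the decimal point."""
    return "%s %5d %6.2f" % (name, length, gc_fraction)

row = format_row("seq1", 123, 0.4567)
print(row)
print(format_row("seq2", 7, 0.5))
```

Because %5d and %6.2f pad to a fixed width, the numbers of both rows end in the same columns.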
- Can you rewrite print_pretty_table without the if by using string interpolation instead?
- Write a function print_triangle(n) which displays a symmetric triangle of height n. Assume that n is an odd number. So print_triangle(5) displays
    *
   ***
  *****
 *******
*********
For common operations on lists Python provides some convenience functions. They replace some computations we exercised up to now.
To compute the maximum or minimum of a given list, use the functions max and min:
print(max([2, 3, 1]))
print(min([2, 3, 1]))
Beyond that sorted computes a sorted list from a given list:
print(sorted([3, 1, 2]))
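Note that sorted builds a new list and leaves the original list unchanged. The short sketch below shows this, together with max and min; the list of lengths is invented for this illustration.

```python
lengths = [12, 7, 30, 7]

ordered = sorted(lengths)              # a new, sorted list
spread = max(lengths) - min(lengths)   # difference between largest and smallest

print(ordered)
print(lengths)   # the original list keeps its order
print(spread)
```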
Many tasks with lists, such as transforming or filtering a list, can be expressed by so-called list comprehensions. Here are a few examples:
numbers = range(10)
squared_numbers = [n ** 2 for n in numbers]
print(squared_numbers)
This is a shortcut for:
numbers = range(10)
squared_numbers = []
for n in numbers:
    squared_numbers.append(n ** 2)
print(squared_numbers)
A more complex list comprehension is:
numbers = range(10)
squared_even_numbers = [n ** 2 for n in numbers if n % 2 == 0]
print(squared_even_numbers)
This is a shortcut for:
numbers = range(10)
squared_even_numbers = []
for n in numbers:
    if n % 2 == 0:
        squared_even_numbers.append(n ** 2)
print(squared_even_numbers)
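List comprehensions work with any kind of list, not only with numbers. The sketch below applies the transform and filter patterns to a list of strings; the example sequences are invented for this illustration.

```python
sequences = ["ATGCA", "GG", "ATATAT", "GCGA"]

# transform: compute one length per sequence
lengths = [len(s) for s in sequences]

# filter: keep only sequences that start with "G"
starting_with_g = [s for s in sequences if s.startswith("G")]

# transform and filter combined: lengths of the sequences starting with "G"
lengths_of_g = [len(s) for s in sequences if s.startswith("G")]

print(lengths)
print(starting_with_g)
print(lengths_of_g)
```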
Sometimes we want to iterate over two iterables at the same time. This can be implemented with the zip function:
values = [1, 2, 1, 2, 1, 2]
groups = [0, 0, 0, 1, 1, 1]
for v, g in zip(values, groups):
    print("value", v, "is in group", g)
When "zipping" the shorter iterable determines when looping ends:
for i, j in zip(range(4), range(2)):
    print(i)
To retrieve an iteration index, enumerate helps:
with open("amino_acids.csv", "r") as fh:
    for i, line in enumerate(fh):
        print("line", i, "has length", len(line.rstrip()))
        if i >= 10:
            break
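zip and enumerate also work on plain lists and can be combined. The following sketch converts their results to lists to make the produced pairs visible; the names and sequences are invented for this illustration.

```python
names = ["seq1", "seq2", "seq3"]
sequences = ["ATGC", "GG", "ATATAT"]

pairs = list(zip(names, sequences))   # pairs of (name, sequence)
numbered = list(enumerate(names))     # pairs of (index, name)

labels = []
# enumerate delivers the index, zip delivers the pair (name, seq)
for i, (name, seq) in enumerate(zip(names, sequences)):
    labels.append("%d: %s has length %d" % (i, name, len(seq)))

print(pairs)
print(numbered)
print(labels)
```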
- ... zip to avoid indexed access with [].
- ... 10, 10, 9, 9, 8 you can write the top 4 sequences from this list. Hint: create a list with the lengths of the sequences, sort it, and use it to find a threshold value for filtering the given sequences.
The next section demonstrates how to plot diagrams with Python. The library used, matplotlib (http://matplotlib.org/), is comprehensive, so we can show only a few examples here. Look at the gallery http://matplotlib.org/gallery.html to see what this library offers.
- Check if import pylab works.
- If you get an ImportError exception, read in 02_introduction_pycharm, section "Install extra libraries", how to install the missing library.
import pylab
import random
x_values = []
for i in range(1000):
    random_number = random.normalvariate(0, 100.0) # mean 0.0 std deviation 100.0
    x_values.append(random_number)
"""
play with the parameters "bins" and "color" !!!
"""
pylab.hist(x_values, bins=20, color="b")
pylab.show()
import pylab
import random
x_values = []
y_values = []
for i in range(1000):
    x = random.normalvariate(0, 100.0) # mean 0.0 std deviation 100.0
    y = x + random.normalvariate(0.0, 30.0)
    x_values.append(x)
    y_values.append(y)
"""
what happens if you use "b*" instead of "r." ??
"""
pylab.plot(x_values, y_values, "r.") # red dots
pylab.show()
Now we group four plots:
# set the overall size of the next plot in (width, height) inches
pylab.figure(figsize=(18, 4))
pylab.subplot(1, 4, 1) # one row, four columns, plot number 1
pylab.plot(x_values, 'y')
pylab.subplot(1, 4, 2)
pylab.hist(x_values, bins=20, color="r")
pylab.subplot(1, 4, 3)
pylab.hist(y_values, bins=20, color="g")
pylab.subplot(1, 4, 4)
pylab.plot(x_values, y_values, "b.")
pylab.show()
"""suggestions for playing:
- arrange the plots in two rows and two columns
- play with the figsize parameter above
- choose "random_hist.pdf" as a file name below !
"""
# we can save this plot to disk, you should find the file in PyCharm's project folder afterwards.
# If you only want to save, you can omit the "pylab.show()" above !
# Note: some backends discard the figure once pylab.show() returns; if the saved file is empty, call savefig before show.
pylab.savefig("random_hist.png")
numpy provides data containers for vectors and matrices:
import numpy
x = numpy.arange(0.0, 6.0, 0.25) # similar to range: arguments are start, exclusive upper limit, step size
print(repr(x))
numpy vectors are handier than the usual Python lists: we can apply a single numpy function elementwise to a vector like this:
# to apply a function to all elements of a vector:
y = numpy.sin(x)
(For large vectors this is much faster than performing the same operation with a for loop over lists.)
print(y) # without repr it prints like a regular list, but it is still a numpy array !
y2 = numpy.sin(.75 * x)
pylab.plot(x, y, label="sine")
pylab.plot(x, y2, label="slower sine")
pylab.title("two sines")
pylab.legend()
pylab.show()
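To make the meaning of "elementwise" concrete, the sketch below computes the same sines once with numpy.sin on the whole vector and once with a for loop and math.sin, and also shows that arithmetic operators act elementwise on numpy arrays. The short vector and the variable names are chosen only for this illustration.

```python
import math
import numpy

x = numpy.arange(0.0, 1.0, 0.25)   # 0.0, 0.25, 0.5, 0.75

# elementwise with numpy: one call for the whole vector
y = numpy.sin(x)

# the same computation with a plain Python list and a loop
y_list = []
for value in x:
    y_list.append(math.sin(value))

# arithmetic operators also work elementwise on numpy arrays
z = 2 * x + 1

print(y)
print(y_list)
print(z)
```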
import numpy
import pylab
# we use the next statement to adapt the output format when we print numpy data
numpy.set_printoptions(precision=4, linewidth=100)
# we create a 10 x 10 matrix with random values in range 0.0 to 1.0:
matrix = numpy.random.random((10,10))
print(matrix)
pylab.figure(figsize=(5, 5)) # so we enforce equal sized axes, remove this line and see the difference !
pylab.pcolor(matrix, cmap=pylab.cm.hot)
# play with other colormaps ! (see http://matplotlib.org/examples/color/colormaps_reference.html)
pylab.show()
Try to match the colors with the cell values printed above. What do you observe?
Reproduce all examples above and play with them!
Plot two histograms in a row: a histogram of the sequence lengths in a given FASTA file and another histogram of the AG content of the sequences.