Script 5

Recap from script 4

We learned:

  • looping with for .. in range(..):
  • nesting loops and branching with if and friends
  • looping with while
  • breaking out of loops with break
  • infinite while loops

Recap about strings

We introduced strings in the second script. Here is a short summary about strings:

  • strings are another data type in Python (like integers and floating point numbers)
  • strings represent text (more precisely: sequences of characters)
  • in order to distinguish program code and string data we have to delimit strings with "
  • we can compute the length of a string using the len function
  • if we add strings with + they are concatenated
  • the number 123 and the string "123" are different things although they result in the same output when printed
  • the input function always returns a string, so we have to use type conversion if we ask the user for numbers which are used in subsequent numerical computations
  • we can access single characters of a string using [].

If you do not understand all points of this list, first repeat the introduction to strings in the second script !

More about strings

When we said that strings are delimited by double quotes " we told only part of the truth. We can also choose other delimiters: ' (single quote), """ (three double quotes) and ''' (three single quotes).

The only restriction is that we use the same delimiter on both ends of the string.

This is handy as " itself is a character and thus may occur as a part of the string.

print('here we have a " in a string')
here we have a " in a string

Using " as delimiter in this example would confuse Python: the " inside the string would be interpreted as the closing delimiter, and the remaining text in a string") would be read as Python code, which is syntactically incorrect:

print("here we have a " in a string")
  File "<ipython-input-3-d50e57f93b06>", line 1
    print("here we have a " in a string")
                                      ^
SyntaxError: invalid syntax

This works the other way round too:

print("here we have a ' in a string")
here we have a ' in a string
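
A small sketch to check that the delimiters are not part of the string: the same text written with different delimiters gives equal strings. The variable names a, b and c are chosen only for this illustration.

```python
# the same text written with three different delimiters
a = "GCA"
b = 'GCA'
c = """GCA"""

# the delimiters are not stored in the string, so all three are equal
print(a == b)
print(b == c)
print(len(a), len(b), len(c))
```

Both comparisons print True and all three lengths are 3.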

There are some "special" characters, which are encoded using the so-called escape character \. For example, the two characters \n result in a line break when printed:

print("hi\nyou")
hi
you

Although we type two characters, \n is interpreted as a single character, the so-called new line character:

print(len("hi\nyou"))
6

To "see" what a string really "contains", we can use the repr function:

a = "line 1\nline 2"
print(a)
print(repr(a))
line 1
line 2
'line 1\nline 2'

\n is the most used special character and is the only one we will face in this course.

This short excursion to special characters helps to explain how the delimiters """ and ''' work: in contrast to the single character delimiters, they can delimit strings which span multiple lines !

sequence = """GCA ATC GCT TTA GGA CCT
              GCA ATC GCT TTA GGA CCT"""

print(repr(sequence))
'GCA ATC GCT TTA GGA CCT\n              GCA ATC GCT TTA GGA CCT'

If you look at the output you can see the line break followed by some spaces.

To avoid these spaces, the second line of the string must start at the very beginning of the line. The following snippet is valid Python code:

sequence = """GCA ATC GCT TTA GGA CCT
GCA ATC GCT TTA GGA CCT"""

print(repr(sequence))
'GCA ATC GCT TTA GGA CCT\nGCA ATC GCT TTA GGA CCT'
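
To make the difference between the two variants visible without reading the repr output character by character, we can compare lengths. This is only a sketch. The names indented and flush are made up for this illustration, and the strings are shortened.

```python
# the second line is indented inside the string
indented = """GCA ATC
    GCT TTA"""

# the second line starts at the very beginning of the line
flush = """GCA ATC
GCT TTA"""

print(len(indented))
print(len(flush))

# the difference is the number of leading spaces on the second line
print(len(indented) - len(flush))
```

The indentation inside a multi line string is part of the data, so the first string is longer than the second one.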

Comparing strings

We can compare strings the same way as we did it for numbers:

print("abc" == "ABC")
False
print("abc" != "ABC")
True

Using < for strings works as well. We consider one string to be smaller than another string if the first string would appear before the second string in a phone book:

print("abcde" < "abcdfg")
True

This is called lexicographical or phone book ordering.

In this system capital letters are smaller than lower case letters:

print("ABC" < "abc")
True

Other comparison operators work the same way:

print("abc" >= "ABC")
True
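
A few more comparisons may help to see how the phone book ordering works: the first position where the two strings differ decides, and if one string is the beginning of the other, the shorter one is smaller. The example strings are chosen only for illustration.

```python
# the first difference is at index 1: "b" comes before "c"
print("abz" < "aca")

# "abc" is the beginning of "abcd", so the shorter string is smaller
print("abc" < "abcd")

# all capital letters come before all lower case letters
print("Z" < "a")
```

All three lines print True.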

Exercise 1

  • Type the examples above and run them. Again: match the code examples and the displayed output !

String methods

Attention: Before we go on, make sure that you know what the argument(s) and the return value of a function are. If not, repeat the corresponding section in the second script !

When we introduced the import statement we said that the imported functions and values are attributes of the module. So cos and pi are attributes of the module math:

import math
print(math.cos(math.pi))
-1.0

In general when we see expressions like x.y in Python we say y is an attribute of x.

Not only modules have attributes, most data types in Python have attributes as well ! So the str type has attributes which can be used like functions. This kind of attribute is called a method.

For example the count method takes one argument and returns an integer value:

sequence = "GCA ATC GCT TTA GGA CCT"
print(sequence.count("G"))
4

Here we counted the number of occurrences of "G" in sequence.

Another example is the upper method which has zero arguments and computes a new string where all alphabetical characters are converted to upper case letters:

x = "Hi You !"
y = x.upper()
print(y)
HI YOU !

If you call a function or method with zero arguments you still have to use () to call it, as we did in the previous example. If you forget this, Python behaves as follows:

print("hi".upper)
<built-in method upper of str object at 0x1021e3618>

This tells you that upper is a method of the string "hi" but does not execute the method because we forgot to append () for calling this method.

Another helpful method is replace which takes two arguments and computes a new string:

sequence = "GCA ATC GCT TTA GGA CCT"
print(sequence.replace(" ", "-"))
GCA-ATC-GCT-TTA-GGA-CCT

Here all occurrences of spaces were replaced by -. This can be used to delete certain characters. In the following code snippet we replace spaces by empty strings:

sequence = "GCA ATC GCT TTA GGA CCT"
print(sequence.replace(" ", ""))
GCAATCGCTTTAGGACCT
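
Note that replace does not modify the string it is called on. A minimal sketch to check this, where cleaned is a name introduced only for this example:

```python
sequence = "GCA ATC GCT TTA GGA CCT"

# replace computes a new string, which we store in a second variable
cleaned = sequence.replace(" ", "")

print(sequence)   # still contains the spaces
print(cleaned)    # spaces removed
```

If we want to keep the result, we have to assign it to a variable (which may also be sequence itself).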

We can call methods directly on strings:

print("Hi".upper())
HI

And we can assign the results of a method call to variables as usual:

greeting = "Hi You !"
upper_greeting = greeting.upper()
print(upper_greeting)
HI YOU !

Furthermore we can chain an arbitrary number of method calls, which are executed from left to right:

print("abcA".upper().count("A"))
2

The evaluation in the previous example is as follows:

  1. "abcA".upper() evaluates to "ABCA" which is a new (intermediate) string
  2. the count("A") method of this intermediate string is called which evaluates to 2.
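
The same evaluation can be written out step by step with a variable for the intermediate string. This is only a sketch. The names intermediate and result are illustrative.

```python
# step 1: upper() computes a new string
intermediate = "abcA".upper()
print(intermediate)

# step 2: count("A") is called on the intermediate string
result = intermediate.count("A")
print(result)

# the chained form gives the same value
print(result == "abcA".upper().count("A"))
```

This prints ABCA, 2 and True.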

Another example is to "clean up" multi line strings:

sequence = """GCA ATC GCT TTA GGA CCT
              GCA ATC GCT TTA GGA CCT"""

short_sequence = sequence.replace("\n", "").replace(" ", "")
print(short_sequence)
GCAATCGCTTTAGGACCTGCAATCGCTTTAGGACCT

Exercise block 2

  1. Reproduce the examples above.
  2. Write a program which asks the user for some text and then detects if all given characters are upper case. (Hint: a character is upper case if it stays the same when transformed to upper case).
  3. Rewrite the computation of GC content we did before so that lower case inputs are handled just like their upper case equivalents and spaces are ignored. The solution should not contain a for loop anymore.
  4. Write a program which asks the user for a nucleotide sequence and checks if the sequence only contains the valid symbols T, C, A and G. Finally the program prints an appropriate message. (Hint: count the number of A, T, G, and C symbols in the given sequence. For a correct sequence the sum of these counts is the same as the length of the sequence.)
  5. Extend 3: Ask the user for a nucleotide sequence and print the relative GC content. If the user input is invalid (as checked in the preceding exercise), an appropriate message should be printed first and then the user is asked again. (Tip: infinite loop)

More about string indexing

Remember: In Python we can access single characters of a string using square brackets. The notation [i] where i is zero or a positive integer number (called index) extracts the character at position i.

Indexing starts with zero, so the first character is accessed with [0], the second character with [1] and so on !

print("abc"[0])
a
name = "uwe"
print(name[1])
w

Negative indices count from the end of the string:

print("uwe"[-1])
print("uwe"[-2])
print("uwe"[-3])
e
w
u
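
A negative index -k refers to the same character as the index len(name) - k. A short sketch to check this, using the string from above:

```python
name = "uwe"

# -1 is the last character, which has index len(name) - 1
print(name[-1] == name[len(name) - 1])

# -3 is the first character for a string of length 3
print(name[-3] == name[0])
```

Both lines print True.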

We cannot use the bracket notation to replace a given character, because a string can not be modified after it was created:

seq = "TGCAG"
seq[2] = "?"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-3e828846b3b2> in <module>()
      1 seq = "TGCAG"
----> 2 seq[2] = "?"

TypeError: 'str' object does not support item assignment

To work around this we need so-called slicing (like "slicing bread").

The general form of slicing is [n:m] which computes a substring starting at index n up to m (exclusive !):

print("012345"[1:4])
123
print("012345"[2:-1])
234

Python knows two abbreviations [:n] and [m:]. The first one starts at the beginning, the second one goes until the end:

print("012345"[:3])
012
print("012345"[4:])
45
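
Because the end index is exclusive, the two abbreviations fit together without overlap: for any position n the pieces [:n] and [n:] add up to the original string. A minimal sketch, with digits and n chosen only for illustration:

```python
digits = "012345"
n = 2

head = digits[:n]   # characters at index 0 and 1
tail = digits[n:]   # characters from index 2 to the end

print(head)
print(tail)
print(head + tail == digits)
```

This prints 01, 2345 and True.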

We can use this to replace a character of a given string by computing a new string:

seq = "TGCAG"
seq_new = seq[:2] + "?" + seq[3:]
print(seq)
print(seq_new)
TGCAG
TG?AG
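
The same idea works for any position if we keep the index in a variable. The following sketch replaces the character at index i. The names follow the example above and the chosen index is arbitrary.

```python
seq = "TGCAG"
i = 3

# everything before index i, the new character, everything after index i
seq_new = seq[:i] + "?" + seq[i + 1:]

print(seq_new)
print(len(seq) == len(seq_new))
```

This prints TGC?G and True. The original string seq is unchanged.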

The following program checks if a given string is a palindrome (that is, if it reads the same forwards and backwards):

txt = "racecar"

found_invalid_pair = False

for i in range(len(txt)):
    i_back = len(txt) - i - 1
    if txt[i] != txt[i_back]:
        found_invalid_pair = True
        break

if found_invalid_pair:
    print(txt, "is not a palindrome")
else:
    print(txt, "is a palindrome")
racecar is a palindrome
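
To see which positions the loop compares with each other, we can print the index pairs. This is only a sketch for illustration, using a shorter input string. It does not perform the palindrome check itself.

```python
txt = "ABCBA"

# show which positions the loop compares with each other
for i in range(len(txt)):
    i_back = len(txt) - i - 1
    print(i, i_back, txt[i], txt[i_back])
```

The first line of the output is 0 4 A A and the last line is 4 0 A A, so i runs from the front while i_back runs from the end of the string.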

Check question

What does the following statement display ? First use pen and paper then use Python to check your result:

text = "abcdefghijk"
print(text[:2] + text[3:4] < text[0:2] + text[3:len(text)].upper())
False

Exercise block 3

  1. Repeat the examples above

  2. Try to understand the palindrome check. It helps to simulate the computer using pen and paper and running the palindrome check for inputs "ABCBA" and "ABCDA".

  3. Why does the program still work without the break?

  4. Implement an alternative solution by first computing the reverse of the given string, then use == to check if both are the same.

  5. Use a for loop to simulate the count method: The user provides a string and your program counts the number of spaces in the string. (You need a variable for counting spaces. Initialize it with 0 and increment it for every hit).

  6. Write a program which asks the user for a valid nucleotide sequence and prints all positions of G followed by C. The output for input AGCCCGCAGC should be similar to

    found GC starting at position 1
    found GC starting at position 5
    found GC starting at position 8
    
    

    Hint: you have to check for every position if the character at that position is G and if the following character is C. Use a for loop but pay attention with the upper limit of the range function !

  7. (optional) Write Python code which computes the reverse complement of a DNA sequence. Hint: you need if and friends. (Look up the definition of "reverse complement" if you don't know what this means).

Repetition exercise

  • Implement the number guessing game (only one round is played)
  • Extend it to play until the user does not want to continue.
  • Determine the starting value which produces the longest Collatz sequence for starting values in 1 ... 100. (The Collatz update rule for a given n was to compute n//2 for even numbers and 3 * n + 1 for odd numbers until we reach the sentinel 1).

(Optional chapter): ASCII codes and simple encryption

Internally the computer stores characters as numbers. So A is stored as 65 and B as 66. For the codes in the range 0 to 127 these numbers are called ASCII codes (ASCII is an acronym for "American Standard Code for Information Interchange"):

  32    |    44 ,  |    56 8  |    68 D  |    80 P  |    92 \  |   104 h  |   116 t  
  33 !  |    45 -  |    57 9  |    69 E  |    81 Q  |    93 ]  |   105 i  |   117 u  
  34 "  |    46 .  |    58 :  |    70 F  |    82 R  |    94 ^  |   106 j  |   118 v  
  35 #  |    47 /  |    59 ;  |    71 G  |    83 S  |    95 _  |   107 k  |   119 w  
  36 $  |    48 0  |    60 <  |    72 H  |    84 T  |    96 `  |   108 l  |   120 x  
  37 %  |    49 1  |    61 =  |    73 I  |    85 U  |    97 a  |   109 m  |   121 y  
  38 &  |    50 2  |    62 >  |    74 J  |    86 V  |    98 b  |   110 n  |   122 z  
  39 '  |    51 3  |    63 ?  |    75 K  |    87 W  |    99 c  |   111 o  |   123 {  
  40 (  |    52 4  |    64 @  |    76 L  |    88 X  |   100 d  |   112 p  |   124 |  
  41 )  |    53 5  |    65 A  |    77 M  |    89 Y  |   101 e  |   113 q  |   125 }  
  42 *  |    54 6  |    66 B  |    78 N  |    90 Z  |   102 f  |   114 r  |   126 ~  
  43 +  |    55 7  |    67 C  |    79 O  |    91 [  |   103 g  |   115 s  |   127    

Comment: Do you remember how strings are ordered in Python ? A is considered to be smaller than a because the corresponding number codes have this ordering !

Python provides two functions for conversion from number to character and vice versa:

ord computes the ASCII code from a given character:

print(ord("A"))
65

And chr computes a character from a given code:

print(chr(66))
B
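
ord and chr are inverse to each other, and the ordering of single characters matches the ordering of their codes. A small sketch to check both statements:

```python
# converting to a code and back gives the original character
print(chr(ord("G")))

# "A" < "a" because 65 < 97
print(ord("A"), ord("a"))
print(("A" < "a") == (ord("A") < ord("a")))
```

This prints G, then 65 97, then True.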

This can be used to transform strings. The following example transforms letters in a string to their uppercase equivalents:

name = "uwe"

new_name = ""

for i in range(len(name)):
    c = name[i]
    code = ord(c)
    if 97 <= code <= 122:
        code = code - 32
    new_name = new_name + chr(code)
    
print(new_name)
UWE

This is an educational example. Usually we would use the .upper method we learned already.

Again we see the pattern: we start with an empty string and assemble the result during the iterations.
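
The same pattern can assemble other results. As an illustration, the following sketch builds a string listing the codes of all characters. The variable codes is made up for this example, and str is used for the type conversion from number to string.

```python
name = "uwe"

codes = ""

for i in range(len(name)):
    # append the code of the current character followed by a space
    codes = codes + str(ord(name[i])) + " "

print(codes)
```

This prints 117 119 101 (followed by one trailing space), which matches the entries for u, w and e in the table above.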

Optional exercise block 4

  1. Repeat the example above
  2. Modify it to do the inverse transformation from upper case to lower case characters.
  3. Implement the so-called rot+1 transform, which takes a string and transforms A to B, B to C, ... Z to A. So the rot+1 transformation of BABEL is CBCFM. This is a very simple encryption method. Hint: First write a script transforming a single character; you need to handle the shift Z to A separately.
  4. Implement the inverse transform.
  5. Implement the rot+13 transformation (shift ASCII codes by 13) and its inverse. (Check: rot+13 encryption of ANMZ is NAZM). Comment: these rot+n transformations are also called Caesar ciphers (https://en.wikipedia.org/wiki/Caesar_cipher)

The find method of strings

The find method looks for occurrences of a given string in another string:

print("abcdef".find("cde"))
2

So the string "cde" appears in "abcdef" at index 2. For multiple matches only the position of the first occurrence is returned.

print("abcdefabcdef".find("def"))
3

The value -1 indicates a missing match:

print("abcdef".find("xzy"))
-1
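
Since -1 is never a valid match position, we can use it for branching. A minimal sketch, where the values of sequence and pattern are chosen only for illustration:

```python
sequence = "GCTGGCAGTC"
pattern = "CAG"

position = sequence.find(pattern)

if position == -1:
    print(pattern, "not found")
else:
    print(pattern, "found at index", position)
```

For these values the program prints CAG found at index 5.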

To find all occurrences we make use of an extra feature of find: we can provide the starting position to look for a match:

print("abcdefabcdef".find("def", 4))
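9

Repeating this search in a while loop, each time starting one position after the previous match, reports all occurrences:
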
sequence = "GCTGGCAGTCATGCCAACGGGCATGC"
pattern = "GC"

position = sequence.find(pattern)
while position > -1:
    print(position)
    position = sequence.find(pattern, position + 1)
0
4
12
20
24
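
Because the search restarts at position + 1, this loop also reports matches which overlap, whereas the count method only counts non-overlapping occurrences. The following sketch shows the difference. The string "AAAA" and the counter n_matches are chosen only for illustration.

```python
sequence = "AAAA"
pattern = "AA"

n_matches = 0
position = sequence.find(pattern)
while position > -1:
    print(position)
    n_matches = n_matches + 1
    position = sequence.find(pattern, position + 1)

print(n_matches)                 # overlapping matches at 0, 1 and 2
print(sequence.count(pattern))   # non-overlapping occurrences only
```

The loop finds 3 matches, while count returns 2.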

Exercise block 5

  1. Repeat and type the examples above
  2. Try to understand how the last example works
  3. Rewrite the last example to use an infinite while loop instead. The following sketch might help:

    while True:
       # look for occurrence at a given starting position
       # if not found: stop looping
       # else: report match and update starting position

Most useful string methods

  • .count(substring) counts non-overlapping occurrences of substring
  • .replace(a_string, b_string) replaces all occurrences of a_string by b_string
  • .lower() and .upper() convert characters to lower and upper case, respectively
  • .strip() removes all white-spaces (space, tab and new line characters) from both ends of the string
  • .strip(characters) removes all single characters occurring in characters from both ends of the string.
  • .lstrip() as .strip() but only from the beginning of the string
  • .rstrip() as .strip() but only from the end of the string
  • .startswith(txt) checks if the given string starts with txt
  • .endswith(txt) checks if the given string ends with txt.

Examples

"ABABAB".count("AB")
3
"ABABABA".replace("AB", "x")
'xxxA'
"abAB".lower()
'abab'
"abAB".upper()
'ABAB'
" abcd cde\n ".strip()
'abcd cde'
" abcd cde\n ".lstrip()
'abcd cde\n '
"ABCAxBCBA".rstrip("ABC")
'ABCAx'
"my name".startswith("my")
True
"my name".endswith("uwe")
False
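
The argument of strip, lstrip and rstrip is treated as a collection of single characters, not as a word to be removed. A short sketch to illustrate this, with made up example strings:

```python
# every leading and trailing "x" or "y" is removed, in any order
print("xyxhixyyx".strip("xy"))

# stripping stops at the first character which is not in "xy"
print("xhxix".strip("xy"))

# strip never removes characters from the middle
print("  a b  ".strip())
```

This prints hi, hxi and a b.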