Script 6

Recap from script 5

  • Python provides four alternative string delimiters.
  • The special character \n is called the "line break" character and creates a line break when printed.
  • The repr function helps us to examine the exact content of a string.
  • We introduced some string methods: .count, .upper and .replace.
  • We learned about using negative indexes and slices to access single characters or parts of a string.

Recap "attributes"

  • When we see an expression such as x.y we say y is an attribute of x.
  • After import math we can use sin or pi as attributes of math.
  • String methods are attributes of strings (e.g. "hi".upper()).
import math
print(math.sin(math.e))
0.4107812905029088
print("hey joe".replace(" ", "-"))
hey-joe

About Files

You can imagine a file on disk as a string: all files are sequences of single symbols. Even complex files such as Word documents consist of a sequence of single characters.

If we want to access a file we first have to "open" it. The open function accepts two string arguments: the first one is the name of the file, the second is the so-called "access mode":

  • the access mode "r" opens a file for reading,
  • "w" opens a file for writing. If the file already exists, its previous content is erased first!
  • "a" opens a file for appending. If the file does not exist yet, it is created. If it exists, writing to the file appends new content at the end (we will not use "a" in the exercises).
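The difference between "w" and "a" can be tried out with a minimal sketch like the following. The file name modes_demo.txt is only an assumption for this illustration, and the methods write, read and close used here are explained further below.

```python
# "modes_demo.txt" is a hypothetical file name used only for this sketch.
fh = open("modes_demo.txt", "w")   # creates the file (or empties an existing one)
fh.write("first\n")
fh.close()

fh = open("modes_demo.txt", "a")   # appends at the end, keeps the existing content
fh.write("second\n")
fh.close()

fh = open("modes_demo.txt", "r")
after_append = fh.read()
fh.close()
print(repr(after_append))

fh = open("modes_demo.txt", "w")   # "w" again: the previous content is erased
fh.write("third\n")
fh.close()

fh = open("modes_demo.txt", "r")
after_rewrite = fh.read()
fh.close()
print(repr(after_rewrite))
```

The first print should show both lines, the second one only the line written last.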

The return value of open is a so-called file handle. A file handle

  • is a value with its own data type, just like strings, numbers or logical values.
  • serves as a placeholder to operate on the file.
  • provides the operations on the file as attributes (methods).
  • keeps internal data structures for handling the file.
fh = open("test.txt", "w")
print(type(fh))

fh.write("hi")
fh.write("you")
fh.close()
<class '_io.TextIOWrapper'>

Exercise 1

Type and run the previous example; we explain the details below. Afterwards you should see a new file in PyCharm next to your script. You can open it in PyCharm with a double click.

Explanations:

  • in the output above we see the type of the variable fh
  • we call the method write to write a string to the file
  • we call close to finalize our operations on the file.

fh is just an ordinary variable name; you may choose any other name you like.

Closing a file is important: if you forget to close a file, content you have written might be missing or incomplete.

After you close a file, further operations on it (such as another call of write) are not allowed:

fh = open("test.txt", "w")
fh.write("hi")
fh.close()
fh.write("you")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-33948b811b05> in <module>()
      2 fh.write("hi")
      3 fh.close()
----> 4 fh.write("you")

ValueError: I/O operation on closed file.

To read the full content of a file we use the read method:

fh_in = open("test.txt", "r")
content = fh_in.read()
print(content)
fh_in.close()
hi

A more convenient way to write to a file is a variant of print. The extra argument file=fh in the following example redirects the output to the given file:

fh = open("numbers.txt", "w")
for number in range(1, 6):
    print(number, file=fh)
fh.close()

file= is a keyword argument and must appear after the values to print in print(...). The variable name fh can be arbitrary, but must refer to a file opened for writing.
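The following sketch illustrates the position of file= among the arguments of print. It assumes that you know sep=, another keyword argument of print which sets the separator between the printed values; the file name keyword_demo.txt is made up for this example. Keyword arguments have to follow the values to print, but their order among each other does not matter.

```python
# "keyword_demo.txt" is a hypothetical file name used only for this sketch.
fh = open("keyword_demo.txt", "w")
print(1, 2, 3, sep="-", file=fh)   # keyword arguments in one order ...
print(4, 5, 6, file=fh, sep="-")   # ... or the other, both work
fh.close()

fh = open("keyword_demo.txt", "r")
content = fh.read()
fh.close()
print(repr(content))
```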

This works for all variants of print, for example:

fh = open("square_numbers.txt", "w")
for number in range(1, 6):
    print(number, "squared is", number ** 2, file=fh)
fh.close()

Exercise block 2

  1. Repeat the examples above
  2. Use read and repr to examine the content of "numbers.txt".
  3. Copy a small Word or Excel file to your project folder, use open with mode "rb" (not introduced before) and then read to display the content of the file as a string.

Using for to read from a file

When we write for x in range(10) we say "we iterate over range(10)". Python is very flexible in this respect, and there are other objects we can iterate over.

For example, we can iterate over the characters of a string. Instead of

txt = "abc"
for i in range(len(txt)):
    print(txt[i])
a
b
c

we can write:

txt = "abc"
for char in txt:
    print(char)
a
b
c

Objects we can iterate over with for are called iterables. Besides range and str objects, the file handle we introduced above is another iterable!

If we loop over a file handle we iterate over the lines of a file:

fh = open("numbers.txt", "r")
for line in fh:
    print(line)
fh.close()
1

2

3

4

5

If you wonder why we see the empty lines in the output you can modify the snippet to use repr which provides details about the actual content of the lines:

fh = open("numbers.txt", "r")
for line in fh:
    print(repr(line))
fh.close()
'1\n'
'2\n'
'3\n'
'4\n'
'5\n'

So you can see that if we iterate over the lines of a file with for, we get the full lines including the line breaks! print then adds a second line break, which explains the empty lines in the output. We can get rid of trailing line breaks and spaces using the .rstrip method of strings:

fh = open("numbers.txt", "r")
for line in fh:
    print(line.rstrip())
fh.close()
1
2
3
4
5
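To see what exactly .rstrip removes, here is a small sketch on a made-up string. It also uses .strip and .rstrip with an argument, which are not needed for the exercises and are shown only for comparison.

```python
line = "  42 \n"                        # made-up line with spaces and a line break

stripped_right = line.rstrip()          # removes all trailing whitespace
only_line_break = line.rstrip("\n")     # removes only trailing line breaks
stripped_both = line.strip()            # removes whitespace on both sides

print(repr(stripped_right))
print(repr(only_line_break))
print(repr(stripped_both))
```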

Using print(..., file=...) instead of write has some advantages:

  • print can write values of different types to a file; write only accepts strings.
  • if you use write you have to include the line break character \n yourself; print does this for you.

We introduced read and write for didactic reasons; in practice print and the for-based approach are more powerful and easier to use.
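As a minimal sketch of this comparison, the following snippet writes the same numbers once with write and once with print. The file names write_demo.txt and print_demo.txt are assumptions made for this illustration.

```python
# Both file names are hypothetical and used only for this sketch.
fh = open("write_demo.txt", "w")
for number in range(1, 4):
    fh.write(str(number) + "\n")   # convert to str and add the line break ourselves
fh.close()

fh = open("print_demo.txt", "w")
for number in range(1, 4):
    print(number, file=fh)         # print converts and appends "\n" for us
fh.close()

fh = open("write_demo.txt", "r")
written = fh.read()
fh.close()

fh = open("print_demo.txt", "r")
printed = fh.read()
fh.close()

print(written == printed)          # both files have the same content
```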

Exercise block 3

  1. Repeat the examples above
  2. Write a script which reads from "numbers.txt" and computes the sum of the given numbers.
  3. Download http://siscourses.ethz.ch/python_dbiol/data/short.fasta and save it in the folder of your Python scripts. Open it to see how the file is structured, then write a script which iterates over the file and displays only lines which start with >. Suppress empty lines in the output ! Lines starting with > are called status or description lines.

Repetition exercise

  • What is the result of "abcde".find("bc") and "abcde".find("gx")? Try to predict the result before you check it with Python.
  • Use the find method and while to find all positions of GC in the sequence GCTGGCAGTCATGCCAACGGGCATGC

Transforming files

We can work with several files at the same time. The following script iterates over the numbers in numbers.txt and writes the squares of the numbers to a new file:

fh_in = open("numbers.txt", "r")
fh_out = open("squared.txt", "w")

for line in fh_in:
    current_number = int(line.rstrip())
    print(current_number ** 2, file=fh_out)
    
fh_in.close()
fh_out.close()
            
fh = open("squared.txt", "r")
for line in fh:
    print(line.rstrip())
fh.close()
1
4
9
16
25

Explanations:

  • we open one file in reading mode, the other in writing mode
  • we use two different file handles fh_in and fh_out for this. As mentioned before, it is up to you to choose meaningful and descriptive variable names.
  • then we read line by line from the input file and write the squared value to the new file we opened in writing mode
  • finally we close both files
  • to check the new file we open it again and display line by line

A more robust method to work with files

Since version 2.5 Python provides an alternative way to work with files which prevents us from forgetting to close the file. The following snippet rewrites an earlier reading example with the new syntax:

with open("numbers.txt", "r") as fh:
    for line in fh:
        print(repr(line.rstrip()))
'1'
'2'
'3'
'4'
'5'

The with statement "protects" the following code block (here the block has two lines): as soon as the execution of the code block ends, even because of an error, Python takes care to close the file automatically. This is why you do not see a fh.close() call anymore.
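You can check that with closes the file by looking at the closed attribute of the file handle. This is a minimal sketch with a made-up file name with_demo.txt; the attribute closed is not used elsewhere in this script.

```python
# "with_demo.txt" is a hypothetical file name used only for this sketch.
with open("with_demo.txt", "w") as fh:
    print("hi", file=fh)
    closed_inside = fh.closed      # False: the file is still open here

closed_after = fh.closed           # True: with has closed the file for us
print(closed_inside, closed_after)
```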

Using with is highly recommended. We introduced the other method for didactic reasons, and if you read other people's code you may still find the outdated approach.

If we want to work with two open files at the same time we can nest with statements: the first with protects the following four lines, the second with protects the following three lines:

with open("numbers.txt", "r") as fh_in:
    with open("squared.txt", "w") as fh_out:
        for line in fh_in:
            current_number = int(line.rstrip())
            print(current_number ** 2, file=fh_out) 

Another option is to list several open calls after with, separated by commas:

with open("numbers.txt", "r") as fh_in, open("squared.txt", "w") as fh_out:
    for line in fh_in:
        current_number = int(line.rstrip())
        print(current_number ** 2, file=fh_out) 

A quick and dirty check of the result file (dirty, because the file is never closed explicitly):

print(open("squared.txt", "r").read())
1
4
9
16
25

Exercise block 4

  1. Repeat and reproduce the previous examples and explanations.
  2. Rewrite the solutions from previous exercises to use the with approach.
  3. Write a snippet which iterates over a FASTA file and writes a new file only containing the status lines.

A more advanced example

The FASTA file we downloaded in exercise block 3 has the nice property that it contains a blank line after every sequence. This is not always the case for FASTA files, but it helps us here to implement code which displays for every sequence the overall length of the sequence followed by the corresponding status line.

Open the FASTA file and inspect it before you continue !

The strategy is as follows: We iterate over the lines using for and:

  • if we see a status-line we store this line in a variable and set the counter for the symbols to zero.
  • if we see a blank line we display the variable keeping the last status line and the computed count of symbols
  • all other lines are sequences, for these lines we only update the counter.

Open the FASTA file and read this strategy again !

with open("short.fasta", "r") as fh:
    for line in fh:
        line = line.rstrip()
        
        # line might be empty after rstrip, in this case line[0] would be an error:
        if len(line) > 0 and line[0] == ">":
            last_status = line
            count = 0
        
        elif line == "":
            print("symbol count:", count, "in", last_status)
        
        else:
            # the current line is neither a status line nor empty,
            # thus line must be part of the current sequence:
            count = count + len(line)
If short.fasta is missing from the folder of your script, open fails with:

FileNotFoundError: [Errno 2] No such file or directory: 'short.fasta'

The if, elif and else correspond one-to-one to the three items in the list above describing the strategy!

Exercise block 5

  1. Reproduce the script and try to understand it. (It helps to track the execution of the script for the first three sequences in the FASTA file, check it for the last sequence in the file as well, add print statements to show the values of the variables in every iteration).
  2. Modify it so that the results are not displayed but written to a separate csv file with two columns for counts and status lines.
  3. Additionally compute the average GC content of every sequence and write this to the csv file as an additional column.
  4. (optional) Rewrite the example script above so that it works even if there are no blank lines after the sequences ! First organize or create an appropriate but small FASTA file, then develop a strategy using pen and paper before you start to type Python code.

About file paths

If you want to access (read or write) a file at a different place than next to your Python script you will have to provide a so-called path to open. This is a string describing the location of a file on your computer. You have to follow the file system hierarchy folder by folder as seen in the examples below.

Example: you want to write a file data.txt to the sub-folder Documents in your home folder.

For Windows:

Usually you have to navigate from the top of drive C: to the folder Users, then to the folder with your name and finally to the Documents folder. Within a Python string every backslash of the path is doubled, because a single \ introduces special characters such as \n. As a path this reads:

with open("C:\\Users\\uweschmitt\\Documents\\data.txt", "w") as fh:
    print("hi", file=fh)

For Mac OS the folder structure differs and the path is:

with open("/Users/uweschmitt/Documents/data.txt", "w") as fh:
    print("hi", file=fh)

And on Linux (it fails here, because I work with a Mac):

with open("/home/uweschmitt/Documents/data.txt", "w") as fh:
    print("hi", file=fh)
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-22-c2912f815d85> in <module>()
----> 1 with open("/home/uweschmitt/Documents/data.txt", "w") as fh:
      2     print("hi", file=fh)

FileNotFoundError: [Errno 2] No such file or directory: '/home/uweschmitt/Documents/data.txt'

So both the delimiter between the folders and the location of the home folder depend on the operating system.
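One way to avoid typing these differences by hand is the os.path module of Python's standard library. The following is only a sketch: it builds the path but does not open the file, and it assumes that a Documents folder exists in your home folder if you later pass the path to open.

```python
import os

home = os.path.expanduser("~")                      # home folder of the current user
path = os.path.join(home, "Documents", "data.txt")  # joins with the delimiter of your system
print(path)
```

The printed path differs between Windows, Mac OS and Linux, but the code stays the same.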

A very short and preliminary introduction to Python lists

Python provides helpful data types which collect data; these types are often called container types. list is one of them.

A list in Python starts with an opening [ and ends with a closing ]. The following example uses a list holding three values of type int, namely 1, 2 and 3:

data = [1, 2, 3]
print(data)
print(type(data))
[1, 2, 3]
<class 'list'>

Similar to using quotes for delimiting a string, square brackets are used to delimit a list. The elements of the list are separated by commas.

The types of the items in a list are arbitrary and can be mixed:

mixed_list = [1, "2", 3.14]

To compute the number of items in a list, we use the len function:

print(len(mixed_list))
3

The empty list is written []:

print(len([]))
0

To access elements in a list we use [] as we did to access characters in a string; again indexing starts with 0:

print([1, 2, 3][0])
print(mixed_list[2])
1
3.14

So we see the use of brackets in different situations:

  • to access characters of a string
  • to access elements of a list
  • to declare a list
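The three uses of brackets side by side, as a small sketch with made-up values. The last lines additionally assume one fact not shown above: a list is an iterable too, so for can loop over its elements.

```python
data = [10, 20, 30]        # brackets declare a list
first_item = data[0]       # brackets access an element of a list
first_char = "abc"[0]      # brackets access a character of a string
print(first_item, first_char)

total = 0
for item in data:          # a list is an iterable, like a string or a file handle
    total = total + item
print(total)
```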

Handling csv files with Python

Python ships with a module named csv which helps to read and write .csv files.

Why use this module?

  • The csv module is able to handle the variants (so-called "dialects") of this file format and also special cases, e.g. when the actual delimiter is part of a cell.
  • The csv module represents a row of a .csv file as a list containing the cell elements. This simplifies handling of .csv files.

Thus it is recommended to use this module instead of resorting to manual string handling as we did in exercise block 5.

Again we first have to import csv to access its attributes.

Writing to a csv file

To write to .csv files we use the writer function from the csv module:

  • The csv.writer function requires a file handle to a file opened in write mode
  • ... and returns a "csv writer object" which has the writerow method.
  • Similar to a file handle, this object represents the corresponding csv file, and all interactions with this file are executed using methods of this object:
import csv

with open("example.csv", "w", newline="") as fh:
    w = csv.writer(fh, delimiter=",")
    w.writerow(["a", "b", "c"])
    w.writerow([1, "2", ","])
    w.writerow([2, 3, 7])

About the previous example:

  1. We have to open the file with an extra argument newline="" which is required on Windows and does no harm on Mac OS or Linux.
  2. Then we create a special csv writing handle w by calling csv.writer(fh, delimiter=","). w is just a variable name and may be chosen freely.
  3. w.writerow accepts a single Python list representing a row; the types of the cells (list elements) are arbitrary.
  4. The second call of w.writerow shows why self-written csv handling code might fail: we have a cell containing the , delimiter as data.

Comment: if you look at the previous example you see that writerow accepts a list where the types of the values can be mixed. You also see that you only pass the cell contents to writerow; the writer takes care of the delimiter and of quoting.

Now we display the result from the previous script, you should see the csv file in the project explorer of PyCharm as well (if you repeat the example the output might slightly differ on your machine depending on your operating system):

with open("example.csv", "r", newline="") as fh:
    for line in fh:
        print(repr(line))
'a,b,c\r\n'
'1,2,","\r\n'
'2,3,7\r\n'

You can see that the cell with , is written as ",". This quoting follows the csv file format specification.

Reading a csv file

Reading from a csv file can be done by iterating with for over the reader object returned by csv.reader. In this case for iterates over the lines of the input file and transforms the cells of the current line into a list. So in every iteration you get a list of cell contents:

import csv

with open("example.csv", "r", newline="") as fh:
    for row in csv.reader(fh):
        print("current row as list is", row)
current row as list is ['a', 'b', 'c']
current row as list is ['1', '2', ',']
current row as list is ['2', '3', '7']

Comments:

  • the items in the row list are always strings, so if you want to compute with numbers you will need type conversion.
  • you can see that the , is retrieved correctly.
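The following sketch contrasts manual string handling with csv.reader for a row containing the delimiter as data. It assumes the string method split, which cuts a string at a given delimiter and returns a list of the parts; the file name quoting_demo.csv is made up for this example.

```python
import csv

# "quoting_demo.csv" is a hypothetical file name used only for this sketch.
with open("quoting_demo.csv", "w", newline="") as fh:
    csv.writer(fh).writerow([1, "2", ","])

# Manual handling: cut the raw line at every comma.
with open("quoting_demo.csv", "r", newline="") as fh:
    raw_line = fh.read().rstrip()
manual_cells = raw_line.split(",")
print(manual_cells)        # four parts, the quoted comma is torn apart

# The csv module understands the quoting and returns three cells.
with open("quoting_demo.csv", "r", newline="") as fh:
    for row in csv.reader(fh):
        csv_cells = row
print(csv_cells)
```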

Other delimiters than ,

Often .csv files use ; as delimiter, and .tsv files use tab characters. In this case you can specify the delimiter when calling csv.reader and csv.writer with the extra named parameter delimiter. For example:

import csv

with open("example2.csv", "w", newline="") as fh:
    w = csv.writer(fh, delimiter="\t")                 ### this is new !
    w.writerow(["a", "b", "c"])
    w.writerow([1, "2", ","])
    w.writerow([2, 3, 7])
    
with open("example2.csv", "r", newline="") as fh:
    for line in csv.reader(fh, delimiter="\t"):        ### this is new !
        print("current line as list is", line)
current line as list is ['a', 'b', 'c']
current line as list is ['1', '2', ',']
current line as list is ['2', '3', '7']

Comments:

  • look at the actual created file.
  • replace "\t" by ";", rerun the code and look at the created file

How to skip a line

To skip the header line we can use the next function. This function reads one row and returns it. A following for will start from the current row, not from the beginning:

import csv

with open("example2.csv", "r", newline="") as fh:
    r = csv.reader(fh, delimiter="\t")
    header = next(r)  # reads one row from the csv file
    print("header is", header)
    for line in r:
        a = line[0]
        b = line[1]
        print("a+b is", int(a) + int(b))
header is ['a', 'b', 'c']
a+b is 3
a+b is 5
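The next function is not specific to the csv module. As a sketch under the assumption of a plain text file with one header line (the file next_demo.txt is created here only for illustration), it also works directly on a file handle:

```python
# "next_demo.txt" is a hypothetical file name used only for this sketch.
with open("next_demo.txt", "w") as fh:
    print("value", file=fh)          # header line
    for number in range(1, 4):
        print(number, file=fh)

total = 0
with open("next_demo.txt", "r") as fh:
    header = next(fh).rstrip()       # reads and returns the first line
    for line in fh:                  # continues with the second line
        total = total + int(line.rstrip())

print("header is", header)
print("sum is", total)
```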

Exercise block 6

  1. Reproduce the previous examples for lists and csv handling
  2. Rewrite the solutions from 5.2 and 5.3 using the csv module.
  3. Download http://siscourses.ethz.ch/python_dbiol/data/amino_acids.csv and save it into your PyCharm project folder.
  4. Use the csv module to display the content of this file.
  5. Write a script which computes the average of the masses listed in the column "Monoisotopic".
  6. (optional) write a script which determines the amino acid having maximal mass.
  7. Write a script which asks the user for a one-letter code of an amino acid and prints the chemical formula and average mass for the given symbol. Hint: Iterate over the lines using the csv module, check for every row (=list) if the one-letter symbol matches and if this is the case extract the needed information from the current list (row) !
  8. Extend this to display an error message if the given one letter code was not found in the csv file.