Example solutions for script 06_introduction_to_files

Exercise 2.3

You will see a long string of mysterious symbols because every file is a sequence of symbols, regardless of whether it is a Word document, an Excel sheet or a Python program. It is up to the associated application to interpret those symbols when opening the file. For example, a Word document contains much more than the typed text: it also holds formatting and structure information, which a plain text file containing only the text could not represent.
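To make the point about "symbols" concrete, here is a minimal sketch that writes a tiny text file and reads it back in binary mode. The file name demo.txt and the temporary folder are assumptions made for this illustration only; they are not part of the exercise.

```python
import os
import tempfile

# Hypothetical demo file in a temporary folder (not part of the exercise).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Write four characters as plain text.
with open(demo_path, "w", encoding="utf-8") as fh:
    fh.write("ACGT")

# Mode "rb" returns the raw bytes without interpreting them as text.
with open(demo_path, "rb") as fh:
    raw = fh.read()

print(raw)        # the bytes exactly as stored in the file
print(list(raw))  # the same bytes shown as numbers between 0 and 255
```

Opening a Word document this way shows the same kind of raw data, only much more of it, because the formatting and structure information is stored next to the text.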

Exercise 3.2

New and not introduced in the script: x += y is the same as x = x + y, so it increments x by y:

acc = 0

fh = open("numbers.txt", "r")
for line in fh:
    acc += int(line.rstrip())

fh.close()

print(acc)

15

Exercise 3.3

fh = open("short.fasta", "r")
for line in fh:
    if line[0] == ">":
        print(line.rstrip())
fh.close()

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765656|emb|Z78531.1|CFZ78531 C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765655|emb|Z78530.1|CMZ78530 C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765654|emb|Z78529.1|CLZ78529 C.lichiangense 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765652|emb|Z78527.1|CYZ78527 C.yatabeanum 5.8S rRNA gene and ITS1 and ITS2 DNA

Exercise 4.2

acc = 0

with open("numbers.txt", "r") as fh:
    for line in fh:
        acc += int(line.rstrip())

print(acc)

15

with open("short.fasta", "r") as fh:
    for line in fh:
        if line[0] == ">":
            print(line.rstrip())

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765656|emb|Z78531.1|CFZ78531 C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765655|emb|Z78530.1|CMZ78530 C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765654|emb|Z78529.1|CLZ78529 C.lichiangense 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765652|emb|Z78527.1|CYZ78527 C.yatabeanum 5.8S rRNA gene and ITS1 and ITS2 DNA

Exercise 4.3

with open("status_lines.txt", "w") as fh_out:
    with open("short.fasta", "r") as fh_in:
        for line in fh_in:
            if line[0] == ">":
                print(line.rstrip(), file=fh_out)

Exercise 5.1

If you want to understand how a program works, or why it does not work as intended, you can trace the flow of execution and the current state of the variables by inserting appropriate print function calls:

with open("short.fasta", "r") as fh:
    for line in fh:
        line = line.rstrip()
      
        print()
        print("line:", line)
        
        if len(line) > 0 and line[0] == ">":
            last_status = line
            count = 0
        elif line == "":
            print("COUNT:", count, last_status)
        else:
            count += len(line)
        print("last_status:", last_status)
        print("count:", count)

I omitted the long output.

The plain print() creates an empty line, which improves readability. I also marked the "regular" output with COUNT: to distinguish it from the other lines.

Exercise 5.3 (includes solution for 5.2)

CSV stands for "comma-separated values". It is a human-readable text format that can be imported into and exported from Excel and other spreadsheet programs.

with open("fasta_stat.csv", "w") as fh_csv:
    
    # write header
    print("description, symbol_count, gc content", file=fh_csv)
    
    with open("short.fasta", "r") as fh_fasta:
        
        for line in fh_fasta:
        
            line = line.rstrip()
            if len(line) > 0 and line[0] == ">":
                last_status = line
                count = 0
                gc_count = 0
            elif line == "":
                relative_gc_count = gc_count / count
                
                # for checking if output is as intended
                print(last_status, ",", count, ",", relative_gc_count)
                # write one line to csv file
                print(last_status, ",", count, ",", relative_gc_count, file=fh_csv)
            else:
                count += len(line)
                gc_count += line.count("G") + line.count("C")

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA , 740 , 0.595945945945946
>gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA , 753 , 0.4847277556440903
>gi|2765656|emb|Z78531.1|CFZ78531 C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA , 748 , 0.570855614973262
>gi|2765655|emb|Z78530.1|CMZ78530 C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 DNA , 744 , 0.47580645161290325
>gi|2765654|emb|Z78529.1|CLZ78529 C.lichiangense 5.8S rRNA gene and ITS1 and ITS2 DNA , 733 , 0.47885402455661663
>gi|2765652|emb|Z78527.1|CYZ78527 C.yatabeanum 5.8S rRNA gene and ITS1 and ITS2 DNA , 718 , 0.5069637883008357
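The solution above builds each CSV line by hand with print. This works for the given data but would break if a description contained a comma. As a hedged alternative, the following sketch uses the csv module from Python's standard library, which quotes such fields automatically. The rows and the in-memory buffer are made up for illustration and do not come from short.fasta.

```python
import csv
import io

# Hypothetical rows: (description, symbol count, relative GC content).
rows = [
    (">seq1 example, with a comma", 8, 0.5),
    (">seq2 another example", 4, 0.25),
]

# An in-memory text buffer stands in for a real file handle here.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["description", "symbol_count", "gc_content"])
writer.writerows(rows)

print(buffer.getvalue())

# Read the text back to check that the comma inside the description survived.
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed[1])
```

When writing to a real file instead of the buffer, the csv documentation recommends opening it with newline="" so that line endings are not translated twice.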

Exercise 5.4

First we create a suitable FASTA file without the empty lines:

with open("short_2.fasta", "w") as fh_out:
    with open("short.fasta", "r") as fh_in:
        for line in fh_in:
            line = line.rstrip()
            if line != "":
                print(line, file=fh_out)

This was the previous solution:

with open("short.fasta", "r") as fh:
    for line in fh:
        line = line.rstrip()
        if len(line) > 0 and line[0] == ">":
            last_status = line
            count = 0
        elif line == "":
            print(count, last_status)
        else:
            count += len(line)

740 >gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
753 >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
748 >gi|2765656|emb|Z78531.1|CFZ78531 C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA
744 >gi|2765655|emb|Z78530.1|CMZ78530 C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 DNA
733 >gi|2765654|emb|Z78529.1|CLZ78529 C.lichiangense 5.8S rRNA gene and ITS1 and ITS2 DNA
718 >gi|2765652|emb|Z78527.1|CYZ78527 C.yatabeanum 5.8S rRNA gene and ITS1 and ITS2 DNA

For the new file we could print count and last_status in the code block after if len(line) > 0 and line[0] == ">":, because every status line except the first one marks the end of the previous sequence. For the very first status line, however, count and last_status are not set yet.

To handle this we initialize count with the indicator value -1. It is set to 0 once the first sequence description has been handled:

# preliminary but partially incorrect solution !

count = -1

with open("short_2.fasta", "r") as fh:
    for line in fh:
        line = line.rstrip()
        if len(line) > 0 and line[0] == ">":
            if count >= 0:   # this is False for the first status line, but True for all others
                print(count, last_status)
            last_status = line
            count = 0
        else:
            count += len(line)

740 >gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
753 >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
748 >gi|2765656|emb|Z78531.1|CFZ78531 C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA
744 >gi|2765655|emb|Z78530.1|CMZ78530 C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 DNA
733 >gi|2765654|emb|Z78529.1|CLZ78529 C.lichiangense 5.8S rRNA gene and ITS1 and ITS2 DNA

This is still not correct because the information about the last sequence is missing. As the last line of the file is not a status line, we have to handle this case after the loop, at the end of the script:

count = -1

with open("short_2.fasta", "r") as fh:
    for line in fh:
        line = line.rstrip()
        if len(line) > 0 and line[0] == ">":
            if count >= 0:   # this is False for the first status line, but True for all others
                print(count, last_status)
            last_status = line
            count = 0
        else:
            count += len(line)
            
print(count, last_status)

740 >gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
753 >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
748 >gi|2765656|emb|Z78531.1|CFZ78531 C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA
744 >gi|2765655|emb|Z78530.1|CMZ78530 C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 DNA
733 >gi|2765654|emb|Z78529.1|CLZ78529 C.lichiangense 5.8S rRNA gene and ITS1 and ITS2 DNA
718 >gi|2765652|emb|Z78527.1|CYZ78527 C.yatabeanum 5.8S rRNA gene and ITS1 and ITS2 DNA

Alternative solutions would be:

Use an extra variable like first_status_line_seen which is initialized with False and set to True after the first status line was read. Then call print(count, last_status) only if this value is True.
Use a counting variable to track the number of status lines read and decide whether to call print depending on this value.
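The first alternative could look roughly like the following sketch. To keep it self-contained it iterates over a small list of made-up lines instead of reading short_2.fasta, and it collects the results in a list named results before printing them. Both the data and that list are assumptions for illustration only.

```python
# Hypothetical miniature FASTA content without empty lines.
fasta_lines = [
    ">seq1 first example",
    "ACGT",
    "AC",
    ">seq2 second example",
    "GGG",
]

first_status_line_seen = False
count = 0
last_status = ""
results = []

for line in fasta_lines:
    line = line.rstrip()
    if len(line) > 0 and line[0] == ">":
        # Report the previous sequence, but only if there was one.
        if first_status_line_seen:
            results.append((count, last_status))
        first_status_line_seen = True
        last_status = line
        count = 0
    else:
        count += len(line)

# The last sequence is not followed by a status line, so report it here.
if first_status_line_seen:
    results.append((count, last_status))

for seq_count, status in results:
    print(seq_count, status)
```

Compared to the indicator value -1, the boolean flag keeps count a pure counter, which may be easier to read.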