Strings Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble
Strings
A string is a sequence of letters (called characters). In Python, strings start and end with single or double quotes.
    >>> "foo"
    'foo'
    >>> 'foo'
    'foo'
Defining strings
Each string is stored in the computer's memory as a list of characters.
    >>> myString = "GATTACA"
Accessing single characters
You can access individual characters by using indices in square brackets. Negative indices start at the end of the string and move left.
    >>> myString = "GATTACA"
    >>> myString[0]
    'G'
    >>> myString[1]
    'A'
    >>> myString[-1]
    'A'
    >>> myString[-2]
    'C'
    >>> myString[7]
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    IndexError: string index out of range
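One way to read negative indices: for a string of length n, index -k refers to the same character as index n - k. The sketch below checks this for every position. It is written for Python 3, where print is a function, and the variable names are illustrative rather than taken from the slides.

```python
my_string = "GATTACA"
n = len(my_string)

# Index -k refers to the same character as index n - k.
for k in range(1, n + 1):
    assert my_string[-k] == my_string[n - k]

last = my_string[-1]            # same as my_string[n - 1]
second_to_last = my_string[-2]  # same as my_string[n - 2]
print(last, second_to_last)     # A C
```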
Accessing substrings
    >>> myString = "GATTACA"
    >>> myString[1:3]
    'AT'
    >>> myString[:3]
    'GAT'
    >>> myString[4:]
    'ACA'
    >>> myString[3:5]
    'TA'
    >>> myString[:]
    'GATTACA'
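The slice examples follow one rule: a slice [start:stop] includes the character at start and excludes the one at stop. A short sketch in Python 3 that makes the rule explicit, and shows that a slice running past the end is cut short rather than raising an error (variable names are illustrative):

```python
my_string = "GATTACA"

# A slice [start:stop] includes start but excludes stop,
# so its length is stop - start.
middle = my_string[1:3]
assert middle == "AT" and len(middle) == 3 - 1

# Splitting at any position and rejoining gives back the original string.
assert my_string[:3] + my_string[3:] == my_string

# Unlike a single index, a slice that runs past the end does not raise
# an IndexError; it is simply cut short.
tail = my_string[4:100]
print(tail)  # ACA
```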
Special characters
The backslash is used to introduce a special character.
    >>> "He said, "Wow!""
      File "<stdin>", line 1
        "He said, "Wow!""
    SyntaxError: invalid syntax
    >>> "He said, 'Wow!'"
    "He said, 'Wow!'"
    >>> "He said, \"Wow!\""
    'He said, "Wow!"'

    Escape sequence    Meaning
    \\                 Backslash
    \'                 Single quote
    \"                 Double quote
    \n                 Newline
    \t                 Tab
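Each escape sequence is written with two characters in the source code but stands for a single character in the string. A minimal Python 3 sketch that checks this with len (the variable names are illustrative):

```python
quote = "He said, \"Wow!\""
same_quote = 'He said, "Wow!"'
assert quote == same_quote

# Each escape sequence stands for a single character.
assert len("\n") == 1
assert len("\t") == 1
assert len("\\") == 1

two_lines = "GATTACA\nACGT"
print(two_lines)       # prints GATTACA and ACGT on separate lines
print(len(two_lines))  # 12: seven letters, one newline, four letters
```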
More string functionality
    >>> len("GATTACA")          # Length
    7
    >>> "GAT" + "TACA"          # Concatenation
    'GATTACA'
    >>> "A" * 10                # Repeat
    'AAAAAAAAAA'
    >>> "GAT" in "GATTACA"      # Substring test
    True
    >>> "AGT" in "GATTACA"
    False
String methods
In Python, a method is a function that is defined with respect to a particular object. The syntax is object.method(parameters)
    >>> dna = "ACGT"
    >>> dna.find("T")
    3
String methods
    >>> "GATTACA".find("ATT")
    1
    >>> "GATTACA".count("T")
    2
    >>> "GATTACA".lower()
    'gattaca'
    >>> "gattaca".upper()
    'GATTACA'
    >>> "GATTACA".replace("G", "U")
    'UATTACA'
    >>> "GATTACA".replace("C", "U")
    'GATTAUA'
    >>> "GATTACA".replace("AT", "**")
    'G**TACA'
    >>> "GATTACA".startswith("G")
    True
    >>> "GATTACA".startswith("g")
    False
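Two details of these methods are easy to miss: find reports only the first match and returns -1 when there is none, while replace changes every occurrence. A short Python 3 sketch (variable names are illustrative):

```python
dna = "GATTACA"

first_a = dna.find("A")          # index of the first match only
missing = dna.find("X")          # -1 when the substring does not occur
all_as = dna.replace("A", "-")   # replace changes every occurrence

print(first_a)  # 1
print(missing)  # -1
print(all_as)   # G-TT-C-
```

Because -1 is also a valid index (the last character), it is worth checking the result of find before using it to index into the string.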
Strings are immutable
Strings cannot be modified; instead, create a new one.
    >>> s = "GATTACA"
    >>> s[3] = "C"
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: object doesn't support item assignment
    >>> s = s[:3] + "C" + s[4:]
    >>> s
    'GATCACA'
    >>> s = s.replace("G","U")
    >>> s
    'UATCACA'
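The slice-and-concatenate pattern above can be wrapped in a small function. This is a sketch in Python 3; replace_at is a hypothetical helper name, not a built-in string method, and it assumes a non-negative index that lies within the string.

```python
def replace_at(s, index, new_char):
    """Return a copy of s with the character at index replaced by new_char.

    Hypothetical helper; assumes 0 <= index < len(s).
    """
    return s[:index] + new_char + s[index + 1:]

s = "GATTACA"
t = replace_at(s, 3, "C")
print(t)  # GATCACA
print(s)  # GATTACA (the original string is unchanged)
```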
Strings are immutable
String methods do not modify the string; they return a new string.
    >>> sequence = "ACGT"
    >>> sequence.replace("A", "G")
    'GCGT'
    >>> print sequence
    ACGT

    >>> sequence = "ACGT"
    >>> new_sequence = sequence.replace("A", "G")
    >>> print new_sequence
    GCGT
String summary
Basic string operations:
    S = "AATTGG"    # assignment - or use single quotes ' '
    s1 + s2         # concatenate
    s2 * 3          # repeat string
    s2[i]           # index character at position 'i'
    s2[x:y]         # index a substring
    len(S)          # get length of string
    int(S)          # or use float(S); turn a string into an integer or floating point decimal
Methods:
    S.upper()
    S.lower()
    S.count(substring)
    S.replace(old,new)
    S.find(substring)
    S.startswith(substring), S.endswith(substring)
Printing:
    print var1,var2,var3        # print multiple variables
    print "text",var1,"text"    # print a combination of explicit text (strings) and variables
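The int and float conversions in the summary only work when the string actually contains a number. A minimal Python 3 sketch with made-up example values:

```python
count_text = "42"
length_text = "3.5"

count = int(count_text)       # the integer 42
length = float(length_text)   # the floating point number 3.5
print(count + 1)   # 43
print(length * 2)  # 7.0

# Text that is not a number raises ValueError.
try:
    int("AATTGG")
except ValueError:
    print("not a number")
```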
Sample problem #1
Write a program called dna2rna.py that reads a DNA sequence from the first command line argument, and then prints it as an RNA sequence. Make sure it works for both uppercase and lowercase input.
    > python dna2rna.py ACTCAGT
    ACUCAGU
    > python dna2rna.py actcagt
    acucagu
    > python dna2rna.py ACTCagt
    ACUCagu
First get it working just for uppercase letters.
Two solutions
    import sys
    sequence = sys.argv[1]
    new_sequence = sequence.replace("T", "U")
    newer_sequence = new_sequence.replace("t", "u")
    print newer_sequence

    import sys
    print sys.argv[1].replace("T", "U").replace("t", "u")
It is legal (but not always desirable) to chain together multiple methods on a single line.
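Either solution can also be wrapped in a function, which makes it easy to check against the three example inputs without using the command line. This is a sketch in Python 3; dna_to_rna is a hypothetical name, not part of the original solutions.

```python
def dna_to_rna(sequence):
    """Return sequence with every T replaced by U, preserving case."""
    return sequence.replace("T", "U").replace("t", "u")

for dna in ["ACTCAGT", "actcagt", "ACTCagt"]:
    print(dna_to_rna(dna))
# ACUCAGU
# acucagu
# ACUCagu
```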
Sample problem #2
Write a program get-codons.py that reads the first command line argument as a DNA sequence and prints the first three codons, one per line, in uppercase letters.
    > python get-codons.py TTGCAGTCG
    TTG
    CAG
    TCG
    > python get-codons.py TTGCAGTCGATC
    TTG
    CAG
    TCG
    > python get-codons.py tcgatcgac
    TCG
    ATC
    GAC
Solution #2
    import sys
    sequence = sys.argv[1]
    upper_sequence = sequence.upper()
    print upper_sequence[:3]
    print upper_sequence[3:6]
    print upper_sequence[6:9]
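The three slices follow a pattern: codon number i (counting from 0) covers positions 3*i up to, but not including, 3*i + 3. The Python 3 sketch below uses a loop to apply that pattern. The function name first_codons and its count parameter are hypothetical, and for sequences shorter than nine letters the later slices simply come back short or empty.

```python
def first_codons(sequence, count=3):
    """Return the first `count` codons of sequence, in uppercase."""
    upper_sequence = sequence.upper()
    codons = []
    for i in range(count):
        # Codon i spans positions 3*i, 3*i + 1 and 3*i + 2.
        codons.append(upper_sequence[3 * i:3 * i + 3])
    return codons

for codon in first_codons("tcgatcgac"):
    print(codon)
# TCG
# ATC
# GAC
```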
Sample problem #3 (optional)
Write a program that reads a protein sequence as a command line argument and prints the location of the first cysteine residue.
    > python find-cysteine.py MNDLSGKTVIITGGARGLGAEAARQAVAAGARVVLADVLDEEGAATARELGDAARYQHLDVTIEEDWQRVCAYAREEFGSVDGL
    70
    > python find-cysteine.py MNDLSGKTVIITGGARGLGAEAARQAVAAGARVVLADVLDEEGAATARELGDAARYQHLDVTIEEDWQRVVAYAREEFGSVDGL
    -1
Solution #3
    import sys
    protein = sys.argv[1]
    upper_protein = protein.upper()
    print upper_protein.find("C")
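The location printed by this solution is a 0-based index, so 70 means the 71st residue, and -1 means no cysteine was found. A sketch in Python 3 that reproduces both example outputs; first_cysteine is a hypothetical function name, and the second sequence is built by replacing the single C with V, which matches the second example input.

```python
def first_cysteine(protein):
    """Return the 0-based index of the first C, or -1 if there is none."""
    return protein.upper().find("C")

with_cys = ("MNDLSGKTVIITGGARGLGAEAARQAVAAGARVVLADVLDEEGAATARELGDAARYQHLDVTI"
            "EEDWQRVCAYAREEFGSVDGL")
# Same sequence with its only cysteine changed to valine.
without_cys = with_cys.replace("C", "V")

print(first_cysteine(with_cys))     # 70
print(first_cysteine(without_cys))  # -1
```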
Reading Chapters 5 and 8 of Learning Python by Lutz.