SlideShare uma empresa Scribd logo
1 de 35
FBW
23-10-2018
Wim Van Criekinge
Google Calendar
Bioinformatics.be
http://course.fast.ai/index.html
RECAP: Lists
• Flexible arrays, not Lisp-like linked
lists
• a = [99, "bottles of beer", ["on", "the",
"wall"]]
• Same operators as for strings
• a+b, a*3, a[0], a[-1], a[1:], len(a)
• Item and slice assignment
• a[0] = 98
• a[1:2] = ["bottles", "of", "beer"]
-> [98, "bottles", "of", "beer", ["on", "the", "wall"]]
• del a[-1] # -> [98, "bottles", "of", "beer"]
RECAP: Dictionaries
• Hash tables, "associative arrays"
• d = {"duck": "eend", "water": "water"}
• Lookup:
• d["duck"] -> "eend"
• d["back"] # raises KeyError exception
• Delete, insert, overwrite:
• del d["water"] # {"duck": "eend", "back": "rug"}
• d["back"] = "rug" # {"duck": "eend", "back":
"rug"}
• d["duck"] = "duik" # {"duck": "duik", "back":
"rug"}
RECAP
if condition:
statements
[elif condition:
statements] ...
else:
statements
while condition:
statements
for var in sequence:
statements
break
continue
Strings
REGULAR EXPRESSIONS
Regular Expressions
http://en.wikipedia.org/wiki/Regular_expression
In computing, a regular expression, also
referred to as "regex" or "regexp", provides a
concise and flexible means for matching
strings of text, such as particular characters,
words, or patterns of characters. A regular
expression is written in a formal language that
can be interpreted by a regular expression
processor.
Really clever "wild card" expressions for
matching and parsing strings.
Understanding Regular Expressions
• Very powerful and quite cryptic
• Fun once you understand them
• Regular expressions are a language
unto themselves
• A language of "marker characters" -
programming with characters
• It is kind of an "old school"
language - compact
Regular Expression Quick Guide
^ Matches the beginning of a line
$ Matches the end of the line
. Matches any character
s Matches whitespace
S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times (non-greedy)
+ Repeats a chracter one or more times
+? Repeats a character one or more times (non-greedy)
[aeiou] Matches a single character in the listed set
[^XYZ] Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end
The Regular Expression Module
• Before you can use regular expressions in
your program, you must import the library
using "import re"
• You can use re.search() to see if a string
matches a regular expression similar to
using the find() method for strings
• You can use re.findall() extract portions of
a string that match your regular expression
similar to a combination of find() and
slicing
Wild-Card Characters
• The dot character matches any
character
• If you add the asterisk character,
the character is "any number of
times"
^X.*:
Match the start of the line
Match any character
Many times
Matching and Extracting Data
• The re.search() returns a True/False
depending on whether the string matches
the regular expression
• If we actually want the matching strings
to be extracted, we use re.findall()
>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+',x)
>>> print y
['2', '19', '42']
Warning: Greedy Matching
• The repeat characters (* and +) push outward in both directions
(greedy) to match the largest possible string
>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+:', x)
>>> print y
['From: Using the :']
^F.+:
One or more
characters
First character in the
match is an F
Last character in the
match is a :
Non-Greedy Matching
• Not all regular expression repeat codes are
greedy! If you add a ? character - the + and *
chill out a bit...
>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+?:', x)
>>> print y
['From:']
^F.+?:
One or more
characters but
not greedily
First character in the
match is an F
Last character in the
match is a :
Fine Tuning String Extraction
• Parenthesis are not part of the match -
but they tell where to start and stop what
string to extract
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16
2008
>>> y = re.findall('S+@S+',x)
>>> print y
['stephen.marquard@uct.ac.za']
>>> y = re.findall('^From (S+@S+)',x)
>>> print y
['stephen.marquard@uct.ac.za']
^From (S+@S+)
The Double Split Version
• Sometimes we split a line one way and then grab
one of the pieces of the line and split that piece
again
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16
2008
words = line.split()
email = words[1]
pieces = email.split('@')
print pieces[1]
stephen.marquard@uct.ac.za
['stephen.marquard', 'uct.ac.za']
'uct.ac.za'
The Regex Version
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16
2008
import re
lin = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:1
y = re.findall('@([^ ]*)',lin)
print y['uct.ac.za']
'@([^ ]*)'
Look through the string until you find an at-sign
Match non-blank character
Match many of them
Escape Character
• If you want a special regular expression
character to just behave normally (most
of the time) you prefix it with ''
>>> import re
>>> x = 'We just received $10.00 for cookies.'
>>> y = re.findall('$[0-9.]+',x)
>>> print y
['$10.00']
$[0-9.]+
A digit or periodA real dollar sign
At least one
or more
Real world problems
• Match IP Addresses, email addresses,
URLs
• Match balanced sets of parenthesis
• Substitute words
• Tokenize
• Validate
• Count
• Delete duplicates
• Natural Language processing
RE in Python
• Unleash the power - built-in re module
• Functions
– to compile patterns
• compile
– to perform matches
• match, search, findall, finditer
– to perform operations on match object
• group, start, end, span
– to substitute
• sub, subn
• - Metacharacters
Regex.py
text = 'abbaaabbbbaaaaa'
pattern = 'ab'
for match in re.finditer(pattern, text):
s = match.start()
e = match.end()
print ('Found "%s" at %d:%d' % (text[s:e], s, e))
27
Reading Files
name = open("filename")
– opens the given file for reading, and returns a file object
name.read() - file's entire contents as a string
name.readline() - next line from file as a string
name.readlines() - file's contents as a list of lines
– the lines from a file object can also be read using a for loop
>>> f = open("hours.txt")
>>> f.read()
'123 Susan 12.5 8.1 7.6 3.2n
456 Brad 4.0 11.6 6.5 2.7 12n
789 Jenn 8.0 8.0 8.0 8.0 7.5n'
28
File Input Template
• A template for reading files in Python:
name = open("filename")
for line in name:
statements
>>> input = open("hours.txt")
>>> for line in input:
... print(line.strip()) # strip() removes n
123 Susan 12.5 8.1 7.6 3.2
456 Brad 4.0 11.6 6.5 2.7 12
789 Jenn 8.0 8.0 8.0 8.0 7.5
29
Writing Files
name = open("filename", "w")
name = open("filename", "a")
– opens file for write (deletes previous contents), or
– opens file for append (new data goes after previous data)
name.write(str) - writes the given string to the file
name.close() - saves file once writing is done
>>> out = open("output.txt", "w")
>>> out.write("Hello, world!n")
>>> out.write("How are you?")
>>> out.close()
>>> open("output.txt").read()
'Hello, world!nHow are you?'
Question 4
• Program your own prosite parser !
• Download prosite pattern database
(prosite.dat)
• Automatically generate >2000 search
patterns, and search in sequence set
from question 1
>SEQ1
MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYISTTIFVISGILNFYCLFIALYT
YYFLDNETRKHYVFVLSRFLSSILVIISLLVLESTLFSESLSPTFAYYAVAFSIYDFSMDTLFFSYIMIS
LITYFGVVHYNFYRRHVSLRSLYIILISMWTFSLAIAIPLGLYEAASNSQGPIKCDLSYCGKVVEWITCS
LQGCDSFYNANELLVQSIISSVETLVGSLVFLTDPLINIFFDKNISKMVKLQLTLGKWFIALYRFLFQMT
NIFENCSTHYSFEKNLQKCVNASNPCQLLQKMNTAHSLMIWMGFYIPSAMCFLAVLVDTYCLLVTISILK
SLKKQSRKQYIFGRANIIGEHNDYVVVRLSAAILIALCIIIIQSTYFIDIPFRDTFAFFAVLFIIYDFSILSLLGSFTGVA
M MTYFGVMRPLVYRDKFTLKTIYIIAFAIVLFSVCVAIPFGLFQAADEIDGPIKCDSESCELIVKWLLFCI
ACLILMGCTGTLLFVTVSLHWHSYKSKKMGNVSSSAFNHGKSRLTWTTTILVILCCVELIPTGLLAAFGK
SESISDDCYDFYNANSLIFPAIVSSLETFLGSITFLLDPIINFSFDKRISKVFSSQVSMFSIFFCGKR
>SEQ2
MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE MDRGLRLETH EEASVKMLPT YVRSTPEGSE
VGDFLSLDLG GTNFRVMLVK VGEGEEGQWS VKTKHQMYSI PEDAMTGTAE MLFDYISECI SDFLDKHQMK
HKKLPLGFTF SFPVRHEDID KGILLNWTKG FKASGAEGNN VVGLLRDAIK RRGDFEMDVV AMVNDTVATM
ISCYYEDHQC EVGMIVGTGC NACYMEEMQN VELVEGDEGR MCVNTEWGAF GDSGELDEFL LEYDRLVDES
SANPGQQLYE KLIGGKYMGE LVRLVLLRLV DENLLFHGEA SEQLRTRGAF ETRFVSQVES DTGDRKQIYN
ILSTLGLRPS TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED VMRITVGVDG SVYKLHPSFK
ERFHASVRRL TPSCEITFIE SEEGSGRGAA LVSAVACKKA CMLGQ
>SEQ3
MESDSFEDFLKGEDFSNYSYSSDLPPFLLDAAPCEPESLEINKYFVVIIYVLVFLLSLLGNSLVMLVILY
SRVGRSGRDNVIGDHVDYVTDVYLLNLALADLLFALTLPIWAASKVTGWIFGTFLCKVVSLLKEVNFYSGILLLA
CISVDRY
LAIVHATRTLTQKRYLVKFICLSIWGLSLLLALPVLIFRKTIYPPYVSPVCYEDMGNNTANWRMLLRILP
QSFGFIVPLLIMLFCYGFTLRTLFKAHMGQKHRAMRVIFAVVLIFLLCWLPYNLVLLADTLMRTWVIQET
CERRNDIDRALEATEILGILGRVNLIGEHWDYHSCLNPLIYAFIGQKFRHGLLKILAIHGLISKDSLPKDSRPSFVGS
SSGH TSTTL
>SEQ4
MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG
GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG
FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK
ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ
IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR
FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI
GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA MEANFQQAVK KLVNDFEYPT ESLREAVKEF
DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC
VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI
NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG
KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE
GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA
SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE
KKGLA
Oefening 1
Question 3. Swiss-Knife.py
• Using a database as input ! Parse
the entire Swiss Prot collection
– How many entries are there ?
– Average Protein Length (in aa and
MW)
– Relative frequency of amino acids
• Compare to the ones used to construct
the PAM scoring matrixes from 1978 –
1991
Question 3: Getting the database
Uniprot_sprot.dat.gz – 528Mb
(on Github onder Files)
https://github.ugent.be/wvcrieki/Bioinformatics
blob/script_branch/Files/swiss-prot.dat
http://www.ebi.ac.uk/uniprot/download-center
Amino acid frequencies
1978 1991
L 0.085 0.091
A 0.087 0.077
G 0.089 0.074
S 0.070 0.069
V 0.065 0.066
E 0.050 0.062
T 0.058 0.059
K 0.081 0.059
I 0.037 0.053
D 0.047 0.052
R 0.041 0.051
P 0.051 0.051
N 0.040 0.043
Q 0.038 0.041
F 0.040 0.040
Y 0.030 0.032
M 0.015 0.024
H 0.034 0.023
C 0.033 0.020
W 0.010 0.014
Second step: Frequencies of Occurence
Extra Questions
• How many records have a sequence of length 260?
• What are the first 20 residues of 143X_MAIZE?
• What is the identifier for the record with the
shortest sequence? Is there more than one record
with that length?
• What is the identifier for the record with the
longest sequence? Is there more than one record
with that length?
• How many contain the subsequence "ARRA"?
• How many contain the substring "KCIP-1" in the
description?

Mais conteúdo relacionado

Mais procurados

Puppet Camp Paris 2015: Power of Puppet 4 (Beginner)
Puppet Camp Paris 2015: Power of Puppet 4 (Beginner) Puppet Camp Paris 2015: Power of Puppet 4 (Beginner)
Puppet Camp Paris 2015: Power of Puppet 4 (Beginner) Puppet
 
Advanced Perl Techniques
Advanced Perl TechniquesAdvanced Perl Techniques
Advanced Perl TechniquesDave Cross
 
Functional Pattern Matching on Python
Functional Pattern Matching on PythonFunctional Pattern Matching on Python
Functional Pattern Matching on PythonDaker Fernandes
 
Programming Haskell Chapter8
Programming Haskell Chapter8Programming Haskell Chapter8
Programming Haskell Chapter8Kousuke Ruichi
 
Pa1 session 3_slides
Pa1 session 3_slidesPa1 session 3_slides
Pa1 session 3_slidesaiclub_slides
 
Streams for (Co)Free!
Streams for (Co)Free!Streams for (Co)Free!
Streams for (Co)Free!John De Goes
 
Refactor like a boss
Refactor like a bossRefactor like a boss
Refactor like a bossgsterndale
 
Introduction to Perl
Introduction to PerlIntroduction to Perl
Introduction to Perlworr1244
 
What's New in PHP 5.5
What's New in PHP 5.5What's New in PHP 5.5
What's New in PHP 5.5Corey Ballou
 
Linked to ArrayList: the full story
Linked to ArrayList: the full storyLinked to ArrayList: the full story
Linked to ArrayList: the full storyJosé Paumard
 

Mais procurados (19)

Begin with Python
Begin with PythonBegin with Python
Begin with Python
 
Strings
StringsStrings
Strings
 
Puppet Camp Paris 2015: Power of Puppet 4 (Beginner)
Puppet Camp Paris 2015: Power of Puppet 4 (Beginner) Puppet Camp Paris 2015: Power of Puppet 4 (Beginner)
Puppet Camp Paris 2015: Power of Puppet 4 (Beginner)
 
Advanced Perl Techniques
Advanced Perl TechniquesAdvanced Perl Techniques
Advanced Perl Techniques
 
Input and Output
Input and OutputInput and Output
Input and Output
 
Lists
ListsLists
Lists
 
The Magic Of Elixir
The Magic Of ElixirThe Magic Of Elixir
The Magic Of Elixir
 
Functional Pattern Matching on Python
Functional Pattern Matching on PythonFunctional Pattern Matching on Python
Functional Pattern Matching on Python
 
Perl6 in-production
Perl6 in-productionPerl6 in-production
Perl6 in-production
 
Programming Haskell Chapter8
Programming Haskell Chapter8Programming Haskell Chapter8
Programming Haskell Chapter8
 
Perl 6 by example
Perl 6 by examplePerl 6 by example
Perl 6 by example
 
Pa1 session 3_slides
Pa1 session 3_slidesPa1 session 3_slides
Pa1 session 3_slides
 
05php
05php05php
05php
 
Streams for (Co)Free!
Streams for (Co)Free!Streams for (Co)Free!
Streams for (Co)Free!
 
Refactor like a boss
Refactor like a bossRefactor like a boss
Refactor like a boss
 
Introduction to Perl
Introduction to PerlIntroduction to Perl
Introduction to Perl
 
Docopt
DocoptDocopt
Docopt
 
What's New in PHP 5.5
What's New in PHP 5.5What's New in PHP 5.5
What's New in PHP 5.5
 
Linked to ArrayList: the full story
Linked to ArrayList: the full storyLinked to ArrayList: the full story
Linked to ArrayList: the full story
 

Semelhante a P3 2018 python_regexes

Python advanced 2. regular expression in python
Python advanced 2. regular expression in pythonPython advanced 2. regular expression in python
Python advanced 2. regular expression in pythonJohn(Qiang) Zhang
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to PythonUC San Diego
 
Python Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard WayPython Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard WayUtkarsh Sengar
 
GE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python ProgrammingGE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python ProgrammingMuthu Vinayagam
 
An overview of Python 2.7
An overview of Python 2.7An overview of Python 2.7
An overview of Python 2.7decoupled
 
第二讲 Python基礎
第二讲 Python基礎第二讲 Python基礎
第二讲 Python基礎juzihua1102
 
第二讲 预备-Python基礎
第二讲 预备-Python基礎第二讲 预备-Python基礎
第二讲 预备-Python基礎anzhong70
 
Python tutorial
Python tutorialPython tutorial
Python tutorialRajiv Risi
 
app4.pptx
app4.pptxapp4.pptx
app4.pptxsg4795
 
Dynamic languages, for software craftmanship group
Dynamic languages, for software craftmanship groupDynamic languages, for software craftmanship group
Dynamic languages, for software craftmanship groupReuven Lerner
 
unit-4 regular expression.pptx
unit-4 regular expression.pptxunit-4 regular expression.pptx
unit-4 regular expression.pptxPadreBhoj
 
Slides chapter3part1 ruby-forjavaprogrammers
Slides chapter3part1 ruby-forjavaprogrammersSlides chapter3part1 ruby-forjavaprogrammers
Slides chapter3part1 ruby-forjavaprogrammersGiovanni924
 

Semelhante a P3 2018 python_regexes (20)

P3 2017 python_regexes
P3 2017 python_regexesP3 2017 python_regexes
P3 2017 python_regexes
 
Python advanced 2. regular expression in python
Python advanced 2. regular expression in pythonPython advanced 2. regular expression in python
Python advanced 2. regular expression in python
 
P2 2017 python_strings
P2 2017 python_stringsP2 2017 python_strings
P2 2017 python_strings
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
 
Python Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard WayPython Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard Way
 
Ggplot2 v3
Ggplot2 v3Ggplot2 v3
Ggplot2 v3
 
GE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python ProgrammingGE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python Programming
 
Perl_Tutorial_v1
Perl_Tutorial_v1Perl_Tutorial_v1
Perl_Tutorial_v1
 
Perl_Tutorial_v1
Perl_Tutorial_v1Perl_Tutorial_v1
Perl_Tutorial_v1
 
An overview of Python 2.7
An overview of Python 2.7An overview of Python 2.7
An overview of Python 2.7
 
A tour of Python
A tour of PythonA tour of Python
A tour of Python
 
第二讲 Python基礎
第二讲 Python基礎第二讲 Python基礎
第二讲 Python基礎
 
第二讲 预备-Python基礎
第二讲 预备-Python基礎第二讲 预备-Python基礎
第二讲 预备-Python基礎
 
Python tutorial
Python tutorialPython tutorial
Python tutorial
 
app4.pptx
app4.pptxapp4.pptx
app4.pptx
 
R language introduction
R language introductionR language introduction
R language introduction
 
Dynamic languages, for software craftmanship group
Dynamic languages, for software craftmanship groupDynamic languages, for software craftmanship group
Dynamic languages, for software craftmanship group
 
unit-4 regular expression.pptx
unit-4 regular expression.pptxunit-4 regular expression.pptx
unit-4 regular expression.pptx
 
Java string handling
Java string handlingJava string handling
Java string handling
 
Slides chapter3part1 ruby-forjavaprogrammers
Slides chapter3part1 ruby-forjavaprogrammersSlides chapter3part1 ruby-forjavaprogrammers
Slides chapter3part1 ruby-forjavaprogrammers
 

Mais de Prof. Wim Van Criekinge

2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_upload2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_uploadProf. Wim Van Criekinge
 
2019 03 05_biological_databases_part4_v_upload
2019 03 05_biological_databases_part4_v_upload2019 03 05_biological_databases_part4_v_upload
2019 03 05_biological_databases_part4_v_uploadProf. Wim Van Criekinge
 
2019 03 05_biological_databases_part3_v_upload
2019 03 05_biological_databases_part3_v_upload2019 03 05_biological_databases_part3_v_upload
2019 03 05_biological_databases_part3_v_uploadProf. Wim Van Criekinge
 
2019 02 21_biological_databases_part2_v_upload
2019 02 21_biological_databases_part2_v_upload2019 02 21_biological_databases_part2_v_upload
2019 02 21_biological_databases_part2_v_uploadProf. Wim Van Criekinge
 
2019 02 12_biological_databases_part1_v_upload
2019 02 12_biological_databases_part1_v_upload2019 02 12_biological_databases_part1_v_upload
2019 02 12_biological_databases_part1_v_uploadProf. Wim Van Criekinge
 
Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]Prof. Wim Van Criekinge
 
2018 03 27_biological_databases_part4_v_upload
2018 03 27_biological_databases_part4_v_upload2018 03 27_biological_databases_part4_v_upload
2018 03 27_biological_databases_part4_v_uploadProf. Wim Van Criekinge
 
2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_upload2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_uploadProf. Wim Van Criekinge
 
2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_upload2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_uploadProf. Wim Van Criekinge
 

Mais de Prof. Wim Van Criekinge (20)

2020 02 11_biological_databases_part1
2020 02 11_biological_databases_part12020 02 11_biological_databases_part1
2020 02 11_biological_databases_part1
 
2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_upload2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_upload
 
2019 03 05_biological_databases_part4_v_upload
2019 03 05_biological_databases_part4_v_upload2019 03 05_biological_databases_part4_v_upload
2019 03 05_biological_databases_part4_v_upload
 
2019 03 05_biological_databases_part3_v_upload
2019 03 05_biological_databases_part3_v_upload2019 03 05_biological_databases_part3_v_upload
2019 03 05_biological_databases_part3_v_upload
 
2019 02 21_biological_databases_part2_v_upload
2019 02 21_biological_databases_part2_v_upload2019 02 21_biological_databases_part2_v_upload
2019 02 21_biological_databases_part2_v_upload
 
2019 02 12_biological_databases_part1_v_upload
2019 02 12_biological_databases_part1_v_upload2019 02 12_biological_databases_part1_v_upload
2019 02 12_biological_databases_part1_v_upload
 
P7 2018 biopython3
P7 2018 biopython3P7 2018 biopython3
P7 2018 biopython3
 
P6 2018 biopython2b
P6 2018 biopython2bP6 2018 biopython2b
P6 2018 biopython2b
 
T1 2018 bioinformatics
T1 2018 bioinformaticsT1 2018 bioinformatics
T1 2018 bioinformatics
 
P1 2018 python
P1 2018 pythonP1 2018 python
P1 2018 python
 
Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]
 
2018 05 08_biological_databases_no_sql
2018 05 08_biological_databases_no_sql2018 05 08_biological_databases_no_sql
2018 05 08_biological_databases_no_sql
 
2018 03 27_biological_databases_part4_v_upload
2018 03 27_biological_databases_part4_v_upload2018 03 27_biological_databases_part4_v_upload
2018 03 27_biological_databases_part4_v_upload
 
2018 03 20_biological_databases_part3
2018 03 20_biological_databases_part32018 03 20_biological_databases_part3
2018 03 20_biological_databases_part3
 
2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_upload2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_upload
 
2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_upload2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_upload
 
P7 2017 biopython3
P7 2017 biopython3P7 2017 biopython3
P7 2017 biopython3
 
P6 2017 biopython2
P6 2017 biopython2P6 2017 biopython2
P6 2017 biopython2
 
Van criekinge 2017_11_13_rodebiotech
Van criekinge 2017_11_13_rodebiotechVan criekinge 2017_11_13_rodebiotech
Van criekinge 2017_11_13_rodebiotech
 
P4 2017 io
P4 2017 ioP4 2017 io
P4 2017 io
 

Último

How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17Celine George
 
Indexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdfIndexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdfChristalin Nelson
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWQuiz Club NITW
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationdeepaannamalai16
 
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDhatriParmar
 
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvRicaMaeCastro1
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...DhatriParmar
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Developmentchesterberbo7
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxBIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxSayali Powar
 
MS4 level being good citizen -imperative- (1) (1).pdf
MS4 level   being good citizen -imperative- (1) (1).pdfMS4 level   being good citizen -imperative- (1) (1).pdf
MS4 level being good citizen -imperative- (1) (1).pdfMr Bounab Samir
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1GloryAnnCastre1
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Tree View Decoration Attribute in the Odoo 17
Tree View Decoration Attribute in the Odoo 17Tree View Decoration Attribute in the Odoo 17
Tree View Decoration Attribute in the Odoo 17Celine George
 
4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptxmary850239
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxAnupam32727
 

Último (20)

How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17
 
Indexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdfIndexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdf
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITW
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentation
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
Mattingly "AI & Prompt Design: Large Language Models"
Mattingly "AI & Prompt Design: Large Language Models"Mattingly "AI & Prompt Design: Large Language Models"
Mattingly "AI & Prompt Design: Large Language Models"
 
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
 
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Development
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxBIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
 
MS4 level being good citizen -imperative- (1) (1).pdf
MS4 level   being good citizen -imperative- (1) (1).pdfMS4 level   being good citizen -imperative- (1) (1).pdf
MS4 level being good citizen -imperative- (1) (1).pdf
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Tree View Decoration Attribute in the Odoo 17
Tree View Decoration Attribute in the Odoo 17Tree View Decoration Attribute in the Odoo 17
Tree View Decoration Attribute in the Odoo 17
 
4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
 

P3 2018 python_regexes

  • 1.
  • 5.
  • 7. RECAP: Lists • Flexible arrays, not Lisp-like linked lists • a = [99, "bottles of beer", ["on", "the", "wall"]] • Same operators as for strings • a+b, a*3, a[0], a[-1], a[1:], len(a) • Item and slice assignment • a[0] = 98 • a[1:2] = ["bottles", "of", "beer"] -> [98, "bottles", "of", "beer", ["on", "the", "wall"]] • del a[-1] # -> [98, "bottles", "of", "beer"]
  • 8. RECAP: Dictionaries • Hash tables, "associative arrays" • d = {"duck": "eend", "water": "water"} • Lookup: • d["duck"] -> "eend" • d["back"] # raises KeyError exception • Delete, insert, overwrite: • del d["water"] # {"duck": "eend", "back": "rug"} • d["back"] = "rug" # {"duck": "eend", "back": "rug"} • d["duck"] = "duik" # {"duck": "duik", "back": "rug"}
  • 9. RECAP if condition: statements [elif condition: statements] ... else: statements while condition: statements for var in sequence: statements break continue Strings REGULAR EXPRESSIONS
  • 10. Regular Expressions http://en.wikipedia.org/wiki/Regular_expression In computing, a regular expression, also referred to as "regex" or "regexp", provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor. Really clever "wild card" expressions for matching and parsing strings.
  • 11. Understanding Regular Expressions • Very powerful and quite cryptic • Fun once you understand them • Regular expressions are a language unto themselves • A language of "marker characters" - programming with characters • It is kind of an "old school" language - compact
  • 12. Regular Expression Quick Guide ^ Matches the beginning of a line $ Matches the end of the line . Matches any character s Matches whitespace S Matches any non-whitespace character * Repeats a character zero or more times *? Repeats a character zero or more times (non-greedy) + Repeats a chracter one or more times +? Repeats a character one or more times (non-greedy) [aeiou] Matches a single character in the listed set [^XYZ] Matches a single character not in the listed set [a-z0-9] The set of characters can include a range ( Indicates where string extraction is to start ) Indicates where string extraction is to end
  • 13. The Regular Expression Module • Before you can use regular expressions in your program, you must import the library using "import re" • You can use re.search() to see if a string matches a regular expression similar to using the find() method for strings • You can use re.findall() extract portions of a string that match your regular expression similar to a combination of find() and slicing
  • 14. Wild-Card Characters • The dot character matches any character • If you add the asterisk character, the character is "any number of times" ^X.*: Match the start of the line Match any character Many times
  • 15. Matching and Extracting Data • The re.search() returns a True/False depending on whether the string matches the regular expression • If we actually want the matching strings to be extracted, we use re.findall() >>> import re >>> x = 'My 2 favorite numbers are 19 and 42' >>> y = re.findall('[0-9]+',x) >>> print y ['2', '19', '42']
  • 16. Warning: Greedy Matching • The repeat characters (* and +) push outward in both directions (greedy) to match the largest possible string >>> import re >>> x = 'From: Using the : character' >>> y = re.findall('^F.+:', x) >>> print y ['From: Using the :'] ^F.+: One or more characters First character in the match is an F Last character in the match is a :
  • 17. Non-Greedy Matching • Not all regular expression repeat codes are greedy! If you add a ? character - the + and * chill out a bit... >>> import re >>> x = 'From: Using the : character' >>> y = re.findall('^F.+?:', x) >>> print y ['From:'] ^F.+?: One or more characters but not greedily First character in the match is an F Last character in the match is a :
  • 18. Fine Tuning String Extraction • Parenthesis are not part of the match - but they tell where to start and stop what string to extract From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008 >>> y = re.findall('S+@S+',x) >>> print y ['stephen.marquard@uct.ac.za'] >>> y = re.findall('^From (S+@S+)',x) >>> print y ['stephen.marquard@uct.ac.za'] ^From (S+@S+)
  • 19. The Double Split Version • Sometimes we split a line one way and then grab one of the pieces of the line and split that piece again From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008 words = line.split() email = words[1] pieces = email.split('@') print pieces[1] stephen.marquard@uct.ac.za ['stephen.marquard', 'uct.ac.za'] 'uct.ac.za'
  • 20. The Regex Version From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008 import re lin = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:1 y = re.findall('@([^ ]*)',lin) print y['uct.ac.za'] '@([^ ]*)' Look through the string until you find an at-sign Match non-blank character Match many of them
  • 21. Escape Character • If you want a special regular expression character to just behave normally (most of the time) you prefix it with '' >>> import re >>> x = 'We just received $10.00 for cookies.' >>> y = re.findall('$[0-9.]+',x) >>> print y ['$10.00'] $[0-9.]+ A digit or periodA real dollar sign At least one or more
  • 22. Real world problems • Match IP Addresses, email addresses, URLs • Match balanced sets of parenthesis • Substitute words • Tokenize • Validate • Count • Delete duplicates • Natural Language processing
  • 23.
  • 24.
  • 25. RE in Python • Unleash the power - built-in re module • Functions – to compile patterns • compile – to perform matches • match, search, findall, finditer – to perform operations on match object • group, start, end, span – to substitute • sub, subn • - Metacharacters
  • 26. Regex.py text = 'abbaaabbbbaaaaa' pattern = 'ab' for match in re.finditer(pattern, text): s = match.start() e = match.end() print ('Found "%s" at %d:%d' % (text[s:e], s, e))
  • 27. 27 Reading Files name = open("filename") – opens the given file for reading, and returns a file object name.read() - file's entire contents as a string name.readline() - next line from file as a string name.readlines() - file's contents as a list of lines – the lines from a file object can also be read using a for loop >>> f = open("hours.txt") >>> f.read() '123 Susan 12.5 8.1 7.6 3.2n 456 Brad 4.0 11.6 6.5 2.7 12n 789 Jenn 8.0 8.0 8.0 8.0 7.5n'
  • 28. 28 File Input Template • A template for reading files in Python: name = open("filename") for line in name: statements >>> input = open("hours.txt") >>> for line in input: ... print(line.strip()) # strip() removes n 123 Susan 12.5 8.1 7.6 3.2 456 Brad 4.0 11.6 6.5 2.7 12 789 Jenn 8.0 8.0 8.0 8.0 7.5
  • 29. 29 Writing Files name = open("filename", "w") name = open("filename", "a") – opens file for write (deletes previous contents), or – opens file for append (new data goes after previous data) name.write(str) - writes the given string to the file name.close() - saves file once writing is done >>> out = open("output.txt", "w") >>> out.write("Hello, world!n") >>> out.write("How are you?") >>> out.close() >>> open("output.txt").read() 'Hello, world!nHow are you?'
  • 30. Question 4 • Program your own prosite parser ! • Download prosite pattern database (prosite.dat) • Automatically generate >2000 search patterns, and search in sequence set from question 1
  • 31. >SEQ1 MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYISTTIFVISGILNFYCLFIALYT YYFLDNETRKHYVFVLSRFLSSILVIISLLVLESTLFSESLSPTFAYYAVAFSIYDFSMDTLFFSYIMIS LITYFGVVHYNFYRRHVSLRSLYIILISMWTFSLAIAIPLGLYEAASNSQGPIKCDLSYCGKVVEWITCS LQGCDSFYNANELLVQSIISSVETLVGSLVFLTDPLINIFFDKNISKMVKLQLTLGKWFIALYRFLFQMT NIFENCSTHYSFEKNLQKCVNASNPCQLLQKMNTAHSLMIWMGFYIPSAMCFLAVLVDTYCLLVTISILK SLKKQSRKQYIFGRANIIGEHNDYVVVRLSAAILIALCIIIIQSTYFIDIPFRDTFAFFAVLFIIYDFSILSLLGSFTGVA M MTYFGVMRPLVYRDKFTLKTIYIIAFAIVLFSVCVAIPFGLFQAADEIDGPIKCDSESCELIVKWLLFCI ACLILMGCTGTLLFVTVSLHWHSYKSKKMGNVSSSAFNHGKSRLTWTTTILVILCCVELIPTGLLAAFGK SESISDDCYDFYNANSLIFPAIVSSLETFLGSITFLLDPIINFSFDKRISKVFSSQVSMFSIFFCGKR >SEQ2 MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE MDRGLRLETH EEASVKMLPT YVRSTPEGSE VGDFLSLDLG GTNFRVMLVK VGEGEEGQWS VKTKHQMYSI PEDAMTGTAE MLFDYISECI SDFLDKHQMK HKKLPLGFTF SFPVRHEDID KGILLNWTKG FKASGAEGNN VVGLLRDAIK RRGDFEMDVV AMVNDTVATM ISCYYEDHQC EVGMIVGTGC NACYMEEMQN VELVEGDEGR MCVNTEWGAF GDSGELDEFL LEYDRLVDES SANPGQQLYE KLIGGKYMGE LVRLVLLRLV DENLLFHGEA SEQLRTRGAF ETRFVSQVES DTGDRKQIYN ILSTLGLRPS TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED VMRITVGVDG SVYKLHPSFK ERFHASVRRL TPSCEITFIE SEEGSGRGAA LVSAVACKKA CMLGQ >SEQ3 MESDSFEDFLKGEDFSNYSYSSDLPPFLLDAAPCEPESLEINKYFVVIIYVLVFLLSLLGNSLVMLVILY SRVGRSGRDNVIGDHVDYVTDVYLLNLALADLLFALTLPIWAASKVTGWIFGTFLCKVVSLLKEVNFYSGILLLA CISVDRY LAIVHATRTLTQKRYLVKFICLSIWGLSLLLALPVLIFRKTIYPPYVSPVCYEDMGNNTANWRMLLRILP QSFGFIVPLLIMLFCYGFTLRTLFKAHMGQKHRAMRVIFAVVLIFLLCWLPYNLVLLADTLMRTWVIQET CERRNDIDRALEATEILGILGRVNLIGEHWDYHSCLNPLIYAFIGQKFRHGLLKILAIHGLISKDSLPKDSRPSFVGS SSGH TSTTL >SEQ4 MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA Oefening 1
  • 32. Question 3. Swiss-Knife.py • Using a database as input ! Parse the entire Swiss Prot collection – How many entries are there ? – Average Protein Length (in aa and MW) – Relative frequency of amino acids • Compare to the ones used to construct the PAM scoring matrixes from 1978 – 1991
  • 33. Question 3: Getting the database Uniprot_sprot.dat.gz – 528Mb (on Github onder Files) https://github.ugent.be/wvcrieki/Bioinformatics blob/script_branch/Files/swiss-prot.dat http://www.ebi.ac.uk/uniprot/download-center
  • 34. Amino acid frequencies 1978 1991 L 0.085 0.091 A 0.087 0.077 G 0.089 0.074 S 0.070 0.069 V 0.065 0.066 E 0.050 0.062 T 0.058 0.059 K 0.081 0.059 I 0.037 0.053 D 0.047 0.052 R 0.041 0.051 P 0.051 0.051 N 0.040 0.043 Q 0.038 0.041 F 0.040 0.040 Y 0.030 0.032 M 0.015 0.024 H 0.034 0.023 C 0.033 0.020 W 0.010 0.014 Second step: Frequencies of Occurence
  • 35. Extra Questions • How many records have a sequence of length 260? • What are the first 20 residues of 143X_MAIZE? • What is the identifier for the record with the shortest sequence? Is there more than one record with that length? • What is the identifier for the record with the longest sequence? Is there more than one record with that length? • How many contain the subsequence "ARRA"? • How many contain the substring "KCIP-1" in the description?