Burkhard Morgenstern

Institut für Mikrobiologie und Genetik

Molecular Evolution and Reconstruction of Phylogenetic Trees

Winter semester 2006/2007

Phylogeny reconstruction based on molecular sequence data (DNA, RNA, protein sequences)

Multiple sequence alignment

Molecular phylogeny reconstruction relies on comparative nucleic acid and protein sequence analysis

Alignment most important tool for sequence comparison

Multiple alignment contains more information than pair-wise alignment

Tools for multiple sequence alignment

Y I M Q E V Q Q E R

Sequence duplicates in history (e.g. speciation event)

Y I M Q E V Q Q E R

Y I M Q E A Q Q E R

Y L M Q E V Q Q E R

Substitutions occur

Y A I M Q E A Q Q E R

Y L M - - V Q Q E R V

Insertions/deletions (indels) occur

Y A I M Q E A Q Q E R

Y L M V Q Q E R V

Because of insertions/deletions, sequence similarity is no longer immediately visible!

Y A I M Q E A Q Q E R -

Y - L M V - - Q Q E R V

Alignment brings together related parts of the sequences by inserting gaps into sequences

Mismatches correspond to substitutions

Gaps correspond to indels
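This correspondence can be made concrete with a minimal sketch that classifies every column of the pairwise alignment shown above. The function name `classify_columns` is an assumption made for illustration; it is not part of the lecture material.

```python
def classify_columns(row_a, row_b):
    """Count match, mismatch and gap columns of a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            counts["gap"] += 1        # column interpreted as an indel
        elif x == y:
            counts["match"] += 1      # residue unchanged
        else:
            counts["mismatch"] += 1   # column interpreted as a substitution
    return counts


# The two rows of the example alignment, written without spaces.
counts = classify_columns("YAIMQEAQQER-", "Y-LMV--QQERV")
print(counts)
```

Each gap column is counted separately here; whether adjacent gap columns stem from a single indel event is a modelling decision.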

Pairwise alignment: alignment of two sequences

Multiple alignment: alignment of N > 2 sequences

s1 R Y I M R E A Q Y E S A Q

s2 R C I V M R E A Y E

s3 Y I M Q E V Q Q E R

s4 W R Y I A M R E Q Y E

Assumption: sequence family related by common ancestry; similarity due to common history

Sequence similarity not obvious (insertions and deletions may have happened)

s1 - R Y I - M R E A Q Y E S A Q

s2 - R C I V M R E A - Y E - - -

s3 - - Y I - M Q E V Q Q E R - -

s4 W R Y I A M R E - Q Y E - - -

Multiple alignment = arrangement of sequences by introducing gaps

Alignment reveals sequence similarities

s2 - R C I V M R E A - Y E - - -

s3 - - Y I - M Q E V Q Q E R - -

s4 W R Y I A M R E - Q Y E - - -

General information in multiple alignment:

Functionally important regions are more conserved than non-functional regions

Local sequence conservation indicates functionality!

Phylogeny reconstruction based on multiple alignment:

Estimate pairwise distances between sequences (distance-based methods for tree reconstruction)

Estimate evolutionary events (parsimony and maximum likelihood methods)

Task in bioinformatics: Find best multiple alignment for given sequence set

s2 - R C I V M R E A - Y E - - -

s3 - - Y I - M Q E V Q Q E R - -

s4 W R Y I A M R E - Q Y E - - -

Astronomical number of possible alignments!

s2 - R C I V M R E A - - - Y E -

s3 Y I - - - M Q E V Q Q E R - -

s4 W R Y I A M R E - Q Y E - - -

Computer has to decide: which one is best??

Questions in development of alignment programs:

(1) What is a good alignment?

→ objective function ('score')

(2) How to find a good alignment?

→ optimization algorithm

First question far more important!

Before defining an objective function (scoring scheme):

What is a biologically good alignment?

Criteria for alignment quality:

1. 3D-Structure: align residues at corresponding positions in 3D structure of protein!

Species related by common history

Genes / proteins related by common history

2. Evolution: align residues with common ancestors!

s2 - R C I V M R E A - Y E - - -

s3 - - Y I - M Q E V Q Q E R - -

s4 W R Y I A M R E - Q Y E - - -

Alignment: hypothesis about sequence evolution

Mismatches correspond to substitutions

Gaps correspond to insertions/deletions

Alignment: hypothesis about sequence evolution

Search for most plausible scenario!

Estimate probabilities for individual evolutionary events: insertions/deletions, substitutions

s2 - R C I V M R E A - Y E - - -

s3 - Y - I - M Q E V Q Q E R - -

s4 W R Y I A M R E - Q Y E - - -

Compute score s(a,b) for degree of similarity between amino acids a and b based on probability of substitution a → b (or b → a)

(Extremely simplified!)

Reason for different substitution probabilities p_{a,b}:

Different physical and chemical properties of amino acids

Amino acids with similar properties more likely to be substituted against each other

Use penalty for gaps introduced into alignment

Simplest approach: linear gap costs: penalty proportional to gap length

Non-linear gap penalties more realistic: long gap caused by single insertion/deletion

Most frequently used: affine linear gap penalties: more realistic, but efficient to calculate!
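The difference between the two gap-cost schemes can be sketched as follows. All parameter values (`g`, `gap_open`, `gap_extend`) are assumptions chosen for illustration, and the convention that the opening penalty already covers the first gap position is one of several in use.

```python
def linear_gap_cost(length, g=2.0):
    """Linear gap cost: penalty proportional to gap length."""
    return g * length


def affine_gap_cost(length, gap_open=10.0, gap_extend=1.0):
    """Affine gap cost: one opening penalty plus a smaller cost per extension.

    Assumed convention: gap_open covers the first gap position.
    """
    if length == 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


for length in (1, 2, 5):
    print(length, linear_gap_cost(length), affine_gap_cost(length))
```

Under the affine scheme one long gap costs less than several short gaps of the same total length, which matches the idea that a long gap can result from a single insertion/deletion.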

Traditional objective functions:

Define score of an alignment as the sum of individual similarity scores s(a,b) minus gap penalties

Needleman-Wunsch scoring system for pairwise alignment (1970)

Pair-wise sequence alignment

T Y W I V

T - - L V

Example:

Score = s(T,T) + s(I,L) + s(V,V) - 2g

Assumption: linear gap penalty!
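The example score can be evaluated with a minimal sketch. The similarity values in `toy_score` and the gap penalty g = 2 are assumptions made for illustration; they are not taken from a real substitution matrix.

```python
def toy_score(a, b):
    """Hypothetical similarity scores: identical 2, I/L 1, everything else -1."""
    if a == b:
        return 2
    if {a, b} == {"I", "L"}:
        return 1
    return -1


def alignment_score(row_a, row_b, s, g):
    """Sum of similarity scores minus a linear gap penalty g per gap position."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total -= g
        else:
            total += s(x, y)
    return total


# s(T,T) + s(I,L) + s(V,V) - 2g = 2 + 1 + 2 - 4
score = alignment_score("TYWIV", "T--LV", toy_score, g=2)
print(score)
```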

Dynamic-programming algorithm finds alignment with best score (Needleman and Wunsch, 1970)

Running time proportional to product of sequence lengths

Time-complexity O(l1 * l2)
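A minimal sketch of the dynamic-programming approach with a linear gap penalty is given below. It fills a table of (l1 + 1) * (l2 + 1) cells, which is where the O(l1 * l2) running time comes from. The scoring function `toy_score` and the penalty g = 2 are assumptions for illustration, and ties in the traceback are broken arbitrarily.

```python
def toy_score(a, b):
    """Hypothetical similarity scores: identical 2, I/L 1, everything else -1."""
    if a == b:
        return 2
    if {a, b} == {"I", "L"}:
        return 1
    return -1


def needleman_wunsch(a, b, s, g):
    """Global pairwise alignment with linear gap penalty g (integer scores)."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -g * i
    for j in range(1, m + 1):
        F[0][j] = -g * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                F[i - 1][j] - g,                          # gap in b
                F[i][j - 1] - g,                          # gap in a
            )
    # Traceback from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - g:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("TYWIV", "TLV", toy_score, g=2)
print(best)
print(top)
print(bottom)
```

With these assumed scores the sketch recovers the example alignment T Y W I V / T - - L V.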

Algorithm for pairwise alignment can be generalized to multiple alignment of N sequences

Time-complexity O(l1 * l2 * … * lN)

Not feasible in practice (running time too long!)

Heuristic necessary, i.e. fast algorithm that does not necessarily produce mathematically best alignment
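A short worked calculation shows why the exact generalization breaks down. The sketch counts only the cells of the N-dimensional dynamic-programming table; the sequence length of 100 is an assumption for illustration.

```python
from math import prod


def dp_table_cells(lengths):
    """Number of cells in the dynamic-programming table for the given lengths."""
    return prod(length + 1 for length in lengths)


pairwise = dp_table_cells([100, 100])
five_way = dp_table_cells([100] * 5)
print(pairwise)   # 10201
print(five_way)   # 10510100501
```

Going from two to five sequences of length 100 raises the table size from about ten thousand to about ten billion cells, before counting the work per cell.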

'Progressive' alignment

Most popular approach to (global) multiple sequence alignment: progressive alignment

Since the mid-1980s: Feng/Doolittle, Higgins/Sharp, Taylor, …

WCEAQTKNGQGWVPSNYITPVN

WWRLNDKEGYVPRNLLGLYP

AVVIQDNSDIKVVPKAKIIRD

YAVESEAHPGSFQPVAALERIN

WLNYNETTGERGDFPGTYVEYIGRKKISP

WWRLNDKEGYVPRNLLGLYP

AVVIQDNSDIKVVPKAKIIRD

YAVESEAHPGSFQPVAALERIN

WLNYNETTGERGDFPGTYVEYIGRKKISP

Guide tree

WW--RLNDKEGYVPRNLLGLYP-

AVVIQDNSDIKVVP--KAKIIRD

YAVESEASFQPVAALERIN

WLNYNEERGDFPGTYVEYIGRKKISP

Profile alignment, “once a gap - always a gap”

YAVESEASVQ--PVAALERIN------

WLN-YNEERGDFPGTYVEYIGRKKISP

WCEAQTKNGQGWVPSNYITPVN-

YAVESEASVQ--PVAALERIN------

WLN-YNEERGDFPGTYVEYIGRKKISP

WCEAQTKNGQGWVPSNYITPVN--------

WW--RLNDKEGYVPRNLLGLYP--------

AVVIQDNSDIKVVP--KAKIIRD-------

YAVESEA---SVQ--PVAALERIN------

WLN-YNE---ERGDFPGTYVEYIGRKKISP

Most important implementation: CLUSTAL W

CLUSTAL W; Thompson et al., 1994 (~17,000 citations)

Pairwise distances as 1 - percentage of identity

Calculate unrooted tree with Neighbor Joining

Define root as central position in tree

Define sequence weights based on tree

Gap penalties calculated based on various parameters
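The first step, a distance defined as 1 minus the fraction of identical positions, can be sketched as follows. This is not CLUSTAL W's code; the function name `identity_distance` and the choice to ignore columns containing a gap are assumptions made for illustration.

```python
def identity_distance(row_a, row_b):
    """Distance = 1 - fraction of identical residues over gap-free columns."""
    compared = 0
    identical = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            continue                  # assumption: gapped columns are skipped
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 1.0 - identical / compared


# Hypothetical aligned pair: 5 gap-free columns, 4 of them identical.
d = identity_distance("ACGT-A", "ACCTTA")
print(round(d, 3))
```

A matrix of such distances for all sequence pairs is the input a distance-based tree method such as Neighbor Joining works on.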

Problems with traditional approach:

Results depend on gap penalty

Heuristic guide tree determines alignment; alignment used for phylogeny reconstruction

Algorithm produces global alignments.

Many sequence families share only local similarity

E.g. sequences share one conserved motif

The DIALIGN approach

Morgenstern, Dress, Werner (1996), PNAS 93, 12098-12103

Combination of global and local methods

Assemble multiple alignment from gap-free local pairwise alignments ("fragments")

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

atc------taatagttaaactcccccgtgcttag

caaa--gagtatcacccctgaattgaataa

caaa--gagtatcacc----------cctgaattgaataa

atc------taatagttaaactcccccgtgc-ttag

cagtgcgtgtattactaac----------gg-ttcaatcgcg

Consistency!

atc------TAATAGTTAaactccccCGTGC-TTag

cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg

caaa--GAGTATCAcc----------CCTGaaTTGAATaa

More methods for multiple alignment:

T-Coffee, PIMA, MUSCLE, Prrp, MAFFT, ProbCons

Substitution matrices

Similarity score s(a,b) for amino acids a and b based on probability p_{a,b} of substitution a → b

Idea: it is more reasonable to align amino acids that are often replaced by each other!

Assumptions:

p_{a,b} does not depend on sequence position

Sequence positions independent of each other

p_{a,b} = p_{b,a} (symmetry!)

Compute score s(a,b) for degree of similarity between amino acids a and b:

Probability p_{a,b} of substitution a → b (or b → a)

Frequency q_a of a

Define

s(a,b) = log( p_{a,b} / (q_a q_b) )
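A minimal sketch of this log-odds definition follows. The probabilities and frequencies passed in are made-up numbers for illustration, and the natural logarithm is an assumption (published matrices usually use a scaled logarithm and round to integers).

```python
from math import log


def log_odds(p_ab, q_a, q_b):
    """s(a,b) = log( p_ab / (q_a * q_b) ).

    p_ab: probability of seeing a aligned with b in related sequences
    q_a, q_b: background frequencies of a and b
    """
    return log(p_ab / (q_a * q_b))


# Hypothetical values: the pair is seen twice as often as expected by chance.
s_frequent = log_odds(0.08, 0.2, 0.2)
# Hypothetical values: the pair is seen exactly as often as expected by chance.
s_neutral = log_odds(0.04, 0.2, 0.2)
print(s_frequent, s_neutral)
```

Pairs that occur more often than chance get a positive score, pairs that occur less often get a negative score.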

To calculate p_{a,b}:

Consider alignments of related proteins and count substitutions a → b (or b → a)

ESWTS-RQWERYTIALMSDQRREVLYWIALY

ERWTSERQWERYTLALMS-QRREALYWIALY

Problems involved:

1. Probability p_{a,b} depends on time t since sequences separated in evolution: p_{a,b} = p_{a,b}(t)

2. Protein families contain multiple sequences: phylogenetic tree must be known!

3. Alignment of protein families must be known!

4. Multiple mutations at one sequence position

M. Dayhoff et al., Atlas of Protein Sequence and Structure, 1978

PAM matrices

Calculation of p_{a,b}(t):

Consider multiple alignments of closely related protein families

Count occurrence of a and b at corresponding positions in alignments using phylogenetic tree

Estimate p_{a,b}(t) for small times t

Calculate conditional probabilities p(a|b,t) for small t

Normalize to distance 1 PAM (PAM = point accepted mutation; 1 PAM = 1% accepted mutations)

Calculate p(a|b,t) for larger evolutionary distances by matrix multiplication

Calculate p_{a,b}(t) for larger evolutionary distances
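The extrapolation by matrix multiplication can be sketched with a toy two-letter alphabet. The matrix `pam1_toy` is a made-up example, not Dayhoff's data, and it is assumed to be row-stochastic (entry [a][b] is the probability that a is replaced by b within 1 PAM).

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, n):
    """n-th power of M by repeated multiplication (n >= 0)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Hypothetical 1-PAM matrix: each letter changes with probability 0.01.
pam1_toy = [[0.99, 0.01],
            [0.01, 0.99]]

pam2_toy = mat_pow(pam1_toy, 2)
pam250_toy = mat_pow(pam1_toy, 250)
print(pam2_toy[0][1])
print(pam250_toy[0][1])
```

The change probability after 250 steps stays below 0.5 instead of growing to 2.5, because repeated multiplication accounts for multiple mutations at the same position.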

Alternative: BLOSUM matrices

S. Henikoff and J.G. Henikoff, PNAS, 1992

Basis: BLOCKS database, gap-free regions of multiple alignments.

Cluster sequences if percentage of identity > L

Estimate p_{a,b}(t) directly

Commonly used values: L = 62 (BLOSUM62), L = 50 (BLOSUM50)
