Phylogenetic Analysis of Dihydrofolate Reductase


Background

Dihydrofolate Reductase Protein

Dihydrofolate reductase (DHFR) is an enzyme found in a wide range of organisms. Its primary function is to use NADPH as an electron donor to reduce dihydrofolic acid to tetrahydrofolic acid. Tetrahydrofolic acid is then used to synthesize purines, thymidylic acid, and certain amino acids.

It is composed of an eight-stranded beta sheet connected to, and surrounded by, four alpha helices. The active site contains a conserved Proline-Tryptophan sequence that is involved in binding the substrate and in the transfer of hydrogen during reduction.

The inhibition of dihydrofolate reductase has been associated with megaloblastic anemia, which results from folate deficiency. Inhibition of DHFR can also be exploited as an anticancer strategy, since depriving cells of tetrahydrofolic acid limits the nucleotide synthesis that rapidly growing cancer cells depend on.

Phylogenetic Analysis of Proteins

The phylogenetic analysis of proteins such as DHFR can help us understand how a protein has diverged between organisms, and how differences in its sequence relate to the nature and environment of each organism. It can also show which parts of the sequence are conserved between organisms. Conserved positions point to the residues most likely to be essential for the structure and function of the protein, as opposed to regions that tolerate variation.

Methods

Data Acquisition

The species of interest were selected from a list of model organisms provided by Gonzolo Benegas. Three species were chosen at random from each domain and kingdom, taking into account how different they are from one another. The NCBI GenBank accession IDs of the DHFR protein were found for each species. The package Entrez Direct was used to populate the FASTA files with the amino acid sequence of the DHFR protein for each species of interest.
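
The retrieval step can be sketched as follows. This is a minimal illustration, not the script used in this project: it assumes the Entrez Direct tool efetch is installed and on the PATH, and the function names and file names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_efetch_command(accession: str) -> list[str]:
    """Return the Entrez Direct command that fetches one protein record as FASTA."""
    return ["efetch", "-db", "protein", "-id", accession, "-format", "fasta"]


def fetch_fasta(accessions: list[str], out_path: str) -> None:
    """Write the FASTA records of all accessions into one output file."""
    with Path(out_path).open("w") as handle:
        for accession in accessions:
            # check=True raises CalledProcessError if efetch exits with an error
            result = subprocess.run(
                build_efetch_command(accession),
                capture_output=True,
                text=True,
                check=True,
            )
            handle.write(result.stdout)
```

Calling fetch_fasta with the list of accession IDs and an output name such as "dhfr.fasta" (a placeholder) would produce one multi-record FASTA file ready for alignment.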

Multiple Sequence Alignment

The software MUSCLE was used to perform the multiple sequence alignment of the amino acid sequences from all of the species. The method was published by Robert C. Edgar in 2004. This software was chosen because it achieves high accuracy and speed. The guide tree for the progressive alignment is first built from k-mer distances and then re-estimated from a Kimura distance matrix; by default both trees are clustered with UPGMA rather than neighbor-joining (Edgar).
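
To make the Kimura distance concrete, the sketch below computes it for two rows of an alignment using the standard correction d = -ln(1 - p - p^2/5), where p is the fraction of differing residues. The function name and the choice to skip gapped columns are assumptions made for illustration; MUSCLE's internal implementation may differ in such details.

```python
import math


def kimura_protein_distance(seq_a: str, seq_b: str) -> float:
    """Kimura-corrected distance between two aligned protein sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    # Compare only columns where neither sequence has a gap
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    # p is the observed fraction of differing residues
    p = sum(a != b for a, b in pairs) / len(pairs)
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0:
        # The correction is undefined for very divergent pairs
        return math.inf
    return -math.log(inner)
```

Identical sequences give a distance of zero, and the correction grows faster than p itself, which compensates for multiple substitutions at the same site.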

Sequence Logo Visualization

Sequence logos were generated to visualize the conservation at each position of the multiple sequence alignment. The logos were produced by uploading the aligned FASTA file from MUSCLE to the web application WebLogo. The web application makes it easy to generate sequence logos without installing any dependent packages. WebLogo was developed at UC Berkeley and builds on programs from the Delila package created by the Schneider lab (Crooks).

Phylogenetic Tree Construction

The package FastTree was used to generate an approximately maximum-likelihood phylogenetic tree from the multiple sequence alignment generated by MUSCLE. FastTree was chosen because it is reported to be more accurate than PhyML with default settings and than distance-matrix methods. For protein alignments it uses the JTT model by default; the Jukes-Cantor model applies only to nucleotide alignments. To visualize the tree from its Newick format, the package Bio.Phylo was used to draw the tree as a pyplot figure.
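
A minimal sketch of this step is given below. It assumes the executable is installed under the name FastTree (some distributions name it fasttree) and that Biopython and matplotlib are available for plotting; the function names, file names, and figure size are hypothetical.

```python
import subprocess


def build_fasttree_command(alignment_path: str) -> list[str]:
    """Command for a protein alignment; FastTree writes the Newick tree to stdout."""
    return ["FastTree", alignment_path]


def run_fasttree(alignment_path: str, tree_path: str) -> None:
    """Run FastTree and save its Newick output to tree_path."""
    with open(tree_path, "w") as handle:
        subprocess.run(
            build_fasttree_command(alignment_path), stdout=handle, check=True
        )


def plot_tree(tree_path: str, figure_path: str) -> None:
    """Draw a Newick tree with Bio.Phylo and save the figure to a file."""
    # Imported here so the functions above work without Biopython or matplotlib
    import matplotlib
    matplotlib.use("Agg")  # render to a file instead of opening a window
    import matplotlib.pyplot as plt
    from Bio import Phylo

    tree = Phylo.read(tree_path, "newick")
    fig, ax = plt.subplots(figsize=(8, 10))
    Phylo.draw(tree, axes=ax, do_show=False)
    fig.savefig(figure_path, bbox_inches="tight")
    plt.close(fig)
```

Running run_fasttree on the MUSCLE output and then plot_tree on the resulting Newick file would reproduce the two stages described above.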


Figures

A. Sequence Logo of Positions 1-50

B. Sequence Logo of Positions 51-100

C. Sequence Logo of Positions 101-150

D. Sequence Logo of Positions 151-200

E. Sequence Logo of Positions 201-250

F. Sequence Logo of Positions 251-300

G. Sequence Logo of Positions 301-350

H. Sequence Logo of Positions 351-400

I. Sequence Logo of Positions 401-450

J. Sequence Logo of Positions 451-500

K. Sequence Logo of Positions 501-555

L. Phylogenetic Tree


Discussion

Sequence Logos

Figures A-K are sequence logos of the multiple sequence alignment across all the organisms we chose to study. For each figure, the x-axis represents the position in the aligned DHFR amino acid sequence and the y-axis represents the information content at that position, in bits. The total height of a stack reflects how conserved the position is across all organisms, and the height of each letter within a stack reflects the relative frequency of that amino acid at that position.
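
The bit values on the y-axis can be illustrated with a short calculation. The sketch below uses the usual definition of information content for a protein column, log2(20) minus the Shannon entropy of the observed residues. It is a simplified illustration with hypothetical function names: it ignores gaps and omits the small-sample correction that WebLogo can apply, so its values will not match the figures exactly, especially for sparsely populated columns.

```python
import math
from collections import Counter


def column_information(column: str, alphabet_size: int = 20) -> float:
    """Information content in bits of one alignment column, ignoring gaps."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    # Shannon entropy of the observed residue frequencies
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def letter_heights(column: str, alphabet_size: int = 20) -> dict[str, float]:
    """Height in bits of each letter in the stack for one column."""
    residues = [r for r in column if r != "-"]
    info = column_information(column, alphabet_size)
    counts = Counter(residues)
    # Each letter takes a share of the stack proportional to its frequency
    return {aa: info * n / len(residues) for aa, n in counts.items()}
```

A fully conserved column reaches the maximum of log2(20), about 4.32 bits, while a column split evenly between two residues loses exactly one bit.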

We can see that most positions show up blank, meaning that no particular amino acid is likely to appear at that position. These blank positions arise because the MSA has to align DHFR sequences from very different organisms into one common coordinate system. Evidently, there is much variation in amino acid sequence between organisms, hence the many regions with gaps. Because the input consists of amino acid sequences, noncoding DNA such as promoters, other regulatory elements, or introns cannot appear in the alignment. The blank regions more likely correspond to insertions, deletions, or extra stretches of protein that are present in only some of the organisms.

Focusing on the regions where bits are present, we can see that some amino acid positions are more conserved than others. A larger letter means that more organisms have that amino acid at that position in their DHFR sequence. For example, the Proline-Tryptophan (P-W) sequence at positions 111-112 appears to be conserved across most of the model organisms. This is in line with what we expect, since the Proline-Tryptophan sequence is crucial for the active site of the protein (see Background).

Generally, however, there seems to be much variation in amino acid sequence across the model organisms that we have chosen. Different organisms have evolved independently of each other, so it is reasonable to expect some degree of variation across most organisms. Something else to consider, however, is that the amino acids that vary at a given position are often structurally similar to each other. For example, at position 120 the most commonly occurring amino acids are Aspartic Acid (D) and Glutamic Acid (E). Both have negatively charged side chains, meaning that they act in a functionally similar way. Where the substituted amino acids are not functionally similar, they are often at least similar in size.
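
The idea of a conservative substitution can be expressed as a simple check on an alignment column. The sketch below groups the twenty amino acids into coarse side-chain classes and tests whether a column stays within one class. The grouping is one common textbook convention chosen for illustration, and the names are hypothetical; other classifications would give somewhat different answers.

```python
# Coarse side-chain classes; one common textbook grouping, used as an assumption
RESIDUE_CLASSES = {
    "acidic": "DE",
    "basic": "KRH",
    "polar": "STNQ",
    "hydrophobic": "AVLIM",
    "aromatic": "FWY",
    "special": "CGP",
}

# Reverse lookup from one-letter code to class name
CLASS_OF = {aa: name for name, members in RESIDUE_CLASSES.items() for aa in members}


def is_conservative_column(column: str) -> bool:
    """True if every non-gap residue in the column belongs to one class."""
    classes = {CLASS_OF[r] for r in column.upper() if r in CLASS_OF}
    return len(classes) == 1
```

Under this grouping, a column containing only D and E counts as conservative, whereas a column mixing D with K does not.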

To generalize our findings, the amino acids may differ among the model organisms when looking at a sequence logo, but functionally and structurally the polypeptide sequence amounts to much the same protein, doing almost the same thing in all organisms. This raises the question of how we determine which variant of the polypeptide sequence is the earliest one. In other words, which sequence is most similar to a common ancestor of all of the model organisms? This is where the phylogenetic tree can help us out.

Phylogenetic Tree

Viruses have the most diverse DHFR sequences. This is shown by the fact that the three viruses chosen fall in three different clades: one with protists, another with vertebrates, and another with fungi.

Most prokaryotes sit on a separate branch of the tree, which is consistent with the evolutionary distance between prokaryotes and eukaryotes, since they belong to different domains. One prokaryote strays from this pattern: Pseudomonas fluorescens, which falls in a clade with the plants.

All eukaryotes fall under one branch of the phylogenetic tree. Not only do most kingdoms have their own clades, but most phyla do as well.

For example, the fungi, which belong to the phylum Ascomycota, have their own clade within the eukaryotes. Likewise, the vertebrates, which belong to the phylum Chordata, are in a separate clade from the invertebrates, which belong to the phylum Arthropoda.

The invertebrate Drosophila busckii and the protist Hartmannella cantabrigiensis stray from this pattern, since both share a clade with the vertebrates. Most of the invertebrates (two out of three), however, are in a different clade from the vertebrates. All of the animals still share one larger clade that contains the vertebrate and invertebrate subclades; the only non-animals in it are the protist Hartmannella cantabrigiensis and the virus Saimiriine gammaherpesvirus 2.

Even though plants and animals belong to different kingdoms, their DHFR sequences are more similar to each other than to those of the kingdom Fungi. All of the plants form a single clade, which also includes the prokaryote Pseudomonas fluorescens, as mentioned above.

Protists are on a separate branch from the other eukaryotes. This shows that even though protists are eukaryotes, their DHFR sequences differ more from those of the other eukaryotes than the latter differ from one another. Most protists (two out of three) share a clade with the virus Bacteriophage sp.

No outgroup was chosen for the phylogenetic tree, because all of the domains and kingdoms used are different from each other.


References