Sequence structure relationships / Protein folding

STBC2012: Computational Molecular Biology (Biologi Molekul Pengkomputeran)
Faculty of Science and Technology, Universiti Kebangsaan Malaysia


Welcome to STBC2012 - Computational Molecular Biology

Instructions

The content for the sections taught by course instructor Mohd Firdaus Raih is provided via this page.

You will have access to the videos of the lectures, practicals and other related course materials.

The practical sessions and comprehension short quizzes correspond to the lecture or discussion sessions preceding them.

To carry out the practicals, follow the written instructions in the left column. Some practicals have accompanying videos that can guide or demonstrate to you the work that needs to be carried out.

Explanations, discussions and/or solutions to the practicals are provided in the right column of the corresponding practical.

Contents
Part 0. Introductions   
Part 1. Protein folding - relationship of the sequence, structure and the protein/folding environment
Part 2. Searching databases and collecting datasets
Part 3. Sequence alignments: Global vs local alignments
Part 4. Sequence alignments: Searching a database for similar sequences using BLAST
Part 5. Searching with BLAST (...continued.)
Part 6. Multiple sequence alignments and general sequence alignment exercise
Part 7. Further analysis practice


Other course materials and references can be retrieved from the links in your UKMFolio accounts.  

The discussion and assessment materials will be delivered separately in live sessions.

 
Part 0:
Introductions

Lecture 1, part 1: Introduction and background.

   

Section objectives:
At the end of this section, you should be able to understand the use of the terms bioinformatics and computational biology and be prepared to integrate concepts associated with the central dogma of molecular biology into any computational analysis encountered at a later point in the course.

Course objective:
This section fulfills course learning objective 1:
Able to understand the basic principles of algorithms in bioinformatics software.

       
Lecture 1, part 2: A bit of history.


   

Section objectives:
At the end of this section, you should have acquired an appreciation of the origins of bioinformatics and the history of computational analyses in molecular biology.

This is supplementary material already covered by other instructors for STBC2023.

 

 

       
Lecture 1, part 3: The two most common types of data in bioinformatics/computational biology.

Explore the GenBank database - here.
Explore the Protein Data Bank database - here.


   

Section objectives:
At the end of this section, you should be able to identify the two most commonly used types of data in bioinformatics and associated resources.

This is supplementary material already covered by other instructors for STBC2023.

Course objective:
This section fulfills course learning objective 2:
Able to understand the diversity of molecular data stored in various biological databases.

Lecture 1, part 4: Computational thinking.

Test your comprehension with this short quiz.


   

Section objectives:
At the end of this section, you should have a basic understanding of how computational thinking can be employed to solve problems.

 

Course objective:
This section fulfills course learning objective 1:
Able to understand the basic principles of algorithms in bioinformatics software.

Lecture 1, part 5: What can the (sequence) data be used for?



Course objective:
This section fulfills course learning objectives 1:
Able to apply the fundamental principles of molecular biology for computational analysis and to organise the different data types used in molecular biology.
 

Part 1:
Protein folding - relationship of the sequence, structure and the protein/folding environment

Lecture 2, part 1: Evolution, sequence alignments and protein folding.



Lecture 2, part 2: The protein's amino acid sequence determines the folding and is dependent on the folding environment.



Supplementary video: "5 Challenges we could solve by designing new proteins" by David Baker.


For the following practical exercises, you can also opt to carry out the visualization directly at the NGL viewer site.
Enter the PDB ID codes provided on this page for each structure in the NGL viewer interface under the pull-down menu for File > Open > PDB [insert PDB ID]. The NGL viewer site provides the full spectrum of features that can be used with the NGL viewer.

   



Comprehension quiz:

Do this comprehension quiz to test and further expand your understanding of the protein folding problem.



Supplementary videos:

The protein folding problem: a major conundrum of science

Levinthal's paradox and the protein folding problem.


Viewer 1: Structure of Human Growth Hormone (PDB ID: 1hgu)



 


 

 

Human growth hormone is a water-soluble protein found in an aqueous environment.

As a result of this, the residues that are on the surface of the protein are hydrophilic. The residues at the core of the protein are hydrophobic.

Explore the structure at the PDB.

Download the sequence from the PDB and submit it to a hydrophobicity plotting tool, e.g. ProtScale.
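The idea behind a hydrophobicity plot can be sketched in a few lines of Python. This is a minimal sketch and not ProtScale's own implementation: it assumes the Kyte-Doolittle hydropathy scale and a sliding window of 9 residues (both are choices made for this illustration), and the function name hydropathy_profile is hypothetical. Each value returned is the mean hydropathy of one window, so higher values point to more hydrophobic stretches of the sequence.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return the mean hydropathy of every full window along the sequence.

    Assumption: the sequence contains only the 20 standard one-letter
    residue codes; any other character raises a KeyError.
    """
    sequence = sequence.upper()
    values = [KYTE_DOOLITTLE[residue] for residue in sequence]
    profile = []
    for start in range(len(values) - window + 1):
        segment = values[start:start + window]
        profile.append(sum(segment) / window)
    return profile


if __name__ == "__main__":
    # Illustrative fragment only: the first residues of Sample1 from Part 3.
    fragment = "MPPRPSSGELWGIHLMPPRILVECLLPNGMIVTLECLREATLITIKHELF"
    for position, score in enumerate(hydropathy_profile(fragment), start=1):
        print(position, round(score, 2))
```

Plotting the scores against the window position gives a profile comparable in spirit to the ProtScale output. Long runs of strongly positive values are what one would look for when comparing a membrane protein with a water-soluble one.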

Viewer 2: Structure of Bacteriorhodopsin (PDB ID: 2at9)



 

 

Bacteriorhodopsin is a protein that functions to pump protons across the membrane to the outside of the cell.

Unlike the water soluble human growth hormone structure, bacteriorhodopsin is located in the cell membrane (traverses the membrane). The environment it is located in is hydrophobic.

As a result of this, the residues that are on the surface of the protein are hydrophobic and interact with the hydrophobic environment of the lipid bilayer.

Explore the structure at the PDB. Observe how the protein is situated in the cell membrane.

Download the sequence from the PDB and submit it to a hydrophobicity plotting tool, e.g. ProtScale.
Compare the hydrophobicity profile for bacteriorhodopsin with the profile for human growth hormone.

 

 

Viewer 3: Structure of Porin (PDB ID: 2por)



 

 

Porin is a protein that functions as a pore through which molecules can traverse the membrane.

Unlike the water soluble human growth hormone structure, porin is located in the cell membrane (traverses the membrane). The environment it is located in is hydrophobic.

Unlike the bacteriorhodopsin structure that consists of mainly alpha helices (helix bundle), the porin structure is a beta barrel - beta strands form a layer/wall for the pore.

As a result of this, the residues of the beta strands that interact with the membrane are hydrophobic, while the residues that line the pore are generally hydrophilic.

Explore the structure at the PDB. Observe how the protein is situated in the cell membrane. What is the quaternary structure of porin?

Download the sequence from the PDB and submit it to a hydrophobicity plotting tool, e.g. ProtScale.

Compare the hydrophobicity profile for porin with the profiles for human growth hormone and bacteriorhodopsin.
Do both membrane proteins have similar hydrophobicity profiles?

Section objectives:
At the end of this section, you should be able to understand the basic principles of protein folding.

By understanding how a protein can spontaneously fold, you should be able to connect those principles to why proteins that share similar structures can share similar functions, and proteins that have similar structures can share similar sequences.

This forms the basis as to why searching for sequence similarity can be used to infer functional similarity between proteins.

You should also know the reason why a 30% sequence similarity cut-off point is often used to infer that two proteins are homologous.


Course objective:
This section fulfills course learning objective 4:
Capable of applying bioinformatics in the exploration of macromolecular sequences and structures.

 

Now that we know the information needed for a protein to fold is encoded in its sequence, it follows that proteins with similar sequences might share a similar fold/structure. Because they have a similar fold, they may also share similar functions.

In this next video, we will explore concepts of how we can investigate evolutionary relationships using sequence alignments.

Lecture 3, part 1: Sequence alignments and evolutionary relationships.



Lecture 3, part 2: Sequence alignments and evolutionary relationships - understanding the basics of a 'meaningful' alignment.
   

 

 

 

Part 2: Searching databases and compiling datasets

Lecture 4, part 1. Sequence alignments in practice - some BLAST basics.


Lecture 4, part 2. Sequence alignments in practice - continued - collecting orthologous sequences using BLAST.


Lecture 4, part 3. Sequence alignments in practice - continued - multiple sequence alignments.


Before we begin the practicals, view this video to understand the basics regarding the sequence data that we will be using [**Compulsory component**].




Practice: (i) Compiling a sequence dataset:
As a background to the next practical, please take some time to view this video.



Coronavirus disease 2019 (COVID-19) is caused by the SARS-CoV-2 virus. It has been reported that pets such as cats can also be infected by the virus. However, although the virus can infect cats, it does not seem to infect dogs, or if it does, it does not do it very well (see here).

What can you do to investigate this difference?
Can you come up with a hypothesis as to why cats are more susceptible to COVID-19 than dogs?

Think about the points discussed in the video.
How does SARS-CoV-2 enter the human cell?

Assuming that the SARS-CoV-2 virus enters the cells of all its potential hosts in the same way, differences in its ability to enter the cells of different hosts may be due to differences in how the spike protein interacts with the ACE2 receptor of each host.

The residues at which the spike protein interacts with ACE2 are also known (refer to the point from minute 4:18 of the video above). Hypothetically, if there is a mutation or difference in the ACE2 residues that interact with the spike protein, binding may be affected, i.e. it may become less optimal, improve, or remain unchanged.

One way to investigate this hypothesis is to first compile a dataset of ACE2 protein sequences from the different hosts. Once the ACE2 sequences from the different host organisms have been compiled, they can be analysed to see if they are different and whether those differences could affect the capacity of the spike protein to bind to ACE2. (Caveat)

Try to compile a dataset that will allow you to compare ACE2 orthologs in humans, orangutans, dogs, cats, canaries, cattle and chickens. These animals were selected because they can be found close to humans, such as in a zoo, as pets or as farm animals.

You can compile this dataset from the GenBank database. Although the GenBank database is usually searched using sequence alignments, it can also be searched using text queries.
Attempt to compile this dataset on your own by navigating the features and resources available at the NCBI's GenBank database.
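Text queries like the ones you type into the NCBI search box can also be sent programmatically. The sketch below only builds the query strings and request URLs for NCBI's E-utilities ESearch service and does not send anything over the network. The helper names, the choice of the [Gene Name] and [Organism] search fields, and the scientific names chosen for each host (for example Pongo abelii for orangutans) are assumptions made for this illustration; the results you obtain through the web interface remain the reference for this practical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumed scientific names for the hosts listed in the practical.
HOSTS = [
    "Homo sapiens",            # human
    "Pongo abelii",            # orangutan (one of several species)
    "Canis lupus familiaris",  # dog
    "Felis catus",             # cat
    "Serinus canaria",         # canary
    "Bos taurus",              # cattle
    "Gallus gallus",           # chicken
]


def build_protein_query(gene, organism):
    """Combine a gene name and an organism into one Entrez text query."""
    return f'{gene}[Gene Name] AND "{organism}"[Organism]'


def esearch_url(term, db="protein", retmax=20):
    """Return the ESearch URL for a text query (no request is made here)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    for host in HOSTS:
        print(esearch_url(build_protein_query("ACE2", host)))
```

Opening one of the printed URLs in a browser should return a list of record identifiers, which is a different route to the same records that the "Orthologs" option offers in the web interface.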


(See one possible solution to compiling this dataset in the right column "Discussion".)

You can explore how to use the various resources available at the NCBI site via this link.

The concept above has been used to study the diversity of ACE2 receptors in various organisms and the possibility of the different animals studied being infected by the SARS-CoV-2 virus - Read the article here.


Practice: (ii) Compiling a structure dataset:
The Protein Data Bank (PDB) is the central repository of biological macromolecular structure coordinates. You can use the PDB to explore various aspects of protein structures in order to better understand their structure and how the structure relates to a particular protein's function. Explore the features and the types of data that can be found in this database.

Let us consider a scenario where you would like to compile a dataset of protein structures from bacteria that are bound to DNA. You also want to ensure that the structures are of high resolution. In addition to those requirements, you want to select only structures that were deposited from 2019 to the present day. How can we go about compiling this dataset?

Let's open a PDB Advanced Search page. Can you figure out how to carry out the search so that you will be able to retrieve the data entries that fit your criteria above?

For the next exercise, we will use the Protein Data Bank in Europe (PDBe).

Find and open the Advanced search page of the PDBe database.

1. Your task is to find structures of bacterial proteins that may be binding to RNA, with the additional requirement that the structures must have been solved by X-ray diffraction.

-- Can you provide the number of structures that the search retrieved?

2. Your task is to find structures of proteins from the order Burkholderiales that may be binding to RNA.

-- Can you provide the number of structures that the search retrieved?

Compare the results of the two searches.
Compare the results of your own search to the solution provided in the Discussion section in the right column.

Searching methods such as this can be used to generate a dataset of proteins that can be used for specific analyses.



 

 

(i) Discussion
The concepts and theory for this practical are discussed in the Lecture on Databases and Data Structures.

More information about the ACE2 protein is available at Wikipedia and UniProt.

Once you have completed this section and understand more about the ACE2 protein, you can use the sequences you have collected as a dataset for a multiple sequence alignment.

Solution to compiling the ACE2 dataset:
You can navigate the GenBank search features to search for the ACE2 homologs.
1. Enter "Protein" as the database option and "ACE2" as the search term.
2. Once the results of your search are listed, you can choose to view the option "Orthologs".
3. This will then display a list from which you can select the species of interest.
4. Subsequent options include downloading the sequences as a dataset or submitting them directly for alignment using NCBI's COBALT multiple alignment program.


Caveat
Take note that differences in the interacting amino acid residues of the ACE2 orthologs, as seen in a multiple sequence alignment, cannot be used directly to conclude that the interactions are actually different. Further steps need to be carried out to characterize these interactions in more depth. However, this analysis can be a step in that direction. Read more about these interactions here.


(ii) Solutions to the PDBe searches (on the left)
1. You can generate a list of PDB entries that satisfy the following requirements:
- structures are from bacterial species only (Organism parameter);
- structures were solved using the X-ray diffraction method;
- structures contain RNA chains (Molecule type).
The list that would have been retrieved for the above parameters is here.

2. You can generate a list of PDB entries that satisfy the following requirements:
- structures are from the order Burkholderiales (Organism parameter);
- structures contain RNA chains (Molecule type).
The list that would have been retrieved for the above parameters is here.


Section objectives:

At the end of this section, you should be able to understand that databases can hold different types of data and that there are various ways to interrogate a database. You should also be able to understand the search systems of databases in order to interrogate them effectively using the querying system provided. Once you understand the types of data available, you should then be able to compile a specific dataset for your own use.


Course objective:
This section fulfills course learning objectives 3 and 4: Able to choose and use suitable bioinformatic software to solve relevant problems.
Capable of applying bioinformatics in the exploration of macromolecular sequences and structures. 




Test your understanding:
Once you feel that you are ready and comfortable with searching and compiling datasets from a database, please carry out the graded assignment that will be provided via an email link or via UKMFolio.
***The assessment is a graded exercise and will contribute to your final grade.***

 

Part 3:
Sequence alignments: Global vs local alignments
Using the sequences provided below (Sample1 and Sample2), carry out alignments using:
1. a local alignment tool (L-align is a local alignment tool that allows for pair-wise alignments)
2. a global alignment tool (ClustalW is a global alignment program)

Revise Lecture 3 to refresh your understanding of global vs local alignment.

What are the differences that you are able to observe between the alignments produced by the two methods?

>Sample1
MPPRPSSGELWGIHLMPPRILVECLLPNGMIVTLECLREATLITIKHELFKEARKYPLHQ
LLQDESSYIFVSVTQEAEREEFFDETRRLCDLRLFQPFLKVIEPVGNREEKILNREIGFA
IGMPVCEFDMVKDPEVQDFRRNILNVCKEAVDLRDLNSPHSRAMYVYPPNVESSPELPKH
IYNKLDKGQIIVVIWVIVSPNNDKQKYTLKINHDCVPEQVIAEAIRKKTRSMLLSSEQLK
LCVLEYQGKYILKVCGCDEYFLEKYPLSQYKYIRSCIMLGRMPNLMLMAKESLYSQLPMD
CFTMPSYSRRISTATPYMNGETSTKSLWVINSALRIKILCATYVNVNIRDIDKIYVRTGI
YHGGEPLCDNVNTQRVPCSNPRWNEWLNYDIYIPDLPRAARLCLSICSVKGRKGAKEEHC
PLAWGNINLFDYTDTLVSGKMALNLWPVPHGLEDLLNPIGVTGSNPNKETPCLELEFDWF
SSVVKFPDMSVIEEHANWSVSREAGFSYSHAGLSNRLARDNELRENDKEQLKAISTRDPL
SEITEQEKDFLWSHRHYCVTIPEILPKLLLSVKWNSRDEVAQMYCLVKDWPPIKPEQAME
LLDCNYPDPMVRGFAVRCLEKYLTDDKLSQYLIQLVQVLKYEQYLDNLLVRFLLKKALTN
QRIGHFFFWHLKSEMHNKTVSQRFGLLLESYCRACGMYLKHLNRQVEAMEKLINLTDILK
QEKKDETQKVQMKFLVEQMRRPDFMDALQGFLSPLNPAHQLGNLRLEECRIMSSAKRPLW
LNWENPDIMSELLFQNNEIIFKNGDDLRQDMLTLQIIRIMENIWQNQGLDLRMLPYGCLS
IGDCVGLIEVVRNSHTIMQIQCKGGLKGALQFNSHTLHQWLKDKNKGEIYDAAIDLFTRS
CAGYCVATFILGIGDRHNSNIMVKDDGQLFHIDFGHFLDHKKKKFGYKRERVPFVLTQDF
LIVISKGAQECTKTREFERFQEMCYKAYLAIRQHANLFINLFSMMLGSGMPELQSFDDIA
YIRKTLALDKTEQEALEYFMKQMNDAHHGGWTTKMDWIFHTIKQHALN

>Sample2
MLASLRERLESGPVPSAFRFAGRGRGCSPHLSASPMFSGTGAGTATEHGAAAAAARPAAA
AVGAVPAAAYGACASYSGCASHDGGDKRDFYVLENMDLLDDADEKTGTAASASRKDNDDG
KMKTGAIYGNQSIDQSIAEALMGIGGRPALQVMRVSGLDAALTPSSGPSDSTEEEQEEAS
PAESKKRQRHRCEELDIDEDYLGKIDVMTEEVSSVVQKILAAAASVAPPTPPLTPASAMS
SPADSFHGSTRAVFFGSGYNEKDSGVIFESVNADSPLLPQVVPRSPGVVDTPPPRFPSPR
VYLTADGLNMDLCILDGVNSDGGFGDVVFMKDKTTKESFAFKKSKDTPAGRRQMEREVAA
LSDVRGKVSHVVHVQEISSTSGAPMLLMRREPGTLSSMLCNCLNTDPRALVRVLCGVLTA
VGGLHDMQYVHRDIKPNNILVDGHGDARLTDFGMCLKMDEARLRENIGDGTPCYSAPETS
DPRGGGATVVSEIYAVSVVLVEGMLNKCAFGEEELLEAAEDRLQVLEVAREDLLEKIEDE
IQDDEDEAELEEVRIKEHYLGVVAEDGWVPNEEARAKLDQLGRPLSFFMAQSDERGDDSL
SFEEKGTAARREAQAHPRRLFEHNWVEWLTTSSRAASDEGLRQVIAFVLDRGMIDPRPEA
RPQTVEAAKELLRLALAIYEVEAASAEGTEQDDA


 

 



Discussion
After carrying out both a local and a global alignment, you will find that the local alignment reveals a short aligned region that cannot be observed in the global alignment.

The residues D and N in the sequence DXXXXN are functionally important. The DFG motif is also common to kinases. This serves to demonstrate how local alignments can detect local similarities although the general sequence similarity may be low. These local similarities may be of functional relevance, i.e. they may have a specific conserved role such as binding sites or active sites, while the general function of the proteins may differ. In this particular example, both proteins are kinases that function in different pathways.
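The difference between the two approaches can be illustrated with a small scoring sketch. The code below is not what L-align or ClustalW run internally: it assumes simple match/mismatch scores and a linear gap penalty (real tools use substitution matrices and more elaborate gap penalties), and the function names are hypothetical. global_score follows the Needleman-Wunsch recurrence, where both sequences must be aligned end to end, while local_score follows the Smith-Waterman recurrence, where a cell is never allowed to drop below zero so that the best-matching sub-region is reported.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    # First row: aligning an empty prefix of a against b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score: best-scoring pair of sub-regions."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 lets an alignment start afresh anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    # Two made-up sequences that share only a short DFG motif.
    seq_a = "WWWDFGWWW"
    seq_b = "PPDFGPP"
    print("global:", global_score(seq_a, seq_b))
    print("local: ", local_score(seq_a, seq_b))
```

With the two made-up sequences, the shared motif dominates the local score, whereas the global score is dragged down by the unrelated flanking residues. This mirrors what you observe with Sample1 and Sample2.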

Learn more about Sample1 and Sample2.


Section objectives:
At the end of this section, you should be able to understand the difference between how global and local alignment algorithms carry out the alignment and when the different options can be used.


Course objective:
This section fulfills course learning objectives 3 and 4: Able to choose and use suitable bioinformatic software to solve relevant problems.
Capable of applying bioinformatics in the exploration of macromolecular sequences and structures. 


 

Part 4:
Sequence alignments: Searching a database for similar sequences using BLAST
Using the sequences provided previously (Sample1 and Sample2), carry out a BLAST search.

Notice that there are several options to run a BLAST search.
Given the type of sequences provided for Sample1 and Sample2, what are the possible BLAST programs that you can use to search the database?


Go through the different BLAST programs and the types of databases that they search.

Can you think of scenarios when you would use each of the different BLAST programs?



What does the output of the BLAST searches mean?

Here is a short video that explains the Expect (E) value of a BLAST search.


Now that you have understood a bit more about the BLAST search, let's try to carry out some searches.

Practice:
Open this FASTA file and search the appropriate database using the appropriate search program.

Your BLAST results for that sequence will be rather obvious. It is clear from the E values and the sequence similarity what the function of the query sequence is.



 


Discussion
Your BLAST searches should retrieve sequences that have been deposited in the GenBank database that can be divided into the two families as above - Sample1 and Sample2.

Both the sequences provided are amino acid sequences. Therefore, the programs that can be used are blastp and tblastn. Although these two programs search using a protein sequence query, they carry out searches in different databases; blastp carries out a search in a protein sequence database while tblastn searches in a translated nucleotide database.
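The choice of program can be summarised as a small lookup table. The sketch below is only an illustration of that decision: the table reflects the standard query and database types of the five classic BLAST programs, while the function names and the simple alphabet-based guess of the query type are assumptions made for this example (a short peptide made up only of A, C, G, T or N would be misclassified).

```python
# program: (type of query, what the query is compared against)
BLAST_PROGRAMS = {
    "blastp": ("protein", "protein database"),
    "tblastn": ("protein", "translated nucleotide database"),
    "blastn": ("nucleotide", "nucleotide database"),
    "blastx": ("nucleotide", "protein database (query is translated)"),
    "tblastx": ("nucleotide", "translated nucleotide database (query is translated)"),
}


def guess_query_type(sequence):
    """Guess whether a sequence is nucleotide or protein from its letters.

    Heuristic only: anything made up purely of A, C, G, T, U or N is
    treated as a nucleotide sequence.
    """
    letters = set(sequence.upper())
    return "nucleotide" if letters <= set("ACGTUN") else "protein"


def programs_for(sequence):
    """List the BLAST programs that accept this kind of query."""
    query_type = guess_query_type(sequence)
    return sorted(
        name for name, (accepted, _) in BLAST_PROGRAMS.items()
        if accepted == query_type
    )


if __name__ == "__main__":
    sample1_start = "MPPRPSSGELWGIHLMPPRILVECLLPNGMIVTLECLREATLITIKHELF"
    for program in programs_for(sample1_start):
        print(program, "->", BLAST_PROGRAMS[program][1])
```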

Visit the NCBI BLAST documentation page for more information and answers to frequently asked questions.

Video tutorials of BLAST by NCBI can be viewed here .

Your BLAST search should have returned alignments to sequences of coronavirus spike proteins. It is clear from the alignment and sequence identity that the query sequence is actually the SARS-CoV-2 spike protein.

Section objectives:
At the end of this section, you should understand the different BLAST programs available and have a basic understanding of how you can run a BLAST search to align query sequences to similar sequences in a database.

At this point, you should also have a basic understanding of how to interpret the output of a BLAST search as provided by the NCBI BLAST interface.


Course objective:
This section fulfills course learning objectives 3 and 4:
Able to choose and use suitable bioinformatic software to solve relevant problems.
Capable of applying bioinformatics in the exploration of macromolecular sequences and structures.
 



 

Part 5:
Searching with BLAST (...continued.)
Think about how you carried out the previous two searches using BLAST. Did you merely paste the sequences and press the SUBMIT button?

Notice that there are also other features to the search interface such as the option to search different databases and to restrict the search to sequences from specific organisms or taxonomic levels.

Practice:

(i) Carry out a search with the same sequences, but choose a different database option - PDB - and restrict the search to organisms from the order Burkholderiales. How are the results of these searches different from your earlier searches?

(ii) Repeat the search using the same queries, but choose a different database option - RefSeq - and restrict the search to organisms from the order Burkholderiales. How are the results of these searches different from your earlier searches?


Here is a short video that continues the discussion on the Expect (E) value of a BLAST search.

(iii) Now, try carrying out a similar BLAST search using this FASTA file. Using your understanding of the E value, identify the range of hits that are considered to have a high chance of being random alignments. Notice how the E values in these results differ from those of the previous BLAST search in Part 4.

 

Discussion
(i) You will have noticed that the search retrieved no results or very few hits. Because the search was against the PDB and restricted to Burkholderiales, this simply means that there are no, or only a very limited number of, similar sequences for structures from species of the order Burkholderiales available in the PDB.

(ii) Changing the database to RefSeq retrieved a considerably larger number of hits. This is mainly because the entries in the PDB are almost entirely experimentally solved structures. Acquiring structure coordinate data is more difficult and time consuming than sequencing, which restricts the number of structures available.

(iii) The BLAST search retrieved hits that include alignments with E values higher than 1 (see from minute 1:17 of the video on the left for an explanation). For some of these hits, the sequence similarity is nevertheless higher than 30%. These are examples of alignments that are likely to have occurred by chance, despite the apparently high similarity.
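The link between the score of an alignment, the size of the search space and the E value can be sketched numerically. The example below assumes the commonly quoted relationship E = m × n × 2^(-S'), where S' is the bit score and m and n are the effective lengths of the query and the database. The function names, the hit list and the cut-off of 1 (taken from the discussion above) are illustrative assumptions and not values reported by NCBI BLAST for this exercise.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this bit score.

    Assumes E = m * n * 2**(-S'), with m and n the effective lengths
    of the query and of the database.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def split_hits(hits, threshold=1.0):
    """Separate (name, e_value) pairs into likely real and likely random."""
    likely_real = [hit for hit in hits if hit[1] <= threshold]
    likely_random = [hit for hit in hits if hit[1] > threshold]
    return likely_real, likely_random


if __name__ == "__main__":
    # Hypothetical hits, for illustration only.
    hits = [("hit_A", 3e-50), ("hit_B", 0.002), ("hit_C", 1.7), ("hit_D", 8.4)]
    likely_real, likely_random = split_hits(hits)
    print("keep for further analysis:", likely_real)
    print("probably chance alignments:", likely_random)

    # The same bit score is less convincing in a larger database.
    for db_len in (1e6, 1e9):
        print(db_len, expect_value(40, 300, db_len))
```

The last loop shows why the same alignment can look less significant when the search is run against a larger database: the E value scales with the size of the search space.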


Section objectives:
At the end of this section, you should have a deeper understanding of how to interpret a BLAST search result. This section requires you to be able to differentiate results that are significant and can be selected for further analysis from those that are not relevant for further scrutiny.

At this point, you should have a deeper understanding of features of the BLAST search output as provided by the NCBI BLAST interface.


Course objective:
This section fulfills course learning objectives 3 and 4:
Able to choose and use suitable bioinformatic software to solve relevant problems.
Capable of applying bioinformatics in the exploration of macromolecular sequences and structures. 



 

Part 6:
Multiple sequence alignments and general sequence alignment exercise

Before carrying out the multiple sequence alignment exercise, view the video below in order to understand the process and practices associated with aligning multiple sequences.



Now that you're up to speed on multiple sequence alignments, let us proceed with a practical session.

Download the set of sequences needed for this exercise here. Carry out a multiple sequence alignment of the sequences provided. One such tool for multiple sequence alignments is Clustal Omega. You can also use other tools as mentioned in the video above.

You can also download the Jalview tool to view the alignments on your own computers.

Do you think the sequences in the set are similar enough to be aligned together? If not, repeat your alignment by removing any unrelated sequences.

Once you think you have made an optimal alignment with a set of related/homologous sequences; identify the residues that are conserved in all the sequences.

1. How many conserved residue positions can you find?

2. What are the residues involved?

3. Are you able to figure out what the functions of these residues may be?

4. Can you assign a function to the sequences "MysterySeq" and "MysterySeq2"?

5. Are "MysterySeq" and "MysterySeq2" homologs of each other? Are these two proteins homologous to the others in the final refined alignment that you have made?

6. In Part 2, you had downloaded a dataset of ACE2 orthologs. Use this set of orthologs in a multiple sequence alignment. From other sources such as literature or videos, can you identify the residues in ACE2 that are involved in binding to the SARS-CoV-2 spike protein? Are these residues conserved in the other orthologs that you have aligned?
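Questions 1 and 2 above can be checked with a short script once you have an alignment. The sketch below assumes the aligned sequences are available as plain strings of equal length with "-" marking gaps (for example, copied out of the Clustal Omega output). The function name and the toy alignment are hypothetical, and only columns where every sequence has the same residue are reported.

```python
def conserved_columns(aligned_sequences):
    """Return (position, residue) for every fully conserved column.

    Positions are 1-based alignment columns. Columns that contain a
    gap in any sequence are never reported as conserved.
    """
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must all have the same length")

    conserved = []
    for position in range(length):
        column = {seq[position] for seq in aligned_sequences}
        if len(column) == 1 and "-" not in column:
            conserved.append((position + 1, column.pop()))
    return conserved


if __name__ == "__main__":
    # Toy alignment, for illustration only.
    alignment = [
        "MK-DFG",
        "MKADFG",
        "MRADYG",
    ]
    for position, residue in conserved_columns(alignment):
        print(position, residue)
```

The positions are alignment columns, not residue numbers in any single sequence, so they need to be mapped back to the individual sequences before comparing them with residue numbers quoted in the literature.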

 



Further reading:
To gain a better understanding of the functional insights that you can get from the alignments, you can read about some aspects of these protein sequences here.

The research article provided will help you answer the questions provided (1, 2, 3 on the left).


Section objectives:
At the end of this section, you should be able to carry out a multiple sequence alignment (MSA) using a set of sequences. You should be able to optimize a sequence alignment by removing unrelated sequences, based on your understanding that MSAs are carried out using a global alignment.

At this point, you should also have a basic understanding of the information that can be extracted from an MSA. For example, you should be able to determine which amino acid residues are highly conserved and thus may be crucial for the protein's function. You should also be able to determine which regions are highly variable and thus more tolerant of mutations.

Additionally, you should also be able to look for information related to the sequences from any other source that may be outside of the alignments and present in other databases or resources.


Course objective:
This section fulfills course learning objectives 3 and 4: Able to choose and use suitable bioinformatic software to solve relevant problems.
Capable of applying bioinformatics in the exploration of macromolecular sequences and structures.




Test your understanding:
Once you feel that you are ready and comfortable with searching and compiling datasets from a database, please carry out the graded assignment that will be provided via an email link or via UKMFolio.
***The assessment is a graded exercise and will contribute to your final grade.***


 
Part 7:
Further analysis practice
Using the sequence provided below:

>alien_or_predator
ATGGCATCCACACACCAATCATCCACAGAACCCTCTTCCACAGGTAAATCTGAGGAAAC
GAAGAAAGATGCTTCGCAAGGGAGCGGGCAAGACTCCAAGAACGTAACCGTTACCAAAG
GTACCGGTTCCTCCGCCACCTCAGCTGCCATTGTCAAGACAGGAGGATCCCAAGGCAAA
GATTCCTCTACTACAGCGGGCTCTTCTAGTACTCAGGGACAGAAGTTCAGTACTACACC
TACCGACCCGAAAACTTTCAGCTCTGACCAAAAGGAGAAATCCAAAAGCCCAGCCAAAG
AAGTCCCGTCTGGTGGCGATAGTAAGTCCCAAGGTGACACCAAGTCTCAAAGCGACGCC
AAATCTTCTGGACAAAGTCAGGGCCAGTCTAAAGACAGCGGCAAATCATCTTCCGACAG
TAGCAAGAGTCACTCTGTCATCGGAGCTGTCAAAGACGTCGTTGCAGGCGCCAAAGATG
TCGCAGGAAAAGCCGTCGAGGATGCTCCTAGCATCATGCATACTGCAGTCGATGCTGTG
AAGAACGCAGCCACGACTGTGAAGGATGTGGCATCGTCGGCTGCATCGACTGTGGCGGA
GAAGGTAGTCGATGCTTACCACAGTGTGGTGGGAGACAAGACGGACGACAAGAAAGAGG
GCGAGCACAGCGGCGACAAGAAGGACGACTCCAAAGCTGGAAGTGGCTCTGGACAAGGT
GGTGACAACAAGAAGTCTGAAGGAGAGACTTCTGGCCAAGCAGAATCCAGCTCTGGCAA
CGAAGGAGCTGCTCCAGCCAAAGGCCGTGGTCGTGGACGGCCTCCAGCAGCTGCTAAAG
GAGTTGCTAAGGGTGCTGCAAAGGGCGCTGCCGCCTCCAAAGGAGCCAAGAGCGGTGCT
GAATCCTCCAAGGGAGGAGAACAGTCGTCAGGAGATATCGAGATGGCAGATGCTTCCTC
CAAGGGAGGCTCGGACCAGAGGGATTCCGCGGCGACCGTTGGCGAAGGTGGTGCATCAG
GCAGTGAGGGTGGAGCTAAGAAAGGCAGAGGGCGGGGCGCTGGTAAGAAAGCGGATGCG
GGTGATACGTCCGCTGAGCCGCCTCGGCGGTCGTCCCGCCTGACGTCTTCAGGTACAGG
GGCGGGTTCCGCTCCAGCTGCAGCGAAAGGCGGAGCGAAGCGTGCTGCTTCTTCCTCCA
GTACACCTTCCAACGCTAAGAAGCAAGCGACTGGAGGTGCTGGCAAAGCTGCTGCCACC
AAAGCAACTGCTGCCAAATCGGCAGCCTCTAAAGCTCCCCAGAATGGCGCAGGTGCCAA
GAAGAAGGGAGGAAAGGCTGGAGGACGGAAGAGGAAGTAA


Carry out a BLAST search to:
1. identify other similar DNA sequences;
2. identify other homologs;
3. identify protein structures with a similar sequence.

What are your observations about the results retrieved and what can you summarize about the protein sequence provided?

Optional but highly recommended:
View this video about how you can best make use of the BLAST search engine and its associated resources.





 


Guide:
Consider the input sequence. In this case, the sequence provided is a nucleotide/DNA sequence.

What are the programs available that can be used?

For DNA sequence queries, blastn, blastx and tblastx can be used.

Next, take into account the objective of each query.
The following searches can therefore be carried out:
1. blastn, to identify other similar DNA sequences.
2. blastx, to identify other homologs. blastx translates the DNA query into the encoded amino acid sequences and uses those to search a protein database. Because protein sequences are more conserved than DNA sequences, this allows for better detection of homologs than a DNA-level search.
3. blastx, but specifically against the PDB database, to identify protein structures with a similar sequence.
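The translation step that makes a blastx search possible can be sketched as follows. This is a simplified illustration and not how BLAST itself is implemented: it translates only the first forward reading frame using the standard genetic code, whereas blastx considers all six reading frames. The function name translate is hypothetical, and "*" marks a stop codon.

```python
from itertools import product

# Standard genetic code, with codons ordered by the bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate a DNA sequence in one forward frame (0, 1 or 2).

    Incomplete codons at the end are ignored; codons containing
    characters other than A, C, G, T are written as 'X'.
    """
    dna = dna.upper()
    protein = []
    for start in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[start:start + 3], "X"))
    return "".join(protein)


if __name__ == "__main__":
    # First 30 bases of the alien_or_predator sequence above.
    print(translate("ATGGCATCCACACACCAATCATCCACAGAA"))
```

Translating the full sequence in this way gives the amino acid sequence that a protein-level search compares against the database, which is why such a search can pick up homologs whose DNA sequences have diverged.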

You can read further about the protein here.


Section objectives:
At the end of this section, you should be familiar with the different BLAST programs and the situations that call for particular programs and databases.

This section requires you to be confident in interpreting the significance of your results and in drawing conclusive statements from your observations.



Course objective:

This section fulfills course learning objectives 3 and 4: Able to choose and use suitable bioinformatic software to solve relevant problems.
Capable of applying bioinformatics in the exploration of macromolecular sequences and structures. 




Test your understanding:
Once you feel that you are ready and comfortable with searching and compiling datasets from a database, please carry out the graded assignment that will be provided via an email link or via UKMFolio.
***The assessment is a graded exercise and will contribute to your final grade.***