Databases and Sequence Alignments with BioEdit or Jalview

Nonetheless, databases of gene sequences do exist. We shall access GenBank by going through the Entrez gateway at NCBI.

A link to Entrez is on our main course page. Here's another.

We're going to search for, and compare, the genes for elastase (which we shall discuss later) in humans and pigs.

Select Nucleotide on the main Entrez page. The page pictured below will appear.

GenBank Search Page

Until one is well-practiced, searching is made easier by using the Index.

The Query Box now looks like this:

homo sapiens[Organism] AND elastase[Text Word] NOT inhibitor[Text Word]

Click Go

The result will be a page looking something like this:

The gene you want will be about # 10 (damned if I know why other things are ranked higher):

10: NM_001971

Homo sapiens elastase 1, pancreatic (ELA1), mRNA
gi|22507386|ref|NM_001971.3|[22507386]

Choose FASTA from the Display drop-down menu, click Send to file, and save the gene sequence with the extension .fas.

Repeat the process using sus scrofa as the organism.
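
The query typed into the Query Box can also be assembled in a few lines of Python. What follows is only a sketch: it builds the same query string for either organism and wraps it in a URL for NCBI's E-utilities esearch service. The function names, the choice of "nucleotide" as the database name, and the retmax value are assumptions made for illustration; nothing is actually sent to NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities "esearch" service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(organism, include_word, exclude_word):
    """Return a query string in the same form as the Query Box above."""
    return (
        f"{organism}[Organism] AND {include_word}[Text Word] "
        f"NOT {exclude_word}[Text Word]"
    )


def build_search_url(query, database="nucleotide", max_results=20):
    """Return an esearch URL for the query. Nothing is fetched here."""
    params = {"db": database, "term": query, "retmax": max_results}
    return ESEARCH_URL + "?" + urlencode(params)


human_query = build_query("homo sapiens", "elastase", "inhibitor")
pig_query = build_query("sus scrofa", "elastase", "inhibitor")

print(human_query)
print(build_search_url(pig_query))
```

Building the query from its parts makes it harder to mistype a field tag when the search is repeated for the second organism.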

Follow the steps in the BioEdit tutorial to align the sequences. You should discover that the two genes are about 80% identical.
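
To see where a figure like "80% identical" comes from, here is a minimal Python sketch that counts identical columns in a finished pairwise alignment. It assumes gaps are written as "-" and that the percentage is taken over all aligned columns (programs differ on the denominator, so BioEdit's number may not match exactly). The two short sequences are made up for illustration; they are not the elastase genes.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns in which both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap and y == gap:
            continue  # column is empty in both sequences, so skip it
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Two made-up aligned sequences, ten columns long.
seq_1 = "ACGT-ACGTA"
seq_2 = "ACGTTACCTA"
print(percent_identity(seq_1, seq_2))  # 8 of 10 columns agree: 80.0
```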

One must be very careful in interpreting the results of sequence alignments like this. Similar sequences may reflect an evolutionary relationship, or they may not. Conversely, sequences that look quite different may still be evolutionarily related. Collect as much evidence as possible!

In general, if one is seeking evolutionary relationships, aligning protein sequences is more useful than aligning genes:

BioEdit is an alignment editor with many additional capabilities. It is available free from the Department of Molecular Biology at North Carolina State University. Jalview (written in Java) is an alignment editor that can also retrieve and display protein structure files. It is available from the Barton Group at the University of Dundee in Scotland.

Both will work with sequence alignments in many forms, including GenBank and the cross-platform FASTA (FAST-All) format. In these examples, we will use the FASTA format. Here is an example of that format:

>gi|112927|sp|P05067|A4_HUMAN Amyloid beta A4 protein precursor (APP) (ABPP)
MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLNMHMNVQNGKWDSDPSGTKTCIDTKEGIL
QYCQEVYPELQITNVVEANQPVTIQNWCKRGRKQCKTHPHFVIPYRCLVGEFVSDALLVPDKCKFLHQER
MDVCETHLHWHTVAKETCSEKSTNLHDYGMLLPCGIDKFRGVEFVCCPLAEESDNVDSADAEEDDSDVWW

The above is a portion of the sequence of a protein. The first line must always begin with > and is a comment or identification line. This is followed by the sequence, using the one-letter codes for bases or amino acids. The length of the lines is irrelevant.
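
The rules just described are simple enough to capture in a short Python sketch. The function name parse_fasta is my own, and the example uses only the first 20 residues of the sequence shown above, deliberately broken across two short lines to show that line length does not matter.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA-formatted text."""
    records = []
    title = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                records.append((title, "".join(chunks)))
            title = line[1:].strip()
            chunks = []
        elif title is None:
            raise ValueError("FASTA text must begin with a '>' title line")
        else:
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


example = """>gi|112927|sp|P05067|A4_HUMAN Amyloid beta A4 protein precursor (APP) (ABPP)
MLPGLALLLL
AAWTARALEV
"""
for title, sequence in parse_fasta(example):
    print(title)
    print(sequence, len(sequence))
```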

BioEdit by itself is capable of pairwise alignment only. However, the program ClustalW comes with it, is installed in a subdirectory, and can be run from within BioEdit to create multiple alignments. Consult the Help file to learn how to configure this option. JalView does multiple alignments by accessing the ClustalW web site directly.

To do a pairwise alignment in BioEdit:

To make a nicely colored version of the alignment, select File/Graphic View

Here's what the Graphic View window will look like when you're done:

BioEdit will also translate DNA or RNA to protein, find cDNA or the reverse complement, do a nice analysis with bar graph of the percentages of the four bases, and otherwise characterize your sequence.

For example, with your human elastase gene loaded and selected, choose Sequence/Nucleic Acid/Translate/Frame 1. You will get a new window containing the translation. Try this when you have a sequence displayed in Graphic View also.
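
For the curious, the calculations behind these menu items can be sketched in Python. This is not BioEdit's code, only an illustration under stated assumptions: the standard genetic code, stop codons written as "*", any codon containing an unrecognized letter written as "X", and a short made-up sequence rather than the elastase gene.

```python
from collections import Counter

# Standard genetic code, listed in the conventional T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: _AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(_BASES)
    for j, b2 in enumerate(_BASES)
    for k, b3 in enumerate(_BASES)
}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def base_percentages(dna):
    """Percentage of A, C, G and T among the recognized bases."""
    counts = Counter(dna.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: 100.0 * counts[base] / total for base in "ACGT"}


def translate_frame1(sequence):
    """Translate DNA or RNA starting at the first base (frame 1)."""
    dna = sequence.upper().replace("U", "T")
    protein = []
    for start in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[start:start + 3], "X"))
    return "".join(protein)


demo = "ATGGCCTAA"  # made-up sequence: Met, Ala, stop
print(translate_frame1(demo))
print(reverse_complement(demo))
print(base_percentages(demo))
```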

When downloading Jalview, use the "Install Anywhere" button on the download page. Choose the version for your operating system; if you're not sure whether you have a Java Virtual Machine already installed, select the version for your OS that includes one. Then select Run from the window that appears. Ask the installer to put an icon on your desktop. Start the program by clicking the icon.

To do an alignment in Jalview, choose File/Input Alignment/from File. Navigate to your sequence file and click OK. The result will look like this:

In the new window that has appeared, choose File/Add Sequences/from File. Navigate to your next sequence and click OK. (Any number of sequences can be added in this way.)

Select both sequences by holding down Ctrl and clicking on their titles. Then choose Calculate/Pairwise Alignments. The result looks like this:

Click the "View in Alignment Editor" button at the bottom of this window. You can now choose to color your alignment by any of a variety of schemes. I chose the Clustalx scheme, which assigns each amino acid a color. I then chose Wrap from the Format menu to produce:

You can save your alignment as a picture by choosing "Export Image" from the File menu; this gives you the choice of a web page (html), a PostScript file (eps) or a Word-compatible picture (png).

Multiple sequence alignments are done by reading in all of the files of interest and choosing Web Service; select your preferred server from the menu.

Here are a number of web servers that will carry out alignments on sequences that you submit:

ClustalW
MAFFT
KAlign
Muscle
STRAP (also does structure alignments)

Easiest is to have your multiple sequences combined into a single file: simply use NotePad or any text editor. Paste one FASTA file after another into a single FASTA file (include the title lines), and use the upload feature. Alternatively, you can just copy and paste, one after the other, into the alignment window.
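
If you would rather not paste by hand, the same combining step can be sketched in Python. The function name and the file names below are hypothetical, and the two tiny sequences are placeholders written to a temporary folder so that the example runs on its own.

```python
import tempfile
from pathlib import Path


def combine_fasta(input_paths, output_path):
    """Write the given FASTA files, one after another, into a single file."""
    parts = []
    for path in input_paths:
        text = Path(path).read_text().strip()
        if not text.startswith(">"):
            raise ValueError(f"{path} does not start with a '>' title line")
        parts.append(text)  # the title line is kept with its sequence
    Path(output_path).write_text("\n".join(parts) + "\n")
    return len(parts)


# Demonstration with two placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    human = Path(folder) / "human.fas"
    pig = Path(folder) / "pig.fas"
    human.write_text(">human (placeholder sequence)\nATGACC\n")
    pig.write_text(">pig (placeholder sequence)\nATGACT\n")
    combined = Path(folder) / "combined.fas"
    count = combine_fasta([human, pig], combined)
    combined_text = combined.read_text()

print(count)
print(combined_text)
```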

Alignments produced by any of these servers can be read into BioEdit or Jalview and formatted to suit.

Similarity Matrices

Any alignment requires some method of judging the quality of fit between the two or more sequences. Any scoring scheme must account for gaps (which we saw in our alignment of the two elastase genes), substitutions, and insertions and deletions.

For nucleic acids, the scoring scheme can be relatively simple; since only four characters occur in the sequences, all mismatches are generally scored alike. BioEdit uses the following values:

Variation            Score
Match                  2
Mismatch              -1
Gap Initiation        -3
Extending Gap by 1    -1

Jalview scores -12 for opening a gap and -2 for each gap extension.

Such a scoring system allows us to compare several possible alignments and choose the "best" - the one with the highest score.
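
As a rough illustration of how such a comparison works, the Python sketch below scores two candidate alignments of the same pair of made-up sequences using the BioEdit values listed above. It assumes that the first position of a gap costs the initiation score and that each further position costs the extension score; the programs themselves may count this slightly differently. The Jalview gap values can be tried by passing gap_open=-12 and gap_extend=-2, although the match and mismatch values would then still be BioEdit's.

```python
def score_alignment(aligned_a, aligned_b, match=2, mismatch=-1,
                    gap_open=-3, gap_extend=-1, gap="-"):
    """Add up the column scores of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    previous_gap_in = None  # which sequence had a gap in the last column
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap and y == gap:
            raise ValueError("a column cannot be a gap in both sequences")
        if x == gap or y == gap:
            gap_in = "a" if x == gap else "b"
            # Continuing a gap in the same sequence is cheaper than opening one.
            score += gap_extend if gap_in == previous_gap_in else gap_open
            previous_gap_in = gap_in
        else:
            score += match if x == y else mismatch
            previous_gap_in = None
    return score


# Two candidate alignments of ACGTTACG with ACGACG (made-up sequences).
one_long_gap = score_alignment("ACGTTACG", "ACG--ACG")
two_short_gaps = score_alignment("ACGTTACG", "ACG-A-CG")
print(one_long_gap, two_short_gaps)  # the higher score is the "better" alignment
```

With these values the single two-base gap scores higher than two separate one-base gaps, which is the point of charging more to open a gap than to extend one.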

In proteins, amino acid substitutions occur frequently, especially among amino acids of similar physicochemical properties. Alanine (methyl side chain) and valine (isopropyl side chain) commonly substitute for each other without making a significant difference in the properties of a protein. Hence, more complicated schemes than that above are required to score alignments.

Perhaps the most widely used scoring matrices for proteins are the BLOSUM (blocks substitution matrix) matrices developed by S. and J. G. Henikoff ["Amino acid substitution matrices from protein blocks", Proc. Natl. Acad. Sci. 1992, 89, 10915-10919].

Here is the BLOSUM62 matrix. The diagonal represents identity; positive values are colored lavender.

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

A few explanatory comments:

To look up the value corresponding to the substitution "N->W" (an Asparagine is replaced by a Tryptophan), go to the line of the table labeled "N" and follow this line until you get to the column labeled "W": -4. The substitution "N->W" is not very likely; that is, it does not occur often in closely related proteins. On the other hand, the substitution "V->I" (BLOSUM score: 3) is very likely.
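
The same row-then-column lookup can be sketched in Python. The matrix text below is copied from the table above; the function names are my own, and this is only an illustration of the lookup, not how BioEdit or Jalview store the matrix.

```python
BLOSUM62_TEXT = """
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
"""


def parse_matrix(text):
    """Turn the table text into a dictionary keyed by (row, column)."""
    lines = text.strip().splitlines()
    columns = lines[0].split()
    matrix = {}
    for line in lines[1:]:
        fields = line.split()
        row, values = fields[0], fields[1:]
        if len(values) != len(columns):
            raise ValueError(f"row {row} has the wrong number of values")
        for column, value in zip(columns, values):
            matrix[(row, column)] = int(value)
    return matrix


BLOSUM62 = parse_matrix(BLOSUM62_TEXT)


def substitution_score(original, replacement):
    """Look up the row for the original residue, then the column."""
    return BLOSUM62[(original.upper(), replacement.upper())]


print(substitution_score("N", "W"))  # -4, an unlikely substitution
print(substitution_score("V", "I"))  # 3, a likely substitution
```

Because the matrix is symmetric, looking up "W->N" gives the same value as "N->W".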

BioEdit itself can only do pairwise alignments. However, the download includes a copy of the multiple alignment tool ClustalW, which can align many sequences. You can use this directly, or configure BioEdit to run it for you.


This page last modified 11:11 AM on Sunday January 25th, 2009.
Webmaster, Department of Chemistry, University of Maine, Orono, ME 04469