Stockholm Bioinformatics Center, SBC

Lecture 18 Jan 2001 Per Kraulis

Databases in bioinformatics

6. Sequence motif databases


Pfam is a database of protein families defined as domains (contiguous segments of entire protein sequences). For each domain, it contains a multiple alignment of a set of defining sequences (the seeds), together with the other sequences in SWISS-PROT and TrEMBL that can be matched to that alignment.

The database was started in 1996 and is maintained by a consortium of scientists, among them Erik Sonnhammer (CGR, KI, Sweden), Sean Eddy (WashU, St Louis USA), Richard Durbin, Alan Bateman and Ewan Birney (Sanger Centre, UK). Release 5.5 (Sep 2000) contains 2478 families.

The alignments can be converted into hidden Markov models (HMMs), which can be used to search for domains in a query protein sequence. The software HMMER (by Sean Eddy) is the computational foundation for Pfam. The domain structure of protein sequences in SWISS-PROT and TrEMBL is available directly from the Pfam web sites, and it is also possible to search for domains in other sequences using servers at the web sites.
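As a rough illustration of how an alignment can be turned into a statistical model that is then scanned along a query sequence, the following Python sketch builds a simple position-specific log-odds profile from an ungapped toy alignment. This is a deliberate simplification and not what HMMER does: a real profile HMM also models insertions and deletions through state transitions. The seed sequences, the function names (build_profile, best_window), the uniform background frequency and the pseudocount are all assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency


def build_profile(alignment, pseudocount=1.0):
    """Build one log-odds score table per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log(freq / BACKGROUND)
        profile.append(column)
    return profile


def best_window(profile, sequence):
    """Return (score, start) of the best-scoring ungapped window."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        # Residues outside the 20 standard amino acids score 0.
        score = sum(
            profile[i].get(sequence[start + i], 0.0) for i in range(width)
        )
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Hypothetical seed alignment, invented for this example.
seed = ["GKSTL", "GKTTL", "GRSTL", "AKSTV"]
profile = build_profile(seed)
score, start = best_window(profile, "MMAGKSTLPP")
print(start, round(score, 2))
```

The window that matches the consensus of the seeds receives the highest score. Pfam itself relies on HMMER for both model building and searching.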

The technology behind Pfam/HMMER will be discussed in a lecture later in this course.

The Pfam database can be searched, used to identify domains in a sequence, or downloaded from the Pfam websites. An example of a multiple sequence alignment that defines a protein family (domain) is given for the Raf-like Ras-binding domain (Pfam name RBD, accession code PF02196).

The Pfam database is licensed under the GNU General Public License, which makes it freely available to anyone, with the restriction that derivative works (new databases, modifications) must also be made available in source form.


PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.

It was started by Amos Bairoch, is part of SWISS-PROT and is maintained in the same way as SWISS-PROT. It is based on regular expressions that describe characteristic subsequences of specific protein families or domains. PROSITE has also been extended to include some profiles, which can be described as probability patterns for specific protein sequence families.

Regular expressions will be described in a lecture later in this course.

The PROSITE website can be used to search by keyword or other text in the entries, to search for a pattern in a sequence, or to search for proteins in SWISS-PROT that match a pattern. An example of a PROSITE regular expression is given for the Ras GTPase-activating proteins signature pattern (RAS_GTPASE_ACTIV_1, accession code PS00509).
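To make the idea of searching for a pattern in a sequence concrete, here is a minimal Python sketch that translates the PROSITE pattern notation into a standard regular expression and scans a sequence with it. It uses a different, very short PROSITE pattern, the N-glycosylation site (ASN_GLYCOSYLATION, PS00001), written N-{P}-[ST]-{P}, rather than the Ras GTPase-activating signature mentioned above. The function names and the query sequence are invented for this example, and the translator covers only the common syntax elements (x, [..], {..}, repeat counts, and the < and > anchors outside brackets).

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        # '<' and '>' anchor the pattern to the sequence ends.
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # A trailing (n) or (n,m) is a repeat count.
        repeat = ""
        m = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if m:
            element = m.group(1)
            upper = "," + m.group(3) if m.group(3) else ""
            repeat = "{" + m.group(2) + upper + "}"
        if element == "x":
            core = "."  # any residue
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"  # any residue except these
        else:
            core = element  # a single residue or a [..] set
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return (1-based position, matched text) for non-overlapping hits."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


# PS00001 (ASN_GLYCOSYLATION); the query sequence is made up.
print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_matches("N-{P}-[ST]-{P}.", "MKNGTALNPSA"))
```

In the made-up sequence, NGTA at position 3 matches, while NPSA does not because proline is excluded at the second position. Note that re.finditer reports only non-overlapping matches; a scanner that must report overlapping sites would need a lookahead-based search instead.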

Copyright © 2001 Per Kraulis $Date: 2001/01/19 11:54:29 $