Tutorial

To use the HGNC/wr web resource, all that is required is a programming language with libraries for HTTP and XML. (Every modern language has them; if it doesn't, it's not modern.) This is one of the advantages of a web service designed as a Resource-Oriented Architecture: it does not rely on the availability of special-purpose WS libraries for the programming language favoured by the potential user.

The programming language Python is used in this tutorial. This is just a matter of preference for this author; it should not be too difficult to translate the code into Java, Perl, Ruby or your language of choice.

The standard Python package urllib2 is all that is needed to fetch data from the HGNC/wr system. For most tasks, the fetched XML must be parsed by the client program. There are two main options: DOM (Document Object Model) and SAX (Simple API for XML). Both are supported by standard Python packages (xml.dom and xml.sax). In this tutorial, we choose DOM and use the standard package xml.dom.minidom, which produces a tree-like data structure of nodes corresponding to XML elements, attributes and text. The nodes have methods that allow searching, traversal and access.

It would of course be possible to write a special-purpose package for accessing the HGNC/wr system which wrapped all details of HTTP and DOM. But it is instructive to see that standard, well-established, basic technologies are sometimes quite sufficient to perform useful tasks.

  1. Fetch XML document for gene FLT3
  2. Print FLT3 alias symbols
  3. Identify outdated symbols in a list of gene symbols
  4. Get UniProt accession codes and XML URLs for a list of genes
  5. Find all genes with a symbol beginning with the string 'FLT'
  6. Get XML for several genes in one call
  7. Find the gene(s) annotated with xref UniProt:P36888

1. Fetch XML document for gene FLT3

    import urllib2

    urltemplate = 'http://www.avatar.se/HGNC/wr/gene/%s.xml'
    url = urltemplate % 'FLT3'
    xmldata = urllib2.urlopen(url).read()
    print xmldata

We build the URL from a template into which the gene symbol is inserted. The web resource is opened with the 'urlopen' function of urllib2. This returns a file-like instance that can be read from. We store the result in a variable. The result string is then printed just to check that it really is the requested XML.
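
The listing above targets Python 2. In Python 3, urllib2 was merged into urllib.request and print became a function. A minimal sketch of the same fetch under that assumption follows; the helper names gene_url and fetch_gene_xml are illustrative and not part of HGNC/wr, and decoding the response as UTF-8 is an assumption, since the encoding is not stated here.

```python
import urllib.request

URL_TEMPLATE = 'http://www.avatar.se/HGNC/wr/gene/%s.xml'


def gene_url(symbol):
    """Build the XML resource URL for one gene symbol."""
    return URL_TEMPLATE % symbol


def fetch_gene_xml(symbol):
    """Fetch the XML document for a gene and return it as text.

    Assumption: the document is UTF-8 encoded.
    """
    with urllib.request.urlopen(gene_url(symbol)) as response:
        return response.read().decode('utf-8')
```

Calling print(fetch_gene_xml('FLT3')) would then print the document, provided the server is reachable.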

2. Print FLT3 alias symbols

    import urllib2, xml.dom.minidom

    urltemplate = 'http://www.avatar.se/HGNC/wr/gene/%s.xml'
    url = urltemplate % 'FLT3'
    xmldata = urllib2.urlopen(url).read()
    dom = xml.dom.minidom.parseString(xmldata)
    for node in dom.getElementsByTagName('alias'):
        print node.childNodes[0].nodeValue

In this case, we need to pick out the information from the XML document for the gene. The XML must be parsed into a DOM node tree. All nodes with the tag name 'alias' are selected (corresponding to the XML element '<alias>') and the texts for those nodes are printed out. (In DOM, the text within an XML element is represented as the value of the first child node of the XML element node.) This works because there can be no other XML elements '<alias>' in this particular document.
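
The relationship between an element node and its text node can be tried out without contacting the server. The sketch below parses a made-up fragment: the alias values are placeholders, and the real HGNC/wr document contains more elements than shown. It also guards against an empty element, for which childNodes[0] would raise an IndexError.

```python
import xml.dom.minidom

# Hypothetical fragment; alias values are placeholders, not real data.
SAMPLE = (
    '<Gene symbol="FLT3">'
    '<alias>ALIAS_A</alias>'
    '<alias>ALIAS_B</alias>'
    '<alias/>'
    '</Gene>'
)


def alias_symbols(xmldata):
    """Return the text of every non-empty <alias> element."""
    dom = xml.dom.minidom.parseString(xmldata)
    symbols = []
    for node in dom.getElementsByTagName('alias'):
        child = node.firstChild
        # An empty element has no child node at all, so check before reading.
        if child is not None and child.nodeType == child.TEXT_NODE:
            symbols.append(child.nodeValue)
    return symbols


print(alias_symbols(SAMPLE))
```

Using firstChild instead of childNodes[0] returns None for an empty element, which makes the guard a simple comparison.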

The DOM interface is described by the Document Object Model (DOM) Level 1 Specification.

3. Identify outdated symbols in a list of gene symbols

    import urllib2

    urltemplate = 'http://www.avatar.se/HGNC/wr/gene/%s.xml'
    for symbol in ['FLT3', 'FLK2', 'INSR', 'CD220', 'DOES_NOT_EXIST']:
        url = urltemplate % symbol
        try:
            resource = urllib2.urlopen(url)
            if resource.geturl() != url:
                print symbol, 'is outdated!'
            else:
                print symbol, 'is current'
        except urllib2.HTTPError, msg:
            print symbol, str(msg)

The XML resource is accessed for each gene symbol. If the symbol is no longer current, the HGNC/wr server responds with an HTTP redirect. urllib2 follows the redirect automatically, without the client code having to do anything. We can check whether this happened: the response URL is obtained by calling the geturl() method of the file-like instance returned by urlopen. It will differ from the request URL if a redirection occurred.

If the gene symbol does not seem to exist at all, then an exception is raised, and we can check that it is the expected error by looking at the HTTP error code.
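
The listing above prints the exception message rather than inspecting the error code. A sketch of an explicit check follows, written for Python 3, where urllib.request and urllib.error replace urllib2. It assumes the server answers 404 for an unknown symbol, which is not stated in this tutorial. The function name check_symbol and its opener parameter (which lets the logic be tried with a stand-in instead of a network call) are illustrative.

```python
import urllib.error
import urllib.request

URL_TEMPLATE = 'http://www.avatar.se/HGNC/wr/gene/%s.xml'


def check_symbol(symbol, opener=urllib.request.urlopen):
    """Classify a symbol as 'current', 'outdated' or 'unknown'."""
    url = URL_TEMPLATE % symbol
    try:
        resource = opener(url)
    except urllib.error.HTTPError as err:
        # Assumption: an unknown symbol yields HTTP 404 Not Found.
        if err.code == 404:
            return 'unknown'
        raise  # any other HTTP error is unexpected, so let it propagate
    # A followed redirect shows up as a response URL that differs
    # from the request URL.
    return 'current' if resource.geturl() == url else 'outdated'
```

Re-raising anything other than the expected code keeps server failures from being mistaken for missing symbols.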

If we wanted to find the current gene symbol we would have to either pick apart the response URL, or read, parse and interpret the XML document. This is left as an exercise for the reader.
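
As a starting point for the first option, the sketch below picks the symbol out of a response URL. It assumes the redirect target has the same shape as the request template, '.../gene/{symbol}.xml', which is not confirmed here. The function name symbol_from_url is illustrative.

```python
import posixpath
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit


def symbol_from_url(response_url):
    """Extract the gene symbol from a URL of the form .../gene/{symbol}.xml."""
    filename = posixpath.basename(urlsplit(response_url).path)
    symbol, extension = posixpath.splitext(filename)
    if extension != '.xml' or not symbol:
        raise ValueError('unexpected resource URL: %r' % response_url)
    return symbol


print(symbol_from_url('http://www.avatar.se/HGNC/wr/gene/FLT3.xml'))
```

Parsing the XML document instead would not depend on the URL layout, at the cost of reading and parsing the response body.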

4. Get UniProt accession codes and XML URLs for a list of genes

    import urllib2, xml.dom.minidom

    urltemplate = 'http://www.avatar.se/HGNC/wr/gene/%s.xml'
    for symbol in ['FLT3', 'INSR', 'HBA1']:
        url = urltemplate % symbol
        xmldata = urllib2.urlopen(url).read()
        dom = xml.dom.minidom.parseString(xmldata)
        for node in dom.getElementsByTagName('xref'):
            if node.getAttribute('xdb') == 'UniProt':
                print symbol, node.getAttribute('xkey')
                for node2 in node.getElementsByTagName('link'):
                    if node2.getAttribute('format') == 'xml':
                        print ' XML URL:', node2.getAttribute('xlink:href')

First get the XML document for each gene, and parse it. Loop through all xref nodes (XML element '<xref>') and test for those referring to UniProt in the 'xdb' attribute. Print out the value of the 'xkey' attribute, which is the UniProt accession code. Then loop through all link nodes provided for the xref, and print out the URL (attribute 'xlink:href') for those that have the correct expected format.
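
The same traversal can be tried offline and made reusable by returning a dictionary instead of printing. In the sketch below, the element and attribute names follow the listing above, but the sample document, its nesting, the keys and the URLs are placeholders. The function name xref_links is illustrative. Note that the xlink prefix must be declared in the document for the parser to accept it.

```python
import xml.dom.minidom

# Hypothetical fragment: keys and URLs are placeholders, not real data.
SAMPLE = (
    '<Gene xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<xref xdb="UniProt" xkey="KEY1">'
    '<link format="html" xlink:href="http://example.org/KEY1.html"/>'
    '<link format="xml" xlink:href="http://example.org/KEY1.xml"/>'
    '</xref>'
    '<xref xdb="OtherDB" xkey="KEY2"/>'
    '</Gene>'
)


def xref_links(xmldata, xdb, fmt):
    """Map each xkey of database `xdb` to its link URLs in format `fmt`."""
    dom = xml.dom.minidom.parseString(xmldata)
    result = {}
    for xref in dom.getElementsByTagName('xref'):
        if xref.getAttribute('xdb') != xdb:
            continue
        # Only <link> elements nested inside this <xref> are considered.
        result[xref.getAttribute('xkey')] = [
            link.getAttribute('xlink:href')
            for link in xref.getElementsByTagName('link')
            if link.getAttribute('format') == fmt
        ]
    return result


print(xref_links(SAMPLE, 'UniProt', 'xml'))
```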

5. Find all genes with a symbol beginning with the string 'FLT'

    import urllib, urllib2, xml.dom.minidom

    urltemplate = 'http://www.avatar.se/HGNC/wr/genes;index.xml?%s'
    url = urltemplate % urllib.urlencode({'search': 'symbol', 'value': 'FLT'})
    xmldata = urllib2.urlopen(url).read()
    dom = xml.dom.minidom.parseString(xmldata)
    for node in dom.getElementsByTagName('gene'):
        print node.getAttribute('xlink:title')

We need an index of genes having a symbol beginning with 'FLT'. The URL '/HGNC/wr/genes;index.xml' is used for fetching an index in XML format. The search criteria are given as query parameters in the URL, after the '?' character. The parameters must be URL-encoded. The standard Python package urllib has a useful function that does this for a dictionary of key:value pairs.

Two parameters are needed: 'search' which determines the field to be searched (in this case the symbol), and 'value', which is the string to test for. By default, a non-exact search is done, where all symbols beginning with the specified string are matched. The wildcards '*' (= any number of any characters) and '?' (= any single character) may be used to make other types of searches.
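
The effect of URL-encoding on wildcard characters can be inspected without contacting the server. The sketch below is written for Python 3, where urlencode lives in urllib.parse; the helper name index_url is illustrative. It only builds the URLs. How the server combines wildcards with the default prefix matching is not verified here.

```python
from urllib.parse import urlencode  # Python 2: from urllib import urlencode

INDEX_URL = 'http://www.avatar.se/HGNC/wr/genes;index.xml'


def index_url(field, value):
    """Build an index URL for the given search field and value."""
    # A list of pairs keeps the parameter order fixed.
    query = urlencode([('search', field), ('value', value)])
    return '%s?%s' % (INDEX_URL, query)


print(index_url('symbol', 'FLT'))   # plain search string
print(index_url('symbol', 'FLT?'))  # search string using the '?' wildcard
print(index_url('symbol', '*LT3'))  # search string using the '*' wildcard
```

The '?' wildcard is escaped as '%3F' in the value, so it cannot be confused with the '?' that separates the path from the query parameters.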

As usual, parse the XML. Since this is an index, loop through all 'gene' nodes. Pick out the gene symbol as the 'xlink:title' attribute of each node.

6. Get XML for several genes in one call

    import urllib2, xml.dom.minidom

    urltemplate = 'http://www.avatar.se/HGNC/wr/genes/%s'
    url = urltemplate % 'FLT3;INSR;HBA1;FOXD1'
    xmldata = urllib2.urlopen(url).read()
    dom = xml.dom.minidom.parseString(xmldata)
    for node in dom.getElementsByTagName('Gene'):
        print node.getAttribute('symbol'), node.getAttribute('acc')

It is much more efficient (both for the server and the client) to fetch large sets of data in one call, rather than small bits in many calls. To fetch an XML document that contains many genes (rather than just one), use the URL '/HGNC/wr/genes/{genespecs}', where genespecs is a semicolon-delimited list of gene identifiers or gene symbols. Aliases and previous or withdrawn symbols cannot be used in this case.

Get the XML, and parse it as usual. Here, we loop over all 'Gene' nodes and print out their symbols and HGNC identifiers (accession codes). The order of the genes in the XML document is undefined.
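
Because the order of the genes is undefined, a client that needs results in request order can index them by symbol first. The sketch below builds the multi-gene URL from a list and indexes a made-up response. The root element name and the accession values in the sample are placeholders, and the function names are illustrative.

```python
import xml.dom.minidom

URL_TEMPLATE = 'http://www.avatar.se/HGNC/wr/genes/%s'

# Hypothetical response; root element and accession values are placeholders.
SAMPLE = (
    '<Genes>'
    '<Gene symbol="INSR" acc="ACC_2"/>'
    '<Gene symbol="FLT3" acc="ACC_1"/>'
    '</Genes>'
)


def genes_url(symbols):
    """Build the multi-gene URL from a list of current symbols."""
    return URL_TEMPLATE % ';'.join(symbols)


def accessions_by_symbol(xmldata):
    """Map gene symbol to accession code, independent of document order."""
    dom = xml.dom.minidom.parseString(xmldata)
    return dict(
        (node.getAttribute('symbol'), node.getAttribute('acc'))
        for node in dom.getElementsByTagName('Gene')
    )


requested = ['FLT3', 'INSR']
print(genes_url(requested))
found = accessions_by_symbol(SAMPLE)
for symbol in requested:  # report in the order that was asked for
    print('%s %s' % (symbol, found.get(symbol, 'not returned')))
```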

7. Find the gene(s) annotated with xref UniProt:P36888

    import urllib, urllib2, xml.dom.minidom

    urltemplate = 'http://www.avatar.se/HGNC/wr/genes;index.xml?%s'
    url = urltemplate % urllib.urlencode({'search': 'xref', 'value': 'UniProt:P36888'})
    xmldata = urllib2.urlopen(url).read()
    dom = xml.dom.minidom.parseString(xmldata)
    for node in dom.getElementsByTagName('gene'):
        print node.getAttribute('xlink:title')

Given an entry in some external database, it is often of interest to find the gene that has been associated with it. To do this, use the xref search. The search string is the cross-reference specified in the format 'xdb:xkey', where 'xdb' is the name of the external database (UniProt in this case), and 'xkey' is the identifier of the entry (the UniProt accession code P36888).

Get the XML, and parse it as usual. Here, we print out the title of each link, which contains the HGNC gene symbol. Of course, we expect only one gene, but this code also gracefully handles the case where more than one gene has been annotated with the given UniProt entry.