Today’s bioinformatics bite comes from my very first walk-in question. A researcher wanted to know how to find the set of genes in a given genomic region. This analysis is useful if your organism of interest has a known deletion associated with a phenotype, and you want to get at the molecular mechanisms behind that phenotype.
Let’s use the following genomic region as an example: Human X Chromosome, bases 151,073,054-151,383,976.
If you’re a visual person, the UCSC genome browser is a good option for finding this information. Here’s what the page looks like by default*.
The genomic region being shown is indicated in the upper left. If you want to change the region being shown, type the new region in the box in the middle, but make sure your syntax matches the example on the left. The example we’re using would be entered as chrX:151,073,054-151,383,976.
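If you’d rather build the region string in a script than type it by hand, here’s a minimal Python sketch. The function name `ucsc_region` is my own, made up for illustration; it is not part of any UCSC tool and simply reproduces the syntax of the example above.

```python
def ucsc_region(chrom: str, start: int, end: int) -> str:
    """Format a region like the example above, e.g. chrX:151,073,054-151,383,976.

    `ucsc_region` is a hypothetical helper, not a UCSC function.
    """
    if start > end:
        raise ValueError("start must not be greater than end")
    # The "," format spec adds thousands separators to each coordinate.
    return f"chr{chrom}:{start:,}-{end:,}"


print(ucsc_region("X", 151073054, 151383976))
```

Paste the printed string into the position box as you would the hand-typed version.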
That is A LOT of information. If you want to remove the information you’re not interested in, click the “hide all” button below the image (inside red box in the image). Here’s what it looks like after:
Now let’s add some information back into the browser. Since we’re specifically interested in genes, go under the “Genes and Gene Predictions” tab and change the dropdown below your favorite gene annotation set from “hide” to “full”. I like using RefSeq genes.
Now we can see that the GABRA3 gene and 3 microRNA genes are present in this region. However, I don’t see a quick way of exporting the gene list, which can become a problem if you’re looking at a bigger region that contains more genes.
To get around this issue, you can use NCBI’s Gene database. First, click on the “Advanced” link below the search box.
One of the things I like most about NCBI is that you can use the dropdown menus in the search builder to select fields that make your search more specific. For example, if you select “Organism” from the dropdown menu and type human in the corresponding search box, your search is restricted to human genes**. I can also specify the chromosome I’m interested in (Chromosome field) and the base pair range (Base Position field) in the advanced search, as shown below.
Notice how the syntax is different. NCBI uses a colon (:) instead of a dash (-) to indicate a range of base pairs. More information about how to format fields can be found in the FAQ.
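To make the syntax difference concrete, here’s a small Python sketch that assembles the same kind of query as text. The helper name `gene_query` is hypothetical, and the field tags are simply the field names from the dropdown menus; the exact string the search builder generates for you may differ slightly, and leaving out the thousands separators is an assumption on my part.

```python
def gene_query(organism: str, chrom: str, start: int, end: int) -> str:
    """Assemble an NCBI Gene style query for a chromosome region.

    Hypothetical helper: field tags are taken from the dropdown names
    (Organism, Chromosome, Base Position), not from an official API.
    """
    if start > end:
        raise ValueError("start must not be greater than end")
    # NCBI separates the two ends of a range with a colon, not a dash.
    return (
        f"{organism}[Organism] AND {chrom}[Chromosome] "
        f"AND {start}:{end}[Base Position]"
    )


print(gene_query("human", "X", 151073054, 151383976))
```

The printed text can be pasted into the Gene search box instead of filling in the builder field by field.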
But what happens when I run this search?
I got 3 genes, but not the ones I found on UCSC genome browser. Why doesn’t the GABRA3 gene come up? To find out, I searched for human GABRA3 in the gene database: (human[Organism] AND GABRA3[Gene Name])
The answer lies in the “Assembly” and “Location” fields of the Genomic Context box***. Compare the assembly number to the one at the top of UCSC genome browser:
The UCSC genome browser is using assembly #36 and NCBI is already on #38. And we can see from the Location field that the gene is now at base pairs 152,166,234-152,451,359 in the newest assembly. Now when you search with these coordinates, you get the same genes that the UCSC genome browser showed.
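You can see why the first search missed GABRA3 with a quick overlap check. This is just a sketch using the two sets of coordinates from this post; the function name `overlaps` is mine.

```python
def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """Return True if two inclusive base pair ranges share at least one base."""
    return a_start <= b_end and b_start <= a_end


# Region from the UCSC example (assembly #36 coordinates).
old_region = (151073054, 151383976)
# GABRA3 location reported by NCBI (assembly #38 coordinates).
gabra3_new = (152166234, 152451359)

# The old coordinates end before the gene's new location begins,
# so searching the newer assembly with them cannot return GABRA3.
print(overlaps(*old_region, *gabra3_new))  # False
```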
The moral of the story is, ALWAYS check the version of the data that you’re working with.
Now if you want a version of the table that can be saved and is machine readable, click on the button next to “Send to”, select “File”, and click “Create File”.
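Once you have the file, pulling out the gene list takes only a few lines of Python. This sketch assumes the download is tab-delimited text with a header row that includes a column named Symbol; check your own file and adjust the `column` argument if it differs. The function name `gene_symbols` and the two-line sample are made up for illustration and are not real NCBI output.

```python
import csv
import io


def gene_symbols(handle, column: str = "Symbol") -> list[str]:
    """Collect one column from a tab-delimited table with a header row.

    Assumes a column called "Symbol" exists; returns an empty list if not.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row[column] for row in reader if row.get(column)]


# Tiny made-up stand-in for a downloaded file (not real NCBI output).
sample = io.StringIO("Symbol\tdescription\nGABRA3\tplaceholder text\n")
print(gene_symbols(sample))
```

For a real download, open the saved file with `open(...)` and pass the file handle in place of `sample`.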
Hope this helped. Ask Us if you have any questions!
-Tobin Magle, Biomedical Sciences Research Support Specialist
* If your page doesn’t look like my screen shot, it’s probably because you’ve been here before and your browser remembers what you did last time. Reset the system by clicking “default tracks” under the image.
**If you type human and search all fields, the search picks up the word “human” anywhere in the gene record, including a description that says something like “this gene is a homolog of human protein…”.
***Even though the human genome is “complete”, it is still continually being tweaked as better data becomes available. For more information on NCBI genomic assemblies, look here.