Today’s bioinformatics bite is the patron question we’ve been leading up to in previous posts:
How do I find the transcription start site of the human signal transducer and activator of transcription 1 gene?
Step 1: Find the official gene name. For this gene, it is STAT1.
Step 2: Search the Gene database using the gene name and organism tags:
STAT1[Gene Name] AND human[Organism]
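If you would rather build this search in a script than type it into the search box, here is a minimal sketch. It assumes you want to use NCBI's E-utilities esearch endpoint against the gene database; the function names are made up for this post, and the code only builds the query and URL without sending a request.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities esearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_query(symbol: str, organism: str) -> str:
    """Combine a gene symbol and an organism using the field tags from Step 2."""
    return f"{symbol}[Gene Name] AND {organism}[Organism]"


def build_esearch_url(symbol: str, organism: str) -> str:
    """Build (but do not send) an esearch URL for the gene database."""
    params = {
        "db": "gene",
        "term": build_gene_query(symbol, organism),
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


query = build_gene_query("STAT1", "human")
url = build_esearch_url("STAT1", "human")
print(query)
print(url)
```

Pasting the printed URL into a browser should return the matching Gene IDs, which you can compare against the Gene ID shown on the gene page.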
Step 3: Profit. And by profit I mean find the transcription start site using this gene’s entry in the Gene database. Here’s what the gene page looks like after running the search above.
The top left shows basic information about the gene, like the Gene ID (6772), which is the unique identifier for this gene. Also, there’s an entry in the summary section for Orthologs, which links to the Gene entries for STAT1 in other organisms. But let’s get back to the original question.
Notice the Table of Contents in the right column. The links in this section let you jump to your section of interest on the Gene page, which is especially useful because there is A LOT of information contained in these pages.
Click on Genomic regions, transcripts, and products.
This brings up the Genome Browser. The dark gray bar near the top of the browser represents the genome and is marked with base pair positions. The green lines represent Transcripts. The red lines represent Coding Regions. There are two sets of transcripts here: those modeled by NCBI on the top, and those modeled by Ensembl at the bottom. The two groups build their transcript models with different annotation methods, so small discrepancies between them are normal.
Let’s zoom in a bit on the 5′ end of the STAT1 transcript to get a better look. (You can approximate the region I’m looking at using the genome position bar at the top.)
First, let’s talk about the difference between the transcripts and the coding regions. Coding regions only contain the protein coding parts of the transcript. They do not include the untranslated regions (UTRs), so I wouldn’t look here for the 5′ end of the transcript.
Transcripts contain the entire sequence that is thought to be transcribed. The thick green boxes represent the exons, and the thin green lines represent the introns. Some of the exons are dark green, indicating that they are coding exons. The light green exons are noncoding exons. Isn’t it nice that the coding exons line up with the coding region?
Next, look at the three transcripts. Notice that two of them are labeled with a number that starts with NM_ and the other one is labeled with XM_. These are accession numbers, the unique identifiers for nucleotide sequence records. In RefSeq, accessions that start with NM_ are curated mRNA records supported by transcript evidence, while those that start with XM_ are model transcripts predicted by NCBI’s automated genome annotation. If you mouse over the accession number, it will give you basic information about the transcript, including the accession number of the protein that it encodes (the NP_ number).
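If you end up with a list of accession numbers and want to sort them quickly, the prefixes can be checked in a few lines. This is only a sketch: the function name is invented for this post, the descriptions are a rough summary of the common RefSeq prefixes, and the accession numbers in the example are placeholders rather than real STAT1 records.

```python
# Rough summary of common RefSeq accession prefixes.
REFSEQ_PREFIXES = {
    "NM_": "curated mRNA",
    "NR_": "curated non-coding RNA",
    "NP_": "curated protein",
    "XM_": "predicted (model) mRNA",
    "XR_": "predicted (model) non-coding RNA",
    "XP_": "predicted (model) protein",
}


def describe_accession(accession: str) -> str:
    """Return a short description based on the first three characters."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown prefix")


# Placeholder accessions, for illustration only.
for acc in ["NM_000000.1", "XM_000000.1", "NP_000000.1"]:
    print(acc, "->", describe_accession(acc))
```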
Now let’s get down to business answering the question! Scroll back to the top of the gene page and click the arrow next to display settings and select Gene Table.
This brings you to a pretty similar view of the transcripts in the genome browser, while adding a feature called Exon table at the bottom.
Click the plus sign to the left of one of the transcripts (preferably an NM_ transcript) to see the table.
This table contains the genomic positions of all of the exons in the chosen transcript and indicates which parts are coding. If you want to know the transcription start site, look at where the first exon starts. Note that this gene is on the negative strand, so each range runs from the higher coordinate to the lower one, and the transcription start site is the highest coordinate of the first exon. The genome browser flipped it around for you automatically.
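The strand logic is easy to get backward, so here is a small sketch of it in code. It assumes you have already copied the exon ranges out of the table as pairs of numbers; the function name and the coordinates are made up for illustration and are not the real STAT1 positions.

```python
from typing import List, Tuple


def transcription_start_site(exons: List[Tuple[int, int]], strand: str) -> int:
    """Return the genomic coordinate of the 5' end of a transcript.

    exons:  (start, end) pairs in either order, as read from an exon table.
    strand: "+" for the positive strand, "-" for the negative strand.
    """
    if not exons:
        raise ValueError("need at least one exon")
    if strand == "+":
        # Positive strand: transcription begins at the lowest coordinate.
        return min(min(a, b) for a, b in exons)
    if strand == "-":
        # Negative strand: transcription begins at the highest coordinate.
        return max(max(a, b) for a, b in exons)
    raise ValueError("strand must be '+' or '-'")


# Made-up coordinates for a negative-strand transcript, written high-to-low
# the way the exon table lists them.
example_exons = [(5000, 4800), (4500, 4300), (4000, 3900)]
print(transcription_start_site(example_exons, "-"))  # 5000
```

For a gene on the positive strand, the same function returns the lowest coordinate instead.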
Finally, note which chromosome the gene is on, which is listed in the Genomic Context section.
You can also export the exon table by selecting Send to: File at the top of the page and selecting Gene Table (text) from the dropdown menu.
Thanks for reading! Hope that everyone has a nice Fourth of July holiday!
– Tobin Magle, Biomedical Sciences Research Support Specialist