Bioinformatics Bites: MedGen

This bioinformatics bite is going to be a little bit more clinically oriented:

A patient is presenting with excess blood clotting, which she thinks might be related to “something that runs in her family”. How do I find known diseases and genes (if any) that are associated with that phenotype?

A good place to start to look for information about symptoms and diseases that are related to genetics is MedGen. This database organizes information related to human medical genetics, like symptoms (clinical features), related genes, diseases, or genomic loci.

A perfectly reasonable approach would be to type “clotting” into the MedGen search box. Here’s what those results look like:


There are 94 results, the first of which is a clotting disorder, but one that is associated with too little clotting rather than too much clotting. If you scroll down, you see records that are not actually diseases:


To find out what type of record you’re looking at, look at the text after the concept ID (blue boxes). The screen captures above show a Disease or Syndrome, a Finding, and a Pharmacologic substance. Notice that the disease record has links to other databases (green circle) and the others do not.

So how do we specify that we’re looking for a patient symptom related to a genetic disease? Like all the other NCBI databases, MedGen has field tags.

Here are some useful ones:

  • Clinical Features: short stature[clinical features] – records for diseases that are associated with short stature
  • Related Genes: LMNB1[gene] – diseases associated with this gene
  • Disease name: achondroplasia[title] – this disease
  • Chromosome: 6[chromosome] – diseases associated with alterations to chromosome 6
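
If you would rather script these searches, the same field-tagged queries can be passed to NCBI's E-utilities. The sketch below only builds the esearch URL and never contacts NCBI. The helper name build_esearch_url is my own invention, and the field tags are simply the ones from the list above, so treat it as a starting point rather than a tested pipeline.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities esearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="medgen", retmax=20):
    """Return an esearch URL for a field-tagged Entrez query (hypothetical helper)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


# Field tags copied from the list above.
queries = [
    "short stature[clinical features]",
    "LMNB1[gene]",
    "achondroplasia[title]",
    "6[chromosome]",
]

for q in queries:
    print(build_esearch_url(q))
```

Opening one of the printed URLs in a browser should return the matching MedGen record IDs, which is a handy way to record exactly which query you ran.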

Also, if you look back at the first screen shot, you can see a link that says “See MedGen results with clotting as a clinical feature (5)“. MedGen automatically sensed that clotting was a clinical feature, or symptom, and narrowed your results down for you.

Now we’ve narrowed the MedGen results to those that have clotting listed as a clinical feature. If you read the description, you see that Factor V deficiency is the only one associated with excess clotting. The record also shows what gene is associated with this disorder (F5) and links to descriptions from other resources like GeneReviews and OMIM, as well as Professional guidelines and Recent clinical studies.


So how do you find out if this is in fact what your patient has? Find out next week!

-C. Tobin Magle, PhD, Biomedical Sciences Research Support Specialist


Bioinformatics Bites: GEO2R

This blog series has covered how to use both GEO Datasets (which holds both curated and uncurated datasets) and GEO Profiles (which holds expression profiles for individual genes from curated data sets).

But what if you want to see expression profiles of a gene from an uncurated dataset? That’s where GEO2R comes in. Once you’ve identified a dataset by searching GEO Datasets, you can start using GEO2R in 5 easy steps:

  1. Pick your experiment
  2. Define sample groups
  3. Assign samples to groups
  4. Perform the test
  5. Interpret the results table

Pick your experiment:

You need an accession number to start using GEO2R. Which one do you use? Let’s use this dataset from Toxoplasma gondii as an example.


This record contains accession numbers (boxed in red) for the series, samples, and platform. GEO2R is looking for the Series Accession number, GSE73177.  Enter this number into the search field in GEO2R, or click the Analyze with GEO2R link (blue arrow):


Define sample groups:

This experiment is measuring expression levels in 3 groups (parent strain, knockout strain, and complemented knockout strain**). Thus, we need to create 3 sample groups. To do this, click the “Define Groups” link (green circle above). This action opens a popup that allows you to enter free text to name the groups (red box; wt, ko, and comp in the following example).


Assign samples to groups: 

Now you need to tell the program which samples belong to each group by selecting the samples that you want to put into a group, then clicking on the group you want to add them to in the Define Groups popup. In the example below, I have selected the complemented knockout samples (highlighted yellow) and will click the “comp” group to add them. After they are added to a group, the corresponding colors change to that of the group and the group column is populated, as in the case of wt and ko in the example. Repeat this process for all of your groups.
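
If it helps to think of steps 2 and 3 as data, the end result is just a mapping from each sample to one named group. Here is a minimal Python sketch of that idea. The sample labels are placeholders that I made up (they are not the real accessions from this series), and the helper name invert is mine.

```python
from collections import defaultdict

# Placeholder sample labels (NOT real GSM accessions from GSE73177).
sample_to_group = {
    "sample_01": "wt",
    "sample_02": "wt",
    "sample_03": "ko",
    "sample_04": "ko",
    "sample_05": "comp",
    "sample_06": "comp",
}


def invert(assignments):
    """Turn a sample -> group mapping into group -> sorted list of samples."""
    groups = defaultdict(list)
    for sample, group in assignments.items():
        groups[group].append(sample)
    return {group: sorted(samples) for group, samples in groups.items()}


groups = invert(sample_to_group)
for name, members in groups.items():
    print(name, len(members), members)

# Each sample belongs to exactly one group, so the counts must add up.
assert sum(len(m) for m in groups.values()) == len(sample_to_group)
```

Writing the assignment down like this also gives you a record of which samples went into which group if you repeat the analysis later.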


Perform the test:

To get a list of the top 250 differentially expressed genes using the default settings*, scroll down and click the Top 250 button under the GEO2R tab.

A table appears containing the top 250 differentially expressed probes from the platform, listing the probe ID, p-value, adjusted p-value, F statistic and probe sequence.
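
A quick word on the adjusted p-value column: it corrects the raw p-values for the fact that thousands of probes are tested at once. The sketch below shows one common correction, the Benjamini-Hochberg false discovery rate, in plain Python. I am assuming this method for illustration; check the Options tab to see which adjustment your analysis actually uses. The p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values, from largest p-value to smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # ascending rank of this p-value (1 = smallest)
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for five probes, for illustration only.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
print(benjamini_hochberg(raw))
```

Notice that the third and fourth probes end up with the same adjusted value: the procedure never lets a smaller raw p-value receive a larger adjusted p-value than the one ranked after it.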

Interpret the results table:

Clicking on the probe ID will show you a graph of the gene expression among the groups you have specified. You can also click Sample Values to get the number values represented on this graph.


In this case, it looks like the gene that is probed by 55.m10280_at is highly expressed in the knockout relative to the wild type, but doesn’t revert to wild type levels in the complemented strain.

To determine the gene name of the probe used in this experiment, visit the corresponding platform record for this series. (You can find this by searching the Series accession in GEO Datasets and using the platform filter.) Then, scroll to the bottom of the page to see the platform data table, which gives probe IDs, identifiers for genes in the Toxoplasma genome database, annotation, chromosome location and a description of the gene function if available. Search for the probe ID in the table to find the corresponding gene. In this case, it’s a thioredoxin domain-containing protein.
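
If you have many probes to look up, you can do the same lookup in a script after downloading the platform table. Here is a small sketch using a stand-in table. The column names (ID, Description) and the second row are invented for illustration, so adjust them to match the real platform file.

```python
import csv
import io

# A tiny stand-in for a platform data table. The column names and the second
# row are invented; only the first probe ID and its description come from
# the example above.
PLATFORM_TABLE = """ID\tDescription
55.m10280_at\tthioredoxin domain-containing protein
example_probe_at\thypothetical protein
"""


def load_platform(text):
    """Parse a tab-delimited platform table into {probe_id: row}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["ID"]: row for row in reader}


probes = load_platform(PLATFORM_TABLE)
print(probes["55.m10280_at"]["Description"])
```

With a real file, you would pass the file contents to load_platform instead of the inline string.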

But what if your gene of interest is not in the top 250? You can use the Profile Graph tab to search by probe id.

GEO2R also has basic QC tools. You can see the value distributions across samples to identify large scale problems in the dataset using the value distribution tab:


Finally, you can retrieve the R script for the analyses run in the R script tab.

NCBI also has a comprehensive tutorial on this tool if you’re interested.

Let me know if you have any questions!

  • C. Tobin Magle, PhD, Biomedical Science Research Support Specialist

* For novice users, the default settings are a good place to start. The calculation uses the limma package, and you can view and change the default settings by clicking the Options tab.


** knocked out gene added back in to account for off target effects of the knockout.

Bioinformatics Bites: Expression of a single gene in GEO Profiles

This week’s bioinformatics bite will answer a question from the end of last week’s post:

What is the expression of ITGA2 gene in prostate cancer cells?

I’d check GEO Profiles, because it is a gene-centric question. Let’s start by typing prostate cancer in the GEO Profiles search box.


Note the link back to the GEO data set that this gene profile was derived from (green circle). You can also see the platform and the specific probe that measures this gene (orange box). Finally, you can see a cartoon of the expression level between sample and control on the right side of the record.

To narrow down your search to a specific gene, use the filters on the left to select an organism (red box) and a Gene symbol (blue box, make sure to check that you’re using the correct gene symbol). Let’s look for the ITGA2 gene in human derived samples.


After applying the filters (blue and red boxes), you can see the search strategy in the Search details (orange box).
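
The search strategy is just a text query with field tags joined by AND, so you can also write it yourself. The sketch below assembles one in Python. The field names Gene Symbol and Organism are my assumption about how GEO Profiles labels these fields; compare the output against what you see in the Search details box before relying on it.

```python
def tagged(value, field):
    """Attach an Entrez field tag, e.g. tagged('ITGA2', 'Gene Symbol')."""
    return f"{value}[{field}]"


def and_query(*clauses):
    """Join Entrez clauses with AND, skipping empty ones."""
    return " AND ".join(c for c in clauses if c)


term = and_query(
    "prostate cancer",
    tagged("ITGA2", "Gene Symbol"),      # field name assumed
    tagged("Homo sapiens", "Organism"),  # field name assumed
)
print(term)
```

Saving the query string alongside your results makes the search easy to repeat later.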

To get a close up of the expression graph, click the cartoon on the right (green circle).


At a glance, you can see that ITGA2 expression goes up when the microRNA miR-205 is expressed (red bars). The graph also indicates how highly this gene is expressed relative to other genes from the same sample by percentile (blue squares), and the table below lists the expression value from each sample along with its rank.

If you need more information about how the samples were prepared, you can click on the GSM number in the table. From there, you can access general information about the experiment by clicking on the Series ID (GSE) on any sample page, or the original GEO profile record.

But what do you do if the gene data you’re trying to access are in an uncurated data set? NCBI has a tool for that: GEO2R. We will discuss how to use this tool next time.

  • C. Tobin Magle, PhD, Biomedical Science Research Support Specialist

Bioinformatics Bites: the GEO Databases

Wow! It’s been a long time since I’ve been able to take the time to write a post, and I apologize for the hiatus. I have been traveling, then playing catch-up and attending meetings after the travel. I hope to get back to my weekly posts now.

This week’s bioinformatics bites addresses finding gene expression information using GEO profiles.

Background: The Gene Expression Omnibus (GEO) was created by the NCBI to store gene expression data from microarray experiments. Most of the content is still microarrays, but some NGS data is also present (with the raw reads in the SRA database). Additionally, it now contains other types of high-throughput genomics data, like ChIP-chip.

GEO exists as two separate databases:

  • GEO DataSets: original submitter-supplied records and curated data sets
  • GEO Profiles: Expression profiles of individual genes from curated GEO data sets

GEO DataSets, which are labeled with GDS numbers, are assembled from three major submitted components:

  • Series (GSE) – List of expression profiles that were generated for the experiment (test, control, replicates)
  • Samples (GSM) – Information about the biological samples used in the experiments, including extraction procedures
  • Platforms (GPL) – What platform the samples were run on (like Affymetrix Mouse Genome 430 2.0 Array)

All 3 of these components are necessary to make use of the study. Both samples and series link to the CEL files, and platform gives you information about the chip the samples were run on. NCBI is working to assemble all of the components from each submitted study into a curated DataSet, but there is some lag in the process. NGS studies can’t be curated at this time.
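
Because each record type has its own accession prefix, you can tell what you are looking at from the first three letters. Here is a small Python sketch that does exactly that. The function name geo_record_type is my own, and it only knows the four prefixes mentioned in this post.

```python
import re

# Accession prefixes for the record types described in this post.
RECORD_TYPES = {
    "GSE": "Series",
    "GSM": "Sample",
    "GPL": "Platform",
    "GDS": "curated DataSet",
}


def geo_record_type(accession):
    """Classify a GEO accession by its three-letter prefix."""
    match = re.fullmatch(r"(GSE|GSM|GPL|GDS)\d+", accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised GEO accession: {accession!r}")
    return RECORD_TYPES[match.group(1)]


print(geo_record_type("GSE73177"))
```

A check like this is handy when a tool asks for one specific accession type and you have several to choose from on the same record.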

After curation, the data sets are broken out by gene instead of experiment and the data are loaded into GEO profiles.

If you do a quick search for “cancer”, here is what the results look like.

GEO DataSets results

This looks a lot like the output of a Gene search, but with different filters. You can see in the red box that there are over 1000 cancer data sets in GEO, and that the top data set has 6 samples by following the red arrow. You can filter by study type (blue box), things like tissue or strain (purple box), or Organism (orange arrow). You also get a search details box, which shows that MeSH terms are applied to your text search, just like PubMed.

Now that we have a general idea of how the databases are structured, I will answer a practical question in next week’s post:

What is the expression of ITGA2 gene in prostate cancer cells?

See you next week!

  • C. Tobin Magle, Biomedical Sciences Research Support Specialist

Bioinformatics Bites: Repurposing publicly available data

I’m going in a bit of a different direction with this bioinformatics bite segment. Instead of explicitly describing how to use a database or tool, I wanted to tell you all about a little project that I’m working on with a vet student from CSU.

Just because I don’t have a lab or large amounts of research funding doesn’t mean I can’t do science! There’s a wealth of bioinformatic data publicly available online and some user friendly tools that are available. We’re using these freely available resources to ask questions about how the distribution of microbes in the environment correlates with specific landmarks.

Our research question is as follows: Are there differences in microbial populations at sites that are close to zoos in the NYC area compared to those farther away? To address this question, we are using data from the PathoMap project, which swabbed surfaces in transit stops all over New York City. (This project is also being expanded to the top 10 cities worldwide for public transit ridership by a project called MetaSub.) See this publication for more information.

PathoMap main page

We determined test and control sites by looking up how city planners determine the distance people are willing to walk to a transit stop. We are also using a tool called GenGIS 2 to analyze and visualize these data. See this publication for more information on GenGIS.
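
To give a flavor of the test-versus-control idea, here is a rough Python sketch that labels a transit stop by its great-circle distance from a landmark. This is not our actual analysis code. The coordinates, the stop names and the 800 m cutoff are all invented for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def classify(stop, landmark, walk_limit_m):
    """Label a stop 'test' if it is within walking distance of the landmark."""
    return "test" if haversine_m(*stop, *landmark) <= walk_limit_m else "control"


# All coordinates and the 800 m cutoff are invented for illustration.
landmark = (40.8500, -73.8800)
stops = {"stop_a": (40.8520, -73.8790), "stop_b": (40.7500, -73.9900)}
for name, coords in stops.items():
    print(name, classify(coords, landmark, walk_limit_m=800))
```

Straight-line distance is a simplification, since people walk along streets, but it is a reasonable first pass for sorting sites into groups.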

GenGIS output from their wiki

Repurposing data forces you to think differently. Our research question was designed in the context of what data and tools were publicly available. This type of research will only get easier as the mindset of creating well-designed community resources expands. Initiatives like Big Data to Knowledge (BD2K) and the Center for Open Science are driving this new trend.

If you’re curious about data repurposing, or how to find datasets and tools, please see the bioinformatics section of our new Research Support Pages or set up a consultation to discuss your questions.

  • Tobin Magle, biomedical sciences research support specialist.

Bioinformatics Bites: PubChem Bioassay

The NCBI insights blog has another wonderful post about PubChem.

This week, the post focuses on using the Bioassay database and how it can help identify chemical reagents and drugs.

  • Tobin Magle, PhD, Biomedical Sciences Research Support Specialist

Bioinformatics Bites: Creating a custom database to search with BLAST

This week’s Bioinformatics Bite will show you how to run a BLAST search in a custom database that you create using Entrez queries. Also, if you’re interested, NCBI has a YouTube video that covers this topic.

To review, there are 2 ways to search the NCBI databases:

  1. Sequence homology via BLAST
  2. Text via Entrez

Remember Entrez? This search engine allows you to apply field tags to refine your search results. These field tags vary by database, which makes sense because different data types necessitate different contextual fields. For example, you can find the field tags for the sequence databases (Nucleotide, Protein, GSS, EST) here.

Let’s go back to our Constructing a BLAST Query post from last month. We’re going to focus on

Step 4: Choose your database (search set)

Here are my parameters:

  1. Algorithm type: nucleotide BLAST
  2. Query: NM_001182936.1 (Saccharomyces cerevisiae S288c Ras family GTPase RAS2)
  3. Search name: yeast ras2
  4. Database: refseq_RNA
  5. Specific algorithm: blastn (somewhat similar sequences)

I’m starting with a pre-made database and we’ll refine from there. Enter this information into the nucleotide BLAST query page and hit BLAST.

Here is a graphic summary of the results:


Let’s take a look at the results taxonomically by clicking Distance Tree of Results.


This action will open a new window.


The results are from 3 taxonomic groups: yeast, animals and protists. Let’s zero in on the animal group.

First, return to the original BLAST page.

Then, change the view so we only see sequences from animals by clicking Formatting options above the search summary and typing animals in the Organism field in the Limit results section. The taxid for animals will autofill as you type.


The top result is from Drosophila willistoni. Keep that e-value in mind.


Now, instead of filtering the search results from the refseq_RNA database, let’s create a custom search set. First, click Edit and Resubmit above the search summary.


Now, limit the search to animals by entering animals in the Organism field. This parameter reduces the number of records in the search set.
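
One way to keep a record of a search set like this is to write the parameters down in the form used by the BLAST URL API. The sketch below builds the request parameters in Python but does not send anything to NCBI, and I have not run it against the live service. I am assuming that “animals” corresponds to the Metazoa taxon (taxid 33208) and that the web form's refseq_RNA database is named refseq_rna in the API.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"

# Parameters for submitting a search (CMD=Put) through the BLAST URL API.
params = {
    "CMD": "Put",
    "PROGRAM": "blastn",
    "DATABASE": "refseq_rna",  # API name assumed for refseq_RNA
    "QUERY": "NM_001182936.1",
    # Entrez query restricting the search set; txid33208 is assumed to be
    # the taxid that the Organism field fills in for "animals" (Metazoa).
    "ENTREZ_QUERY": "txid33208[Organism]",
}

request_body = urlencode(params)
print(BLAST_URL)
print(request_body)
```

Even if you never submit the request from a script, the printed line is a compact record of exactly which database and Entrez limit you used.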


Now, let’s look at the e-value from Drosophila willistoni again.


It changed from 2e-47 to 1e-47, but all of the other values in the table stayed the same. It changed because the e-value depends on the size of the search set. The larger the search set, the more likely you are to get a coincidental match even though the sequences are not related. Hence, the e-value went down as the size of the search set went down.
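
To see why, it helps to look at a simplified form of the expect value calculation: E = m × n × 2^(−S′), where m is the query length, n is the total length of the search set and S′ is the bit score. The Python sketch below uses that simplified formula and ignores the length corrections that BLAST applies. The numbers are illustrative only and are not the real sizes of these databases.

```python
def expect_value(query_len, db_len, bit_score):
    """Simplified expect value: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Illustrative numbers only; these are not the sizes of the real databases.
query_len = 1_000
bit_score = 200.0
full_db = 4_000_000_000
animal_db = full_db // 2  # pretend the custom search set is half the size

e_full = expect_value(query_len, full_db, bit_score)
e_animals = expect_value(query_len, animal_db, bit_score)
print(e_full, e_animals, e_animals / e_full)
```

With the same alignment and the same bit score, halving the search set halves the e-value, which is the kind of change we saw when the value went from 2e-47 to 1e-47.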

I hope this tutorial has illustrated the importance of creating custom search sets and recording algorithm parameters.

  • Tobin Magle, Biomedical Sciences Research Support Specialist