Bioinformatics Bites: the GEO Databases

Wow! It’s been a long time since I’ve been able to take the time to write a post. Sorry for the hiatus. I have been traveling, then playing catch-up and attending meetings after the trip. I hope to get back to my weekly posts now.

This week’s Bioinformatics Bites addresses finding gene expression information using GEO Profiles.

Background: The Gene Expression Omnibus (GEO) was created by the NCBI to store gene expression data from microarray experiments. Most of the content is still microarray data, but some NGS data is also present (with the raw reads stored in the SRA database). It now also contains other types of high-throughput genomics data, like ChIP-chip.

GEO exists as two separate databases:

  • GEO DataSets: original submitter-supplied records and curated data sets
  • GEO Profiles: Expression profiles of individual genes from curated GEO data sets

GEO DataSets, which are labeled with GDS numbers, are built from three major submitted components:

  • Series (GSE) – List of the expression profiles that were run for the experiment (test, control, replicates)
  • Samples (GSM) – Information about the biological samples used in the experiments, including extraction procedures
  • Platforms (GPL) – What platform the samples were run on (like Affymetrix Mouse Genome 430 2.0 Array)

All three of these components are necessary to make use of the study. Both samples and series link to the raw CEL files, and the platform record gives you information about the chip the samples were run on. NCBI is working to assemble all of the components from each submitted study into a curated DataSet, but there is some lag in the process. NGS studies can’t be curated at this time.
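
Because each record type has its own accession prefix (GDS for curated DataSets, GSE for series, GSM for samples, GPL for platforms), you can tell what kind of record you are looking at from the accession alone. Here is a minimal Python sketch of that idea. The function name `classify_accession` and the `GEO_RECORD_TYPES` table are my own illustration, not part of any NCBI tool, and the accession numbers in the usage note are made up.

```python
import re

# Accession prefixes used by GEO and the record type each one denotes.
GEO_RECORD_TYPES = {
    "GDS": "DataSet",
    "GSE": "Series",
    "GSM": "Sample",
    "GPL": "Platform",
}

# A GEO accession is one of the prefixes above followed by digits.
_ACCESSION = re.compile(r"^(GDS|GSE|GSM|GPL)(\d+)$")


def classify_accession(accession):
    """Return (record_type, numeric_id) for a GEO accession string.

    Raises ValueError if the string does not look like a GEO accession.
    """
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognized GEO accession: {accession!r}")
    prefix, number = match.groups()
    return GEO_RECORD_TYPES[prefix], int(number)
```

For example, `classify_accession("GSM123")` returns `("Sample", 123)`, while a string like `"ABC123"` raises a `ValueError`.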

After curation, the data sets are broken out by gene instead of by experiment, and the data are loaded into GEO Profiles.

If you do a quick search for “cancer”, here is what the results look like.

Figure: GEO DataSets results

This looks a lot like the output of a Gene search, but with different filters. The red box shows that there are over 1000 cancer data sets in GEO, and the red arrow points to the sample count of the top data set (6 samples). You can filter by study type (blue box), things like tissue or strain (purple box), or organism (orange arrow). You also get a search details box, which shows that MeSH terms are applied to your text search, just like in PubMed.
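
The same search can also be run outside the browser through NCBI’s E-utilities, where GEO DataSets is the `gds` database and GEO Profiles is `geoprofiles`. Below is a rough Python sketch that builds an `esearch` URL and pulls the hit count, record IDs, and translated query (the equivalent of the search details box) out of a JSON response. The function names `build_geo_search_url` and `summarize_esearch` are hypothetical helpers of mine, and I am assuming the standard `esearchresult` layout of the JSON output.

```python
import json
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities esearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(term, db="gds", retmax=20):
    """Build an esearch URL; db is 'gds' (DataSets) or 'geoprofiles' (Profiles)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def summarize_esearch(payload):
    """Extract hit count, record IDs, and translated query from esearch JSON text."""
    result = json.loads(payload)["esearchresult"]
    return {
        "count": int(result["count"]),                      # total matching records
        "ids": result.get("idlist", []),                    # IDs on this page of results
        "translation": result.get("querytranslation", ""),  # query as NCBI expanded it
    }
```

To try it, fetch the URL from `build_geo_search_url("cancer")` with `urllib.request.urlopen` (or paste it into a browser) and pass the response text to `summarize_esearch`. I kept the network call out of the sketch so the parsing can be checked against a saved response.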

Now that we have a general idea of how the databases are structured, I will answer a practical question in next week’s post:

What is the expression of the ITGA2 gene in prostate cancer cells?

See you next week!

  • C. Tobin Magle, Biomedical Sciences Research Support Specialist
