PubMed Changes are Coming

March Updates

By the end of March, the Strauss Health Sciences Library will transition to linking only to the new PubMed interface, so when you access PubMed from the Library, you will be directed to the new interface. If you would like to access the legacy interface again before its retirement, you can use the banner in the new PubMed interface to return to the legacy view.

At some point this Spring or Summer, the National Library of Medicine will retire the legacy interface. We will post more specific information about that as we receive it.

If you have any questions about how to use the new PubMed interface, please AskUs!

February Updates

New features have been added to the new PubMed interface!

Items Per Page (see 1)

Display options now include items per page. Ten, twenty, fifty, one hundred, or two hundred results may be displayed per page.

New Sort Options: Publication Date (see 2), Reverse Sort Order (see 3)

In addition to Best match and Most recent, results may now be sorted by publication date either in ascending or descending order.

Download Results by Year as a CSV File (see 4)

The file will include the search query, the year, and the associated record count.

Similar Articles can now be displayed on a new page of results

Clicking “see all similar articles” (see 5) on any article abstract page will open a new results page listing all similar articles.

This was written by Ellie. You can contact AskUs with questions.

November Updates

It is a bit behind schedule, but it is finally here: you can now reach the new PubMed interface from the current PubMed.

Find the new PubMed interface

The new interface was built using modern web standards with a responsive layout, so it works more effectively on cell phones and tablets.

The updated Best Match sort uses a machine learning algorithm to elevate the most relevant articles to the top of your results list.

Starting in Spring 2020, this new interface will be the default for all PubMed users.

Read more about the changes to the interface from the NLM Technical Bulletin.

Have questions or feedback about the new PubMed interface? Contact NLM with your PubMed Labs Feedback.

This was written by Christi Piper. You can contact AskUs with questions.

September Updates

The new PubMed is going live this month! Are you ready?

We will use this space to keep you updated on the changes that are occurring and provide tips and tricks for using the new interface. You can interact with the beta version of the new PubMed by visiting PubMed Labs. As you use the new interface, please provide NLM with your PubMed Labs Feedback, as they will continue to make improvements to the interface until it becomes the default in January 2020.

Keep in mind that the beta interface is not yet a replacement for the current version of PubMed: it does not yet offer the complete database, in either content or functionality.

Here are the most recent features that have been added to the new PubMed interface:

  • Filters have been added to narrow results by article type, text availability, publication date, species, language, sex, subject, journal category and age.
  • The Health Sciences Article Linker has been added! You can now get to our library holdings from the beta PubMed version.

Keep an eye on the library homepage for information about the new PubMed and quick links to access the site.

This was written by Ellie. You can contact AskUs with questions.

July Updates

In Fall/Winter 2019, PubMed will be undergoing some changes to the interface. If you want to see some of the changes that are coming before the current version of PubMed is replaced, you can visit PubMed Labs, the experimental platform that has some of the major updates already available.

Wondering what’s new? Here are some of the updated features:

Enhanced Search Results

The new version of PubMed (currently PubMed Labs) will have an enhanced relevance sort option, named Best Match, that ranks search results according to several relevance signals, including an article’s popularity, its publication date and type, and its query-document relevance score.

The search results page will now automatically include highlighted text fragments from the article abstract that are selected based on relevance to the search.

Responsive Design

Have you ever tried to use PubMed on your phone or tablet? The current version doesn’t work very well, but the new version of PubMed will feature a mobile-first responsive layout that offers better support for smaller device screens. The new interface will be compatible with any screen size no matter how you access PubMed.

Want to learn more about the new PubMed interface/PubMed Labs? Visit the NLM Technical Bulletin, where this information was taken from, for more details.

Have questions or feedback about the new PubMed interface? Contact NLM with your PubMed Labs Feedback.

This was written by Christi Piper. You can contact AskUs with questions.

“Text Mining, Natural Language Processing, and the Future of the Library”

Larry Hunter, PhD
Professor, Department of Pharmacology
University of Colorado Anschutz Medical Campus

Computational methods of information retrieval have revolutionized librarianship. Developments in text mining and natural language processing are likely to bring equally profound change to how scientists and clinicians interact with the biomedical literature!

Tuesday, October 3, 2017
Reading Room
3rd floor, Health Sciences Library
Lunch provided


Dr. Lawrence Hunter is the Director of the University of Colorado’s Computational Bioscience Program and a Professor of Pharmacology (School of Medicine) and Computer Science (Boulder). He received a Ph.D. in computer science from Yale University in 1989, and then joined the National Institutes of Health as a staff scientist, first at the National Library of Medicine and then at the National Cancer Institute, before coming to Colorado in 2000. Dr. Hunter is widely recognized as one of the founders of bioinformatics; he served as the first President of the International Society for Computational Biology (ISCB), and created several of the most important conferences in the field, including ISMB, PSB and VIZBI. Dr. Hunter’s research interests span a wide range of areas, from cognitive science to rational drug design.


–Kristen Desanto

Bioinformatics Bites: MedGen

This bioinformatics bite is going to be a little bit more clinically oriented:

A patient is presenting with excess blood clotting, which she thinks might be related to “something that runs in her family”. How do I find known diseases and genes (if any) that are associated with that phenotype?

A good place to start to look for information about symptoms and diseases that are related to genetics is MedGen. This database organizes information related to human medical genetics, like symptoms (clinical features), related genes, diseases, or genomic loci.

A perfectly reasonable approach would be to type “clotting” into the MedGen search box. Here’s what those results look like:


There are 94 results, the first of which is a clotting disorder, but one that is associated with too little clotting rather than too much clotting. If you scroll down, you see records that are not actually diseases:


To find out what type of record you’re looking at, look at the text after the concept ID (blue boxes). The screen captures above show a Disease or Syndrome, a Finding, and a Pharmacologic substance. Notice that the disease has links to other databases (green circle) and the others do not.

So how do we specify that we’re looking for a patient symptom related to a genetic disease? Like all the other NCBI databases, MedGen has field tags.

Here are some useful ones:

  • Clinical Features: short stature[clinical features] – records for diseases that are associated with short stature
  • Related Genes: LMNB1[gene] – diseases associated with this gene
  • Disease name: achondroplasia[title] – this disease
  • Chromosome: 6[chromosome] – diseases associated with alterations to chromosome 6
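
Field-tagged searches like these can also be sent to NCBI programmatically. The following is a minimal sketch that only builds the request URL, assuming the E-utilities ESearch endpoint accepts the same field tags that the MedGen search box does; the function name is hypothetical and nothing is sent over the network.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def medgen_search_url(term, field=None, retmax=20):
    """Build an ESearch URL for MedGen (hypothetical helper, builds the URL only).

    `field` is a field tag such as "clinical features", "gene" or "title".
    Assumption: the tags shown for the web search box also work here.
    """
    query = f"{term}[{field}]" if field else term
    params = {"db": "medgen", "term": query, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# The clotting example from this post, restricted to clinical features.
print(medgen_search_url("clotting", "clinical features"))
```

Pasting the printed URL into a browser should return an XML list of matching record IDs, which is a quick way to check that a field tag is being interpreted the way you expect.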

Also, if you look back at the first screen shot, you can see a link that says “See MedGen results with clotting as a clinical feature (5)“. MedGen automatically sensed that clotting was a clinical feature, or symptom, and narrowed your results down for you.

Now we’ve narrowed the MedGen results to those that have clotting listed as a clinical feature. If you read the description, you see that Factor V deficiency is the only one associated with excess clotting. The record also shows what gene is associated with this disorder (F5) and links to descriptions from other resources like GeneReviews and OMIM, as well as Professional guidelines and Recent clinical studies.

result page.png

So how do you find out if this is in fact what your patient has? Find out next week!

-C. Tobin Magle, PhD, Biomedical Sciences Research Support Specialist


Bioinformatics Bites: GEO2R

This blog series has covered how to use both GEO Datasets (which holds both curated and uncurated datasets) and GEO Profiles (which holds expression profiles for individual genes from curated data sets).

But what if you want to see expression profiles of a gene from an uncurated dataset? That’s where GEO2R comes in. Once you’ve identified a dataset by searching GEO Datasets, you can start using GEO2R in 5 easy steps:

  1. Pick your experiment
  2. Define sample groups
  3. Assign samples to groups
  4. Perform the test
  5. Interpret the results table

Pick your experiment:

You need an accession number to start using GEO2R. Which one do you use? Let’s use this dataset from Toxoplasma gondii as an example.

search result

This record contains accession numbers (boxed in red) for the series, samples, and platform. GEO2R is looking for the Series Accession number, GSE73177.  Enter this number into the search field in GEO2R, or click the Analyze with GEO2R link (blue arrow):

define groups

Define sample groups:

This experiment is measuring expression levels in 3 groups (parent strain, knockout strain, and complemented knockout strain**). Thus, we need to create 3 sample groups. To do this, click the “Define Groups” link (green circle above). This action activates a popup that allows you to enter free text to name the groups (red box; wt, ko, and comp in the following example).

define groups 2.png

Assign samples to groups: 

Now you need to tell the program which samples belong to each group. Select the samples that you want to put into a group, then click the group you want to add them to in the Define Groups popup. In the example below, I have selected the complemented knockout samples (highlighted yellow) and will click the “comp” group to add them. After they are added to a group, the corresponding colors change to that of the group and the group column is populated, as in the case of wt and ko in the example. Repeat this process for all of your groups.

assign to group.png

Perform the test:

To get a list of the top 250 differentially expressed genes using the default settings*, scroll down and click the Top 250 button under the GEO2R tab.

A table appears containing the top 250 differentially expressed probes from the platform, with the probe ID, p-value, adjusted p-value, F statistic and probe sequence for each.
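
The adjusted p-value column is there because thousands of probes are tested at once, so some small raw p-values are expected by chance alone. Below is a minimal sketch of the Benjamini-Hochberg adjustment, assuming that is the correction being applied (the Options tab shows the actual setting); the p-values are made up for illustration.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value can be
    # capped by the adjusted values of all larger p-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four probes.
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, bh_adjust(raw)):
    print(f"raw={p:.3f} adjusted={q:.3f}")
```

Each adjusted value is never smaller than the raw value, which is why a probe can look significant by raw p-value but not after adjustment.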

Interpret the results table:

Clicking on the probe ID will show you a graph of the gene expression among the groups you have specified. You can also click Sample Values to get the number values represented on this graph.


In this case, it looks like the gene that is probed by 55.m10280_at is highly expressed in the knockout relative to the wild type, but doesn’t revert to wild type levels in the complemented strain.

To determine the gene name of the probe used in this experiment, visit the corresponding platform record for this series. (You can find this by searching the Series accession in GEO Datasets and using the platform filter.) Then, scroll to the bottom of the page to see the platform data table, which gives probe IDs, identifiers for genes in the Toxoplasma genome database, annotation, chromosome location and a description of the gene function if available. Search for the probe ID in the table to find the corresponding gene. In this case, it’s a thioredoxin domain-containing protein.

But what if your gene of interest is not in the top 250? You can use the Profile Graph tab to search by probe ID.

GEO2R also has basic QC tools. You can see the value distributions across samples to identify large scale problems in the dataset using the value distribution tab:

value distribution.png

Finally, you can retrieve the R script for the analyses run in the R script tab.

NCBI also has a comprehensive tutorial on this tool if you’re interested.

Let me know if you have any questions!

  • C. Tobin Magle, PhD, Biomedical Science Research Support Specialist

* For novice users, the default settings are a good place to start. The calculation uses the limma package, and you can view and change the default settings by clicking the Options tab.

options tab

** The knocked-out gene is added back in to account for off-target effects of the knockout.

Bioinformatics Bites: Expression of a single gene in GEO Profiles

This week’s bioinformatics bite will answer a question from the end of last week’s post:

What is the expression of ITGA2 gene in prostate cancer cells?

I’d check GEO Profiles, because it is a gene-centric question. Let’s start by typing in prostate cancer in the GEO Profiles search box.

GEOProfiles Results

Note the link back to the GEO data set that this gene profile was derived from (green circle). You can also see the platform and the specific probe that measures this gene (orange box). Finally, you can see a cartoon of the expression level between sample and control on the right side of the record.

To narrow down your search to a specific gene, use the filters on the left to select an organism (red box) and a Gene symbol (blue box, make sure to check that you’re using the correct gene symbol). Let’s look for the ITGA2 gene in human derived samples.

GEOProfiles Results Filtered

After applying the filters (blue and red boxes), you can see the search strategy in the Search details (orange box).

To get a close up of the expression graph, click the cartoon on the right (green circle).

GEOProfiles Expression

At a glance, you can see that ITGA2 expression goes up when the microRNA miR-205 is expressed (red bars). The graph also indicates how highly this gene is expressed relative to other genes from the same sample by percentile (blue squares). The expression values from each sample are listed in a table below, along with their rank.

If you need more information about how the samples were prepared, you can click on the GSM number in the table. From there, you can access general information about the experiment by clicking on the Series ID (GSE) on any sample page, or the original GEO profile record.

But what do you do if the data you’re attempting to access are in an uncurated data set? NCBI has a tool for that: GEO2R. We will discuss how to use this tool next time.

  • C. Tobin Magle, PhD, Biomedical Science Research Support Specialist

Bioinformatics Bites: the GEO Databases

Wow! It’s been a long time since I’ve been able to take the time to write a post. I apologize for the hiatus. I have been traveling, playing catch-up, and attending meetings post-travel. I hope to get back to my weekly posts now.

This week’s bioinformatics bites addresses finding gene expression information using GEO profiles.

Background: The Gene Expression Omnibus (GEO) was created by the NCBI to store gene expression data from microarray experiments. Most of the content is still microarrays, but some NGS data is also present (with the raw reads in the SRA database). Additionally, it now contains other types of high-throughput genomics data, like ChIP-chip.

GEO exists as two separate databases:

  • GEO DataSets: original submitter-supplied records and curated data sets
  • GEO Profiles: Expression profiles of individual genes from curated GEO data sets

GEO Data Sets, which are labeled with GDS numbers, have three major components submitted:

  • Series (GSE) – List of expression profiles that were conducted for the experiment (test, control, replicates)
  • Samples (GSM) – Information about the biological samples used in the experiments, including extraction procedures
  • Platforms (GPL) – What platform the samples were run on (like Affymetrix Mouse Genome 430 2.0 Array)

All 3 of these components are necessary to make use of the study. Both samples and series link to the CEL files, and platform gives you information about the chip the samples were run on. NCBI is working to assemble all of the components from each submitted study into a curated DataSet, but there is some lag in the process. NGS studies can’t be curated at this time.
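
The accession prefix tells you which kind of GEO record you are holding. Here is a minimal sketch, assuming the standard GEO prefixes (GSE for series, GSM for samples, GPL for platforms, GDS for curated DataSets); the function name is hypothetical.

```python
import re

# Assumed mapping of GEO accession prefixes to record types.
GEO_RECORD_TYPES = {
    "GSE": "Series",
    "GSM": "Sample",
    "GPL": "Platform",
    "GDS": "Curated DataSet",
}


def classify_geo_accession(accession):
    """Return the record type for a GEO accession, or None if it is not recognized."""
    match = re.fullmatch(r"(GSE|GSM|GPL|GDS)\d+", accession.strip().upper())
    if match is None:
        return None
    return GEO_RECORD_TYPES[match.group(1)]


# GSE73177 is the Toxoplasma gondii series used in the GEO2R post.
print(classify_geo_accession("GSE73177"))
```

A check like this is handy before pasting an accession into a tool such as GEO2R, which expects a series (GSE) accession rather than a sample or platform accession.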

After curation, the data sets are broken out by gene instead of experiment and the data are loaded into GEO profiles.

If you do a quick search for “cancer”, here is what the results look like.

GEO DataSets results

This looks a lot like the output of a Gene search, but with different filters. You can see in the red box that there are over 1000 cancer data sets in GEO, and that the top data set has 6 samples by following the red arrow. You can filter by study type (blue box), things like tissue or strain (purple box), or Organism (orange arrow). You also get a search details box, which shows that MeSH terms are applied to your text search, just like PubMed.

Now that we have a general idea of how the databases are structured, I will answer a practical question in next week’s post:

What is the expression of ITGA2 gene in prostate cancer cells?

See you next week

  • C. Tobin Magle, Biomedical Sciences Research Support Specialist

Bioinformatics Bites: Repurposing publicly available data

I’m going in a bit of a different direction with this bioinformatics bite segment. Instead of explicitly describing how to use a database or tool, I wanted to tell you all about a little project that I’m working on with a vet student from CSU.

Just because I don’t have a lab or large amounts of research funding doesn’t mean I can’t do science! There’s a wealth of bioinformatic data publicly available online and some user friendly tools that are available. We’re using these freely available resources to ask questions about how the distribution of microbes in the environment correlates with specific landmarks.

Our research question is as follows: Are there differences in microbial populations at sites that are close to zoos in the NYC area compared to sites farther away? To address this question, we are using data from the PathoMap project, which swabbed surfaces in transit stops all over New York City. (This project is also being expanded to the top 10 cities worldwide for public transit ridership by a project called MetaSub.) See this publication for more information.

PathoMap main page

We determined test and control sites by looking up how city planners determine the distance people are willing to walk to a transit stop. We are also using a tool called GenGIS 2 to analyze and visualize these data. See this publication for more information on GenGIS.

GenGIS output from their wiki

Repurposing data forces you to think differently. Our research question was designed in the context of what data and tools were publicly available. This type of research will only get easier as the mindset of creating well-designed community resources expands. Initiatives like Big Data to Knowledge (BD2K) and the Center for Open Science are driving this new trend.

If you’re curious about data repurposing, or how to find datasets and tools, please see the bioinformatics section of our new Research Support Pages or set up a consultation to discuss your questions.

  • Tobin Magle, biomedical sciences research support specialist.

Bioinformatics Bites: Creating a custom database to search with BLAST

This week’s Bioinformatics Bite will show you how to run a BLAST search in a custom database that you create using Entrez queries. Also, if you’re interested, NCBI has a YouTube video that covers this topic.

To review, there are 2 ways to search the NCBI databases:

  1. Sequence homology via BLAST
  2. Text via Entrez

Remember Entrez? This search engine allows you to apply field tags to refine your search results. These field tags vary by database, which makes sense because different datatypes necessitate different contextual fields. For example, you can find the field tags for the sequence databases (Nucleotide, Protein, GSS, EST) here.

Let’s go back to our Constructing a BLAST Query post from last month. We’re going to focus on

Step 4: Choose your database (search set)

Here are my parameters:

  1. Algorithm type: nucleotide BLAST
  2. Query: NM_001182936.1 (Saccharomyces cerevisiae S288c Ras family GTPase RAS2)
  3. Search name: yeast ras2
  4. Database: refseq_RNA
  5. Specific algorithm: blastn (somewhat similar sequences)

I’m starting with a pre-made database and we’ll refine from there. Enter this information into the nucleotide BLAST query page and hit BLAST.

Here is a graphic summary of the results:

ras2 alignment, all results (click to enlarge)

Let’s take a look at the results taxonomically by clicking Distance Tree of Results.

open distance tree

This action will open a new window.


The results are from 3 taxonomic groups: yeast, animals and protists. Let’s zero in on the animal group.

First, return to the original BLAST page.

Then, change the view so we only see sequences from animals by clicking Formatting options above the search summary and typing animals in the Organism field in the Limit results section. The taxid for animals will autofill as you type.

blast formatting

The top result is from Drosophila willistoni. Keep that e-value in mind.

dros eval format animals

Now, instead of filtering the search results from the refseq_RNA database, let’s create a custom search set. First, click Edit and Resubmit above the search summary.

edit and resubmit

Now, limit the search to animals by entering animals in the Organism field. This parameter reduces the number of records in the search set.

choose search set
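
The same kind of restriction can be written down as request parameters, which is a convenient way to record exactly what search set was used. This is a minimal sketch, assuming NCBI's BLAST URL API and its ENTREZ_QUERY parameter; the database spelling refseq_rna and the query Metazoa[Organism] (standing in for typing animals in the Organism field) are assumptions, and the code only assembles the parameters without submitting anything.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def blast_put_params(query, database, program, entrez_query=None):
    """Assemble parameters for a BLAST submission (hypothetical helper, no network call)."""
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query,
    }
    if entrez_query:
        # Restrict the search set with an Entrez query.
        params["ENTREZ_QUERY"] = entrez_query
    return params


# The yeast RAS2 query from this post, limited to animals (assumed Entrez term).
params = blast_put_params(
    query="NM_001182936.1",
    database="refseq_rna",
    program="blastn",
    entrez_query="Metazoa[Organism]",
)
print(BLAST_URL + "?" + urlencode(params))
```

Keeping the parameters in a small script like this makes it easier to report the search set alongside the e-values it produced.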

Now, let’s look at the e-value from Drosophila willistoni again.

dros eval animals only

It changed from 2e-47 to 1e-47, but all of the other values in the table stayed the same. It changes because the e-value depends on the size of the search set. The larger the search set, the more likely you are to get a coincidental match even though the sequences are not related. Hence, the e-value went down as the size of the database went down.
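
To see why a smaller search set lowers the e-value, here is a minimal sketch of the standard relationship E = m × n × 2^(−S′), where m is the query length, n is the total length of the search set, and S′ is the bit score. The numbers are hypothetical, and real BLAST uses length-adjusted (effective) values, so treat this as an illustration rather than a reproduction of the result above.

```python
def expect_value(bit_score, query_length, search_set_length):
    """Expected number of chance hits scoring at least `bit_score`.

    Simplified form of the Karlin-Altschul relationship:
    E = m * n * 2 ** (-S'), ignoring BLAST's effective-length corrections.
    """
    return query_length * search_set_length * 2.0 ** (-bit_score)


# Hypothetical values: same alignment, same bit score, two search set sizes.
bit_score = 180.0
query_length = 1000
full_set = 2_000_000_000      # e.g. the whole database
animals_only = 1_000_000_000  # e.g. a custom search set half the size

e_full = expect_value(bit_score, query_length, full_set)
e_custom = expect_value(bit_score, query_length, animals_only)
print(e_full / e_custom)  # the ratio equals the ratio of the search set sizes
```

The bit score does not change when the search set shrinks, which matches the observation that only the e-value moved in the results table.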

I hope this tutorial has illustrated the importance of creating custom search sets and recording algorithm parameters.

  • Tobin Magle, Biomedical Sciences Research Support Specialist

Bioinformatics Bites: PubChem

This week’s bioinformatics bite is being outsourced to the NCBI Insights blog. Enjoy a concise explanation of how to identify chemical targets to find cross reactions and prevent drug side effects.

  • Have a nice long weekend,

Tobin Magle, PhD, Biomedical sciences research support specialist

NCBI cracks 200 annotated eukaryotic genomes

The National Center for Biotechnology Information (NCBI) is the central repository for molecular data in the US. They don’t generate their own data: all of the sequences in their databases are submitted by external sources such as research labs and genomic sequencing consortia, and through the INSDC.

They do, however, provide an interface to access these data and create tools to analyze the data, which includes annotating a select set of genomes. Because NIH is primarily devoted to human health, they prioritize eukaryotic genomes, especially mammals.

The NCBI has annotated over 200 eukaryotic genomes so far. Do you want to see your favorite organism annotated by NCBI? They take requests through their Help Desk.

  • Tobin Magle, PhD- Biomedical Sciences Research Support Specialist.

Bioinformatics Bites: Primer BLAST

This week’s bioinformatics bites is going to look over the features of Primer BLAST.

Back in my day (circa 2001-2014), we designed our primers by hand! Most of the places I worked, we’d have a printout of the genomic sequence we were working on that was annotated with transcripts and restriction sites. We’d eyeball a good primer site and use OligoAnalyzer to find Tm and hairpins and self annealing and all that. One group I worked in would calculate Tm by counting up all the G’s and C’s and multiplying that by 4, then counting all the A’s and T’s and multiplying that by 2, and adding those two numbers together. There was no thought given to contaminating products in the design process. That was all trial and error in the PCR machine.

It was the dark ages.

Luckily, NCBI designed primer BLAST to help you all out. The primary function of this tool is primer design. First, enter a template.


(Click to enlarge images)

First, provide a unique identifier (accession or GI) for a record that’s in GenBank (like the NM_ number for an mRNA) or paste in a nucleotide sequence in FASTA format. I’m going to use the accession number (NM_001302688.1) for the human APOE gene mRNA.

Then, specify where you want your forward (sense) and reverse (antisense) primers to be in that sequence.

Finally, specify information about your primers (Tm) and desired PCR product (length). You can also use this tool to create a matching primer for a preexisting primer. We’ll use the example  5′-GGGAGCCCTATAATTGGACAAG-3′
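
For a sense of what a hand Tm estimate looks like for this primer, here is a minimal sketch of the rule of thumb often called the Wallace rule (4 °C for each G or C, 2 °C for each A or T). It is a rough approximation intended for short oligos, and the Tm that Primer BLAST reports will generally differ because it uses a more detailed model.

```python
def wallace_tm(primer):
    """Rough melting temperature in degrees C: 4 per G/C plus 2 per A/T."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    at = primer.count("A") + primer.count("T")
    return 4 * gc + 2 * at


# The example forward primer from this post.
forward_primer = "GGGAGCCCTATAATTGGACAAG"
print(wallace_tm(forward_primer))
```

This primer has 11 G/C bases and 11 A/T bases, so the estimate works out to 4 × 11 + 2 × 11 = 66.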


If you used a refseq mRNA as your template, specify parameters involving introns and exons, such as whether the  primers span an exon-exon junction and how many bases have to match on either side of the junction. You can also specify whether you want the product to span an intron on the genomic DNA.

Primer blast exonintron

Finally, specify how you want the algorithm to check for specificity of your primers by selecting your organism (or possible contaminating organism), the database you want to search, target size and specific primer parameters (mismatches, etc.).

PrimerBLAST specificity

For this demo, I used the template NM_001302688.1 and specified that my forward primer is 5′-GGGAGCCCTATAATTGGACAAG-3′. I also specified that I want the product to include an intron. All other values were kept as default. (Another handy feature is that the interface highlights non-default parameters in yellow.)

After clicking submit, a nice graphical summary of the results is displayed. Note that because I specified a forward primer, all the PCR products (in blue) start in the same place:

primer BLAST graphic

For each primer pair, the sequence of both primers, the product length, and the size of the intron they span are included, along with information about Tm, length, start and stop positions, and self binding for each primer.

primer blast primer pair

This is where the hack I mentioned last week comes in: because the primer that we entered into the search is also mapped onto the template, Primer BLAST effectively tells you the binding site of the primer in terms of its position on the transcript we entered as the template. Pretty cool! Changing the template to the DNA accession number that contains this sequence would give us the genomic position of the primer, but we wouldn’t be able to do the cool stuff with introns.

Finally, it displays potential contaminating products you might see with human DNA as the PCR template:

primer blast unintended targets

In this case, it looks like we’d be amplifying another transcript variant of APOE. Definitely something to look out for when doing qPCR!

I hope you find this tool useful. Please let me know if I can help you with using primer BLAST or any other NCBI databases and tools.

  • Tobin Magle, Biomedical Sciences Research Support Specialist

Bioinformatics bites: How do I find primer binding sites?

This week’s bioinformatics bite comes from another actual patron question (paraphrased):

I have all these primers that someone else designed. How do I figure out where they bind and what they amplify?

Disclaimer: this isn’t actually the answer I gave to the person seeking help, but I’ve since found a more efficient tool.

Probably the fastest way to get this information is to use a simple tool called Primer Map.

Conveniently, they have an example primer mapping loaded into the browser:

Map these Primers:

(reverse) aacagctatgaccatg,
(T3) attaaccctcactaaag,
(KS) cgaggtcgacggtatcg,
(SK) tctagaactagtggatc,
(T7) aatacgactcactatag,
(-40) gttttcccagtcacgac,
(Sp6) atttaggtgacactatag,
(M13 for) gtaaaacgacggccagt,
(M13 rev) cacacaggaaacagctatgaccat,
(BGH rev) tagaaggcacagtcgagg,
(pGEX for) ctggcaagccacgtttggtg,
(pGEX rev) ggagctgcatgtgtcagagg,
(T7-EEV) aaggctagagtacttaatacga,
(pUC/M13 Forward) gttttcccagtcacgac,
(pUC/M13 forward) cgccagggttttcccagtcacgac,
(pUC/M13 reverse) caggaaacagctatgac,
(pUC/M13 reverse) tcacacaggaaacagctatgac,
(Glprimer1) tgtatcttatggtactgtaactg,
(GLprimer2) ctttatgtttttggcgtcttcca,
(RVprimer3) ctagcaaaataggctgtccc,
(RVprimer4) gacgatagtcatgccccgcg,
(Lambda gt11 Forward) ggtggcgacgactcctggagcccg,
(Lambda gt11 Reverse) ttgacaccagaccaactggtaatg,
(Lambda gt10 Forward) cttttgagcaagttcagcctggttaag,
(lambda gt10 Reverse) gaggtggcttatgagtatttcttccagggta,
(Pinpoint Sequencing) cgtgacgcggtgcagggcg,
(pTarget Sequencing) ttacgccaagttatttaggtgaca

To this sequence:

>sample sequence


Once the template is in the top box, and the primers are in the bottom box, hit Submit. (The output gets a lot cleaner if you turn translation and restriction enzyme displays off in the settings.)

The results pop up in a new window. The first results section show where the primers bind, with forward (sense) primers highlighted in purple and reverse (antisense) primers highlighted in orange.


The second part of the results page shows a table with all the primers that you input, highlighting the ones that bound with the color that indicates their orientation:


The only thing that is missing is a column that indicates the position on the template at which the primers bind. I guess beggars can’t be choosers though.
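
If you do need the positions, exact-match primer mapping is simple enough to sketch in a few lines. This is a minimal illustration, not how Primer Map itself works: it only finds perfect matches, the function names are hypothetical, and the template is a made-up sequence built from two of the example primers above.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (a/c/g/t only)."""
    complement = {"a": "t", "c": "g", "g": "c", "t": "a"}
    return "".join(complement[base] for base in reversed(seq.lower()))


def find_primer_sites(template, primers):
    """Return (name, strand, 1-based start, 1-based end) for every exact match."""
    template = template.lower()
    hits = []
    for name, primer in primers.items():
        primer = primer.lower()
        # A forward primer matches the template as written; a reverse primer
        # matches the template through its reverse complement.
        for strand, target in (("forward", primer),
                               ("reverse", reverse_complement(primer))):
            start = template.find(target)
            while start != -1:
                hits.append((name, strand, start + 1, start + len(target)))
                start = template.find(target, start + 1)
    return hits


# Two primers from the example list, placed on a hypothetical template.
primers = {
    "M13 for": "gtaaaacgacggccagt",
    "pUC/M13 reverse": "caggaaacagctatgac",
}
template = ("ttt" + primers["M13 for"] + "acgtacgt"
            + reverse_complement(primers["pUC/M13 reverse"]) + "ggg")

for hit in find_primer_sites(template, primers):
    print(hit)
```

Real primers often bind with a mismatch or two, which is one reason a BLAST-based tool gives more complete answers than exact string matching.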

Next time I’ll show you how to do this using NCBI’s Primer BLAST. This algorithm is actually built for primer design, but it can be hacked for this purpose and provides better visualizations and more information.

  • Tobin Magle, PhD, Biomedical Sciences Research Support Specialist

Bioinformatics Bites: R tutorials for beginners

Today’s Bioinformatics Bite is based on a hypothetical question that I think a lot of people are afraid to ask:

I hear R is a great tool for doing bioinformatic analysis, but I have no idea how to code. How can I get started?

Well, I’d say the first step is to Install R.

This first installation installs the R coding language and a bare-bones editor to write and run code in.

If you want a nicer interface, I’d suggest installing RStudio, which has a lot of bells and whistles that will make using R a lot easier. RStudio contains a text editor with highlighting, integrated help functions, an environment window that reminds you what variables you created, and a console that allows you to execute your code right from the editor.


But how do you even know where to begin now that you have it installed? A good place to start for R basics is Swirl. Swirl is a tutorial system that you can use INSIDE R. All you have to do to install Swirl is type

> install.packages("swirl")

inside the R console, and swirl will be automatically installed!

Now to run Swirl, just type

> library("swirl")
> swirl()

in the R console to load the swirl library and run the tutorials. Then you can pick from a variety of introductory tutorials that are closely linked to courses in the Johns Hopkins University Data Science Specialization on Coursera. Now we just need to get someone to write some Swirl tutorials for Bioconductor.

If you have any questions about how to set up R, don’t hesitate to Ask.

Tobin Magle,

Biomedical Sciences Research Support Specialist

Bioinformatics bites: Navigating BLAST results

This week’s Bioinformatics Bite will pick up where last week’s post left off: the results page after you hit the BLAST button.

The results page has several sections that we will go through individually:

  1. Search details: What you entered into your query
  2. Graphic Summary: A visual representation comparing your query to the results
  3. Descriptions: Details about what the results are and how well they match the Query
  4. Alignments: The actual alignment of your query with each result (Subject)

Search details

BLAST search details


The search details are essentially a recap of the parameters you entered into your search (the name of your search, molecule type, query length, database name, and the algorithm/program that you ran), but this section also gives your search results a unique ID called an RID (Request ID). The RID is a temporary link to get back to these search results just in case your computer crashes, you close the window, etc. You can get back to them by going to the BLAST main page and clicking on the gray “Recent Results” tab. BLAST also creates a temporary query ID. If you prefer a video tutorial, there’s a link to a YouTube video about how to read the BLAST results page. This section is also where you can edit and save your search.
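If you would rather bookmark a result than hunt for it in “Recent Results”, the RID can also be plugged into NCBI’s BLAST URL API. The sketch below only builds the link; it does not contact NCBI. The RID shown is a made-up placeholder and the function name is mine, so swap in the RID from your own results page, and remember that the link stops working once NCBI expires the RID.

```python
from urllib.parse import urlencode

# Base address of the NCBI BLAST URL API.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def results_url(rid, format_type="Text"):
    """Build a link that asks BLAST for the results stored under an RID.

    CMD=Get retrieves an existing search; FORMAT_TYPE picks the output format.
    """
    params = {"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type}
    return BLAST_URL + "?" + urlencode(params)


# "EXAMPLE0RID" is a hypothetical placeholder, not a real search.
print(results_url("EXAMPLE0RID"))
```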

Graphic Summary


The Graphic Summary visualizes how the search results align with your query sequence. The colored boxes at the top are a key for the alignment score of each search result. Higher numbers (red) are better. The thick red bar under the color key represents the Query. The thinner lines below it show how the search results align with the query. The first (best) matches align along the whole length of the query, while the last few are missing some base pairs at the end.

If you click on one of the red results bars, the screen will jump to the actual alignment for that result and your query, but first let’s look at the Descriptions.


BLAST result descriptions

The Descriptions section contains a table with several informative columns:

  1. Description: how the result is annotated in the database
  2. Max Score: alignment score for the best matched segment
  3. Total Score: alignment score for the whole result
  4. Query Cover: how much of your query is included in the alignment
  5. E value: “expect value”, the number of false positives expected by chance
  6. Identity: % of nucleotides that are identical between query and result
  7. Accession: the unique identifier for the result sequence (with link)

The max and total scores (2 and 3) are similarity scores, which measure how well the query and subject match. Usually the score is calculated by adding points for all bases that match and subtracting points for mismatches and gaps. I will cover similarity scores in more depth in another post about algorithm parameters. As you can probably tell, if the subject and query match reasonably well, then the longer the sequence, the higher the score. This means that you can’t compare similarity scores between different BLAST queries; the score is a way of ranking results within one search.
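To make the “add points for matches, subtract for mismatches and gaps” idea concrete, here is a minimal Python sketch that scores two rows of an alignment that has already been made. The point values (+1, -2, -2) and the function name are illustrative assumptions, not necessarily what your BLAST search used. Real BLAST also charges separately for opening and extending a gap, and the scores in the table are normalized “bit scores”, so treat this as a toy version.

```python
def raw_alignment_score(query_row, subject_row, match=1, mismatch=-2, gap=-2):
    """Score an existing alignment column by column.

    Both rows must have the same length, with "-" marking a gap.
    The default point values are assumptions chosen for illustration.
    """
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for q, s in zip(query_row, subject_row):
        if q == "-" or s == "-":
            score += gap        # one row has a gap in this column
        elif q == s:
            score += match      # identical bases
        else:
            score += mismatch   # different bases
    return score


print(raw_alignment_score("ACGTAC", "ACGTAC"))              # perfect, short
print(raw_alignment_score("ACGTACGTACGT", "ACGTACGTACGT"))  # perfect, long
print(raw_alignment_score("ACGTAC", "ACCTAC"))              # one mismatch
print(raw_alignment_score("ACGTAC", "AC-TAC"))              # one gap
```

The two perfect alignments get different scores purely because one is longer, which is why scores only make sense for ranking hits within a single search.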

Query cover is the % of the query that is matched by the subject. In this case, all but the last 4 are 100% because they cover the whole query.

The E value, aka the “expect value”, is the number of matches you’d expect to get by random chance (that is, false positives) in a given database/query combo, so smaller is better. Because BLAST is meant to get at evolutionary relationships among sequences, another way of thinking about an E value is as a measure of how likely it is that you got this search result even though the subject and the query are not evolutionarily related.
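For the curious, the standard relationship between a hit’s bit score and its E value is E = m × n × 2^(-S), where S is the bit score, m is the query length and n is the total length of the database. The sketch below just plugs numbers into that formula. The lengths and scores are made up, and real BLAST uses slightly shortened “effective” lengths, so don’t expect it to reproduce the table exactly.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least `bit_score`.

    Uses E = m * n * 2**(-S) with raw lengths; BLAST itself uses
    edge-corrected "effective" lengths, which this sketch ignores.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical numbers: a 100-base query against a billion-base database.
weak_hit = expect_value(40, 100, 10**9)
strong_hit = expect_value(200, 100, 10**9)
print(weak_hit, strong_hit)

# Every extra bit of score halves the E value...
print(expect_value(41, 100, 10**9) / weak_hit)
# ...while doubling the database size doubles it.
print(expect_value(40, 100, 2 * 10**9) / weak_hit)
```

This is also why the same hit can get a different E value when you search a bigger or smaller database.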

The identity is the percentage of nucleotides in the alignment that are exact matches between the subject and the query.
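Query cover and identity are both simple percentages, and a small Python sketch shows how they differ. It assumes a single aligned segment and made-up sequences and positions; the function names are mine, not BLAST’s. Identity is counted over the columns of the alignment, while query cover is counted over the length of your query.

```python
def percent_identity(query_row, subject_row):
    """Percent of alignment columns where both rows show the same base."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    matches = sum(
        1 for q, s in zip(query_row, subject_row) if q == s and q != "-"
    )
    return 100.0 * matches / len(query_row)


def query_cover(query_start, query_end, query_length):
    """Percent of the query inside one aligned segment (1-based, inclusive)."""
    return 100.0 * (query_end - query_start + 1) / query_length


# Hypothetical hit: 10 aligned columns with one mismatch...
print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))
# ...and an alignment spanning bases 1-96 of a 100-base query.
print(query_cover(1, 96, 100))
```

A hit can have 100% identity and still have low query cover if only a short stretch of the query aligned, so it is worth checking both columns.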

Finally, the Accession number is the unique identifier for the search result, with a link. This link maps to the entire sequence, not just the part that matches. For example, it would pull up the entire contig for a genomic hit or a whole transcript for an mRNA hit.

If you click on the description for a hit, it will move the page down to the alignment.


BLAST Alignment

The alignment lines up the query (top) and the subject (bottom) base by base along the whole length of the query. Vertical lines indicate exact matches. Horizontal lines (dashes) indicate gaps in the sequence. The numbers on the sides of the alignment refer to the base position of each sequence.

One of the most useful features here is the related information section on the right. This will have links to other NCBI databases like Gene where you can find more information about a given search result.

Hope this was useful. We’re not done with BLAST yet though. Upcoming posts will discuss filtering your search results, adjusting algorithm parameters, saving BLAST searches, and creating custom search databases.

-Tobin Magle, Biomedical Sciences Research Support Specialist