Bioinformatics Bites: Creating a custom database to search with BLAST

This week’s Bioinformatics Bite will show you how to run a BLAST search in a custom database that you create using Entrez queries. Also, if you’re interested, NCBI has a YouTube video that covers this topic.

To review, there are 2 ways to search the NCBI databases:

  1. Sequence homology via BLAST
  2. Text via Entrez

Remember Entrez? This search engine allows you to apply field tags to refine your search results. These field tags vary by database, which makes sense because different data types call for different contextual fields. For example, you can find the field tags for the sequence databases (Nucleotide, Protein, GSS, EST) here.
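As a rough illustration of how field tags combine into an Entrez query, here is a minimal Python sketch. The helper name entrez_term is made up for this example, and the gene/organism pair is just the yeast RAS2 example used in this post. It only builds the query string and an E-utilities esearch URL; nothing is sent to NCBI.

```python
from urllib.parse import urlencode

def entrez_term(*clauses):
    """Join (value, field_tag) pairs into one Entrez query using AND.

    Hypothetical helper for illustration; it is not part of any NCBI library.
    """
    return " AND ".join(f"{value}[{tag}]" for value, tag in clauses)

# Field tags go in square brackets right after the search value.
term = entrez_term(
    ("Saccharomyces cerevisiae", "Organism"),
    ("RAS2", "Gene Name"),
)

# Build (but do not send) an esearch URL against the Nucleotide database.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
url = ESEARCH + "?" + urlencode({"db": "nuccore", "term": term})

print(term)
print(url)
```

The same term string can be pasted straight into the Nucleotide search box on the NCBI website.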

Let’s go back to our Constructing a BLAST Query post from last month. We’re going to focus on

Step 4: Choose your database (search set)

Here are my parameters:

  1. Algorithm type: nucleotide BLAST
  2. Query: NM_001182936.1 (Saccharomyces cerevisiae S288c Ras family GTPase RAS2)
  3. Search name: yeast ras2
  4. Database: refseq_rna
  5. Specific algorithm: blastn (somewhat similar sequences)

I’m starting with a pre-made database and we’ll refine from there. Enter this information into the nucleotide BLAST query page and hit BLAST.

Here is a graphic summary of the results:

[Figure: graphic summary of the ras2 alignment, all results]

Let’s take a look at the results taxonomically by clicking Distance Tree of Results.

[Figure: the Distance Tree of Results link]

This action will open a new window.


The results are from 3 taxonomic groups: yeast, animals and protists. Let’s zero in on the animal group.

First, return to the original BLAST page.

Then, change the view so we only see sequences from animals by clicking Formatting options above the search summary and typing animals in the Organism field in the Limit results section. The taxid for animals will autofill as you type.

[Figure: BLAST formatting options]

The top result is from Drosophila willistoni. Keep its e-value in mind.

[Figure: Drosophila e-value with results formatted to show animals only]

Now, instead of filtering the search results from the refseq_rna database, let’s create a custom search set. First, click Edit and Resubmit above the search summary.

[Figure: the Edit and Resubmit link]

Now, limit the search to animals by entering animals in the Organism field. This parameter reduces the number of records in the search set.
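For readers who prefer scripting, the same restriction can be sketched as parameters for NCBI's BLAST URL API, where the ENTREZ_QUERY parameter plays the role of the Organism field. This is only a sketch under assumptions: the function name blast_put_params is made up for illustration, and taxid 33208 (Metazoa) is assumed to be the ID that autofills for "animals", so check it against the web form. The code builds the request URL without submitting it.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"

def blast_put_params(entrez_query=None):
    """Return BLAST URL API parameters for the yeast RAS2 search.

    Hypothetical helper for illustration only.
    """
    params = {
        "CMD": "Put",               # submit a new search
        "PROGRAM": "blastn",
        "DATABASE": "refseq_rna",
        "QUERY": "NM_001182936.1",
    }
    if entrez_query:
        # Restrict the search set itself, not just the displayed results.
        params["ENTREZ_QUERY"] = entrez_query
    return params

whole_db = blast_put_params()
# Assumption: txid33208 is Metazoa (animals); ":exp" includes all child taxa.
animals_only = blast_put_params("txid33208[Organism:exp]")

print(BLAST_URL + "?" + urlencode(whole_db))
print(BLAST_URL + "?" + urlencode(animals_only))
```

Keeping the two parameter sets side by side also gives you a record of exactly what was searched, which matters when e-values are compared later.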

[Figure: the Choose Search Set section with the Organism field]

Now, let’s look at the e-value from Drosophila willistoni again.

[Figure: Drosophila e-value when searching animals only]

It changed from 2e-47 to 1e-47, but all of the other values in the table stayed the same. It changed because the e-value depends on the size of the search set. The larger the search set, the more likely you are to get a coincidental match between sequences that are not actually related. Hence, the e-value went down as the size of the database went down.
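To make the dependence on search set size concrete, here is a small sketch based on the standard relationship between a bit score and its expected number of chance hits, E = m × n × 2^(−S'), where m is the query length, n is the total length of the search set, and S' is the bit score. The numbers below are invented for illustration and are not the values from this search. BLAST itself uses effective lengths that are slightly adjusted, so treat this as a simplification.

```python
def expect_value(bit_score, query_len, db_len):
    """Simplified e-value: E = m * n * 2**(-S').

    Uses raw lengths; BLAST applies edge corrections to get
    effective lengths, which this sketch ignores.
    """
    return query_len * db_len * 2.0 ** (-bit_score)

# Hypothetical numbers: same hit, same bit score, two search set sizes.
full = expect_value(bit_score=200.0, query_len=1000, db_len=2_000_000_000)
half = expect_value(bit_score=200.0, query_len=1000, db_len=1_000_000_000)

# The bit score does not change, so the e-value scales with db_len alone.
ratio = half / full
print(ratio)
```

In this simplified model, a search set half the size gives half the e-value, which is the kind of shift a change from 2e-47 to 1e-47 would reflect, while the score and alignment columns stay the same.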

I hope this tutorial has illustrated the importance of creating custom search sets and recording algorithm parameters.

  • Tobin Magle, Biomedical Sciences Research Support Specialist