This week’s Bioinformatics Bite will show you how to run a BLAST search in a custom database that you create using Entrez queries. Also, if you’re interested, NCBI has a YouTube video that covers this topic.
To review, there are 2 ways to search the NCBI databases:
Remember Entrez? This search engine allows you to apply field tags to refine your search results. These field tags vary by database, which makes sense because different datatypes necessitate different contextual fields. For example, you can find the field tags for the sequence databases (Nucleotide, Protein, GSS, EST) here.
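If you like to build queries programmatically, here is a minimal sketch of how field tags combine into an Entrez query string. The helper names `tag` and `all_of` are made up for illustration, and the example assumes the `[Organism]` and `[Gene Name]` field tags from the sequence databases; check the field list for the database you are actually searching.

```python
def tag(term, field):
    """Attach an Entrez field tag to a term, quoting multi-word terms."""
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{field}]"


def all_of(*clauses):
    """Join tagged clauses with AND so every clause must match."""
    return " AND ".join(clauses)


# Hypothetical example: yeast records for the RAS2 gene
query = all_of(
    tag("Saccharomyces cerevisiae", "Organism"),
    tag("RAS2", "Gene Name"),
)
print(query)
# "Saccharomyces cerevisiae"[Organism] AND RAS2[Gene Name]
```

You can paste the printed string straight into an Entrez search box.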
Let’s go back to our Constructing a BLAST Query post from last month. We’re going to focus on
Step 4: Choose your database (search set)
Here are my parameters:
- Algorithm type: nucleotide BLAST
- Query: NM_001182936.1 (Saccharomyces cerevisiae S288c Ras family GTPase RAS2)
- Search name: yeast ras2
- Database: refseq_rna
- Specific algorithm: blastn (somewhat similar sequences)
I’m starting with a pre-made database, and we’ll refine from there. Enter this information into the nucleotide BLAST query page and click BLAST.
Let’s take a look at the results taxonomically by clicking Distance Tree of Results.
This action will open a new window.
The results are from 3 taxonomic groups: yeast, animals and protists. Let’s zero in on the animal group.
First, return to the original BLAST page.
Then, change the view so we only see sequences from animals by clicking Formatting options above the search summary and typing animals in the Organism field in the Limit results section. The taxid for animals will autofill as you type.
The top result is from Drosophila willistoni. Keep that e-value in mind.
Now, instead of filtering the search results from the refseq_rna database, let’s create a custom search set. First, click Edit and Resubmit above the search summary.
Now, limit the search to animals by entering animals in the Organism field. This parameter reduces the number of records in the search set, so BLAST compares your query against fewer sequences.
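The same restriction can be expressed outside the web form. The sketch below only builds the request parameters for the NCBI BLAST URL API and does not submit anything. It assumes the `CMD`, `PROGRAM`, `DATABASE`, `QUERY`, and `ENTREZ_QUERY` parameters of that API, and it assumes taxid 33208 (Metazoa) is what the form fills in for animals; the function name `build_put_params` is hypothetical. The specific algorithm choice (somewhat similar sequences) is not set here.

```python
from urllib.parse import urlencode

# Endpoint of the NCBI BLAST URL API (shown for reference; no request is sent)
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_put_params(entrez_query=None):
    """Build search-submission parameters, optionally limited by an Entrez query."""
    params = {
        "CMD": "Put",               # submit a new search
        "PROGRAM": "blastn",        # nucleotide BLAST
        "DATABASE": "refseq_rna",   # the pre-made database used above
        "QUERY": "NM_001182936.1",  # yeast RAS2 accession
    }
    if entrez_query:
        # Restricts the search set itself, not just the displayed results
        params["ENTREZ_QUERY"] = entrez_query
    return params


full_search = build_put_params()
# Assumption: txid33208 is the taxid for animals (Metazoa)
animals_only = build_put_params("txid33208[Organism]")

print(urlencode(full_search))
print(urlencode(animals_only))
```

The only difference between the two parameter sets is the Entrez query, which is what turns the pre-made database into a custom search set.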
Now, let’s look at the e-value from Drosophila willistoni again.
It changed from 2e-47 to 1e-47, but all of the other values in the table stayed the same. The e-value changed because it depends on the size of the search set. The larger the search set, the more likely you are to get a coincidental match between sequences that are not actually related. Hence, the e-value went down as the size of the database went down.
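To see why the size of the search set matters, here is a rough sketch based on the standard BLAST statistics, where the expected number of chance hits is E = m × n × 2^(−S'), with m the query length, n the total length of the search set, and S' the bit score. The lengths and bit score below are made-up numbers chosen for illustration, not values from this search, and the real program uses adjusted (effective) lengths, so treat this as an approximation.

```python
def expect_value(query_len, db_len, bit_score):
    """Approximate e-value: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers: same query and same alignment score,
# but the second search set is half the size of the first.
full_db = expect_value(query_len=1000, db_len=4_000_000_000, bit_score=200.0)
half_db = expect_value(query_len=1000, db_len=2_000_000_000, bit_score=200.0)

print(full_db, half_db)
print(full_db / half_db)  # halving the search set halves the e-value
```

Because the alignment itself did not change, the score stays the same and only the e-value moves with the size of the search set.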
I hope this tutorial has illustrated the importance of creating custom search sets and recording algorithm parameters.
- Tobin Magle, Biomedical Sciences Research Support Specialist