How to get accession numbers from sequences or "gi numbers" in GenBank as a batch job?


#1

Hi geniuses,

An issue has been giving me a real headache lately, and I would appreciate your suggestions:

I have a large matrix of about 12,000 taxa, and the data is formatted as below:

Balanopaceae_Balanops_pachyphylla_ti1241920_gi408902408 ATGAGAATCAATCCTACTNNNNNNACTTCTGGTCCTGGAGTTTCCGCGCTTGAAAAAAAGAA

So, using information such as the taxon name or gi number, how can I get the corresponding accession number for each taxon in this large matrix from the NCBI website as a batch job?
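
To make the input concrete: the gi number can be pulled out of each label with a short regular expression. This is only a minimal Python sketch, assuming every label ends in `_gi<digits>` as in the example row above and that each row is `<label> <sequence>` separated by whitespace. The function names are made up for illustration.

```python
import re

# Assumed label layout (taken from the single example row):
#   <Family>_<Genus>_<species>_ti<taxid>_gi<number>
GI_PATTERN = re.compile(r"_gi(\d+)$")


def extract_gi(label):
    """Return the gi number at the end of a taxon label, or None if absent."""
    match = GI_PATTERN.search(label)
    return match.group(1) if match else None


def read_gi_numbers(lines):
    """Collect (label, gi) pairs from matrix rows of the form '<label> <sequence>'."""
    pairs = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        gi = extract_gi(fields[0])
        if gi is not None:
            pairs.append((fields[0], gi))
    return pairs


row = "Balanopaceae_Balanops_pachyphylla_ti1241920_gi408902408 ATGAGAATCAATCC"
print(read_gi_numbers([row]))
```

Labels that do not match the assumed pattern are silently skipped here, so it may be worth comparing the number of pairs against the number of rows.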

Any ideas or experience to share?

Thanks!

Miao


#2

One of the most efficient ways to do this is to use Perl and BioPerl. You can see examples of people doing similar things here: https://www.biostars.org/p/65102/ Extremely powerful!

Maybe someone else can come up with a “scripting-less” approach.

Good luck,

Laura


#3

Thank you very much Laura for your insightful suggestion!

Those examples are very good; I’ll start with Perl.

Thanks again!

Miao


#4

Well, I actually found a “scripting-less” approach on the NCBI website, Batch Entrez: http://www.ncbi.nlm.nih.gov/sites/batchentrez

Hope it will be helpful for someone in the future.
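
Batch Entrez works from an uploaded text file of identifiers, so the gi numbers first have to be collected into such a file. Here is a minimal Python sketch of that step, assuming a plain list of gi numbers with one identifier per line is an acceptable upload format. The function name and the file name are hypothetical.

```python
def write_id_file(gi_numbers, path):
    """Write unique gi numbers to `path`, one per line, keeping first-seen order.

    Returns the number of identifiers written.
    """
    unique = list(dict.fromkeys(gi_numbers))  # drop duplicates, preserve order
    with open(path, "w") as handle:
        handle.write("".join(f"{gi}\n" for gi in unique))
    return len(unique)


# Example call (hypothetical file name):
#   write_id_file(["408902408", "408902408"], "gi_list.txt")
```

The resulting file would then be uploaded on the Batch Entrez page after choosing the matching database (presumably Nucleotide for these sequences).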

Miao


#5

Another option, which I find very straightforward, is to use eBot from NCBI to create a Perl script (http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/ebot/ebot.cgi). You can query Entrez databases and select the type of output you prefer. I have often parsed the output using regular expressions.

Alternatively, NCBI recently published some standalone tools, Entrez Direct, designed exactly for batch queries (http://www.ncbi.nlm.nih.gov/news/02-06-2014-entrez-direct-released/).
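
For anyone who prefers Python to Perl, the same kind of batch query can be sketched directly against the E-utilities `efetch` endpoint. This is a rough sketch, not a tested pipeline. It assumes the records are in the `nuccore` database, that gi numbers are still accepted as identifiers, and that `rettype=acc` returns one accession per line. The batch size, the pause and all function names are my own choices.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_efetch_url(gi_numbers, db="nuccore"):
    """Build an efetch URL asking for accessions as plain text."""
    params = {
        "db": db,                     # assumption: nucleotide records
        "id": ",".join(gi_numbers),
        "rettype": "acc",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def parse_accessions(text):
    """Split the plain-text response into non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_accessions(gi_numbers, batch_size=200, pause=0.4):
    """Query NCBI batch by batch; the pause keeps the request rate low."""
    accessions = []
    for batch in chunks(gi_numbers, batch_size):
        with urlopen(build_efetch_url(batch)) as response:
            accessions.extend(parse_accessions(response.read().decode()))
        time.sleep(pause)
    return accessions


# Only builds and prints the URL; no network request is made here.
print(build_efetch_url(["408902408"]))
```

Calling `fetch_accessions(list_of_gi_numbers)` performs the actual requests. I would not rely on the accessions coming back in the same order as the input without checking a few by hand.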

Good Luck

Eduardo


#6

Better!

Thanks!

Miao