NCBIx::BigFetch is a Perl module useful for downloading very large result sets of sequences from NCBI given a text query. Its first use had over 11,000,000 sequences as the result of a single keyword search. It stores project state in a YAML configuration file, so that if network or server issues interrupt execution, the download can easily be restarted after the last batch.
Downloaded data is organized by "project id" and "base directory" and saved in text files. Each file includes the project id in its name. The project_id and base_dir keys are the only required keys, although you will get the same search for "apoptosis" every time unless you also set the "query" key. In any case, once a project is started, only those two parameters are needed to reload it.
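As a minimal sketch of the required-key rule described above, the following standalone Perl checks a parameter hash before it is handed to the module. The helper missing_required_keys is hypothetical and not part of NCBIx::BigFetch, and the query value "kinase" is only an illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of NCBIx::BigFetch): return the required
# keys that are absent or empty in a parameter hash.
sub missing_required_keys {
    my ($params) = @_;
    return grep { !defined $params->{$_} || $params->{$_} eq '' }
           qw(project_id base_dir);
}

# Starting a project: set "query" to avoid the default "apoptosis" search.
my $start_params = {
    project_id => "2",
    base_dir   => "/home/user/data",
    query      => "kinase",
};

# Reloading the same project later: the two required keys are enough.
my $reload_params = {
    project_id => "2",
    base_dir   => "/home/user/data",
};

for my $params ( $start_params, $reload_params, { project_id => "3" } ) {
    my @absent = missing_required_keys($params);
    print @absent ? "missing: @absent\n" : "ok\n";
}
```

Either of the first two hashes could then be passed to NCBIx::BigFetch->new() as in the SYNOPSIS; the third would be rejected by this check because base_dir is absent.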
Besides the data files, two other files are saved: 1) the initial search result, which includes the WebEnv key, and 2) a configuration file, which saves the parsed data and is used to pick up the download and recover missing batches or sequences.
Results are retrieved in batches whose size is set by the "return_max" key. By default, the "index" starts at 1 and downloads continue until the index exceeds "count".
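The batching arithmetic can be sketched as follows. This is not the module's own code: batch_starts is a hypothetical helper, and it assumes that the index advances by return_max after each batch, which the description implies but does not state outright.

```perl
use strict;
use warnings;

# Hypothetical helper: start indices of the batches needed to cover
# $count records, fetching $return_max per batch, beginning at $start.
# ASSUMPTION: the index advances by return_max after every batch.
sub batch_starts {
    my ( $count, $return_max, $start ) = @_;
    $start = 1 unless defined $start;
    die "return_max must be positive\n" unless $return_max > 0;

    my @starts;
    for ( my $index = $start; $index <= $count; $index += $return_max ) {
        push @starts, $index;
    }
    return @starts;
}

# 1,750 results fetched 500 at a time need four batches.
my @starts = batch_starts( 1_750, 500 );
print "batches: @starts\n";    # batches: 1 501 1001 1501
```

The loop stops as soon as the index exceeds the count, so the final batch is simply shorter than return_max.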
Occasionally errors happen and entire batches are not downloaded. In this case, the batch "index" is added to the "missing" list, which is saved in the configuration file. The missing batches should be downloaded every day rather than left until the end of the complete run.
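The idea of recording failed batches and retrying them later can be illustrated with a small standalone sketch. The names run_batches and $flaky_fetch are hypothetical, and the fetch is a stand-in that fails once on purpose; the module's real bookkeeping lives in its configuration file and may differ.

```perl
use strict;
use warnings;

# Hypothetical driver: call $fetch->($index) for each batch index and
# collect the indices whose fetch died, instead of aborting the run.
sub run_batches {
    my ( $fetch, @indices ) = @_;
    my @missing;
    for my $index (@indices) {
        my $ok = eval { $fetch->($index); 1 };
        push @missing, $index unless $ok;
    }
    return @missing;
}

# Stand-in for a network fetch: index 501 fails on its first attempt only.
my %attempts;
my $flaky_fetch = sub {
    my ($index) = @_;
    $attempts{$index}++;
    die "timeout\n" if $index == 501 && $attempts{$index} == 1;
    return 1;
};

# First pass over all batches, then a second pass over the missing list.
my @missing = run_batches( $flaky_fetch, 1, 501, 1001 );
print "missing after first pass: @missing\n";

@missing = run_batches( $flaky_fetch, @missing );
print "missing after retry: ", scalar(@missing), "\n";
```

Keeping the missing list separate from the main loop lets the main download continue past a transient failure and defer the retry to a later pass.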
Working scripts are included in the script directory:
fetch-all.pp
fetch-missing.pp
fetch-unavailable.pp
The recommended workflow is:
1. Copy the scripts and edit them for a specific project. Use a new number as the project ID.
2. Begin downloading by running fetch-all.pp, which will first submit a query and save the resulting WebEnv key in a project-specific configuration file (using YAML).
3. The next morning, kill the fetch-all.pp process and run fetch-missing.pp until it completes.
4. Restart fetch-all.pp.
If you wish to re-download "not available" sequences, you may run fetch-unavailable.pp. However, these sequences are also downloaded at the end of fetch-all.pp if it completes normally.
If your query result set is so large that your WebEnv times out, simply start a new project beginning at the last index of the previous project, and it will pick up the result set from there (with a new WebEnv). (A planned upgrade will automagically start another search.)
Warning: You may lose a (very) few sequences if your download extends across multiple projects. However, our testing shows that the batches generated with the same query within a few days of each other are largely identical.
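Continuing a timed-out download in a new project might look like the sketch below. The helper continuation_params is hypothetical, and passing the starting position under an "index" key is an assumption based on the "index" named above; check the module documentation for the actual key. The value 250_001 is purely illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: copy the previous project's parameters, assign a
# new project id, and resume at the given index.
# ASSUMPTION: the starting index is passed under an "index" key.
sub continuation_params {
    my ( $previous, $new_project_id, $last_index ) = @_;
    return +{
        %$previous,
        project_id => $new_project_id,
        index      => $last_index,
    };
}

my $first = {
    project_id => "1",
    base_dir   => "/home/user/data",
    db         => "protein",
    query      => "apoptosis",
    return_max => "500",
};

# New project "2" resumes where project "1" stopped (illustrative index).
my $second = continuation_params( $first, "2", 250_001 );
print "$_ => $second->{$_}\n" for sort keys %$second;
```

Reusing the same query, db and return_max keeps the new result set as close as possible to the old one, which is what limits the loss mentioned in the warning.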
SYNOPSIS
use NCBIx::BigFetch;
# Parameters
my $params = { project_id => "1",
               base_dir   => "/home/user/data",
               db         => "protein",
               query      => "apoptosis",
               return_max => "500" };
# Start project
my $project = NCBIx::BigFetch->new( $params );
# Love the one you're with
print " AUTHORS: " . $project->authors() . "\n";
# Attempt all batches of sequences
while ( $project->results_waiting() ) { $project->get_next_batch(); }
# Get missing batches
while ( $project->missing_batches() ) { $project->get_missing_batch(); }
# Find unavailable ids
my $ids = $project->unavailable_ids();
# Retrieve unavailable ids
foreach my $id ( @$ids ) { $project->get_sequence( $id ); }
Requirements:
· Perl