mscerts.net
 
 
 
 

NCBIx::BigFetch 0.56

Robustly retrieve very large NCBI sequence result sets based on keyword searches using NCBI eUtils

NCBIx::BigFetch is a Perl module for downloading very large result sets of sequences from NCBI given a text query. Its first use had over 11,000,000 sequences as the result of a single keyword search. It writes a YAML configuration file to maintain project state, so that if network or server issues interrupt execution, the download can easily be restarted after the last batch.

Downloaded data is organized by "project id" and "base directory" and saved in text files. Each file includes the project id in its name. The project_id and base_dir keys are the only required keys, although you will get the same search for "apoptosis" every time unless you also set the "query" key. In any case, once a project is started, only these two parameters are needed to reload it.

Besides the data files, two other files are saved: 1) the initial search result, which includes the WebEnv key, and 2) a configuration file, which saves the parsed data and is used to pick up the download and recover missing batches or sequences.

Results are retrieved in batches whose size is set by the "return_max" key. By default, the "index" starts at 1 and downloads continue until the index exceeds "count".
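The batching arithmetic described above can be sketched in plain Perl without contacting NCBI. This is an illustration only: the batch_starts helper is hypothetical and not part of NCBIx::BigFetch, and it assumes that each batch advances the index by "return_max".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of NCBIx::BigFetch): list the starting
# index of every batch, assuming the index begins at 1 and advances by
# return_max until it exceeds count.
sub batch_starts {
    my ( $count, $return_max ) = @_;
    die "return_max must be positive\n" unless $return_max > 0;
    my @starts;
    for ( my $index = 1; $index <= $count; $index += $return_max ) {
        push @starts, $index;
    }
    return @starts;
}

my @starts = batch_starts( 1_200, 500 );
print "batch starts: @starts\n";    # prints "batch starts: 1 501 1001"
```

With a count of 1,200 and a return_max of 500, three batches are attempted, and the last one is only partially filled.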

Occasionally errors happen and entire batches are not downloaded. In this case, the "index" of the failed batch is added to the "missing" list, which is saved in the configuration file. The missing batches should be downloaded every day rather than left until the end of the complete run.
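The bookkeeping for missing batches might look roughly like the following sketch. It is not the module's implementation: run_batches, retry_missing and the $fetch callback are hypothetical names, the state is held in a plain hash rather than the YAML configuration file, and the download is simulated so that the example runs offline.

```perl
use strict;
use warnings;

# Hypothetical sketch of the bookkeeping only; these names are not taken
# from NCBIx::BigFetch. $fetch stands in for a real download and returns
# true on success.
sub run_batches {
    my ( $state, $fetch ) = @_;
    while ( $state->{index} <= $state->{count} ) {
        my $index = $state->{index};
        # A failed batch is recorded by its starting index for a later retry.
        push @{ $state->{missing} }, $index unless $fetch->($index);
        $state->{index} += $state->{return_max};
    }
    return $state;
}

sub retry_missing {
    my ( $state, $fetch ) = @_;
    # Keep only the batches that fail again.
    $state->{missing} = [ grep { !$fetch->($_) } @{ $state->{missing} } ];
    return $state;
}

# Simulate a run in which the batch starting at 501 fails exactly once.
my %fails_once = ( 501 => 1 );
my $fetch = sub {
    my ($index) = @_;
    my $failed = delete $fails_once{$index};
    return $failed ? 0 : 1;
};

my $state = { index => 1, count => 1_200, return_max => 500, missing => [] };
run_batches( $state, $fetch );
print "missing after first pass: @{ $state->{missing} }\n";
retry_missing( $state, $fetch );
print "missing after retry: ", scalar @{ $state->{missing} }, "\n";
```

Recording only the starting index keeps the saved state small, and a retry pass can be repeated as often as needed because batches that fail again simply stay on the list.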

Working scripts are included in the script directory:

 fetch-all.pp
 fetch-missing.pp
 fetch-unavailable.pp


The recommended workflow is:

 1. Copy the scripts and edit them for a specific project. Use
 a new number as the project ID.

 2. Begin downloading by running fetch-all.pp, which will first
 submit a query and save the resulting WebEnv key in a
 project-specific configuration file (using YAML).

 3. The next morning, kill the fetch-all.pp process and run
 fetch-missing.pp until it completes.

 4. Restart fetch-all.pp.


If you wish to re-download "not available" sequences, you may run fetch-unavailable.pp. However, they will be downloaded at the end of fetch-all.pp if it completes normally.

If your query result set is so large that your WebEnv times out, simply start a new project with the last index of the previous project, and it will pick up the result set from there (with a new WebEnv). (A planned upgrade will automagically start another search.)

Warning: You may lose a (very) few sequences if your download extends across multiple projects. However, our testing shows that the batches generated with the same query within a few days of each other are largely identical.

SYNOPSIS

 use NCBIx::BigFetch;
 
 # Parameters
 my $params = { project_id => "1",
                base_dir   => "/home/user/data",
                db         => "protein",
                query      => "apoptosis",
                return_max => "500" };
 
 # Start project
 my $project = NCBIx::BigFetch->new( $params );
 
 # Love the one you're with
 print " AUTHORS: " . $project->authors() . "\n";
 
 # Attempt all batches of sequences
 while ( $project->results_waiting() ) { $project->get_next_batch(); }
 
 # Get missing batches
 while ( $project->missing_batches() ) { $project->get_missing_batch(); }
 
 # Find unavailable ids
 my $ids = $project->unavailable_ids();
 
 # Retrieve unavailable ids
 foreach my $id ( @$ids ) { $project->get_sequence( $id ); }

Requirements:

· Perl
