AWS-iGenomes - Common reference genomes hosted on AWS S3

Introduction

In NGS bioinformatics, a typical analysis run involves aligning raw DNA sequencing reads against a known reference genome. A different reference is needed for every species, and many species have several references to choose from. Each tool then builds its own indices against these references. As a result, a single analysis run typically requires several different files: for example, the raw underlying DNA sequence, annotation (GTF files) and index files for the chosen alignment tool.

These files are quite large and take time to generate. When running in the cloud, downloading and building them for each AWS run often takes a significant fraction of the total run time and resources, which is very wasteful. To help with this, we have created an AWS S3 bucket containing the Illumina iGenomes references, with a few additional indices for extra tools on top of this base dataset. The iGenomes initiative aims to collect and standardise a number of common species, references and tool indices.

This data is hosted in an S3 bucket (~5TB) and, crucially, is uncompressed (unlike the .tar.gz files held on the Illumina iGenomes FTP servers). AWS runs can pull just the required files to their local file storage before running. This has the advantage of being faster, cheaper and more reproducible.
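As a rough illustration of pulling only the required files, the sketch below builds an `aws s3 sync` command for a single reference subdirectory and prints it instead of running it. The bucket path in `IGENOMES_BASE`, the `species/provider/build/subdir` layout, the example values and the helper name `build_sync_cmd` are all assumptions made for illustration. Confirm the real path for your reference with the command builder before running anything.

```shell
#!/usr/bin/env bash
# Sketch only: build (but do not run) an `aws s3 sync` command for one
# reference subdirectory.
# ASSUMPTIONS: the bucket path below and the directory layout
# <species>/<provider>/<build>/<subdir> are illustrative, not taken from
# the text above. Override IGENOMES_BASE if your path differs.
IGENOMES_BASE="${IGENOMES_BASE:-s3://ngi-igenomes/igenomes}"

# build_sync_cmd is a hypothetical helper name used only in this example.
build_sync_cmd() {
    local species="$1" provider="$2" build="$3" subdir="$4" outdir="$5"
    # ${outdir%/} strips one trailing slash so the path is not doubled up.
    printf 'aws s3 sync %s/%s/%s/%s/%s/ %s/\n' \
        "$IGENOMES_BASE" "$species" "$provider" "$build" "$subdir" "${outdir%/}"
}

# Print the command so it can be checked before it is executed.
build_sync_cmd Homo_sapiens Ensembl GRCh37 Sequence/WholeGenomeFasta ./references
```

Printing the command first makes it easy to verify the source and destination paths. Once they look right, the printed line can be pasted into a shell that has the AWS command line tools installed and configured.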

For more details about what’s contained in this data repository, please see the GitHub readme.

Download Script

To make usage easier, this repository contains a script (aws-igenomes.sh) which can sync the AWS-iGenomes references for you. It requires the AWS command line tools to be installed and configured with authentication. The required references can be supplied on the command line or given through prompts when running the script.

This repository is hosted using GitHub pages, so the script can be run in a single command as follows:

curl -fsSL https://ewels.github.io/AWS-iGenomes/aws-igenomes.sh | bash

For more details, see https://ewels.github.io/AWS-iGenomes/
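If you would rather not pipe a remote script straight into bash, one alternative is to check that the prerequisites are present, save the script to a file, read it, and then run it. The sketch below takes that approach. The helper names (`missing_tools`, `fetch_igenomes_script`) and the local file name are hypothetical choices for this example, and only the script URL comes from the text above.

```shell
#!/usr/bin/env bash
# Sketch only: check prerequisites, then download the script to a file so
# it can be inspected before it is run.
SCRIPT_URL="https://ewels.github.io/AWS-iGenomes/aws-igenomes.sh"  # URL from this README
SCRIPT_FILE="aws-igenomes.sh"                                      # local file name (our choice)

# Print the name of every tool given as an argument that is not on PATH.
# (hypothetical helper for this example)
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Download the script without executing it.
# (hypothetical helper for this example)
fetch_igenomes_script() {
    curl -fsSL "$SCRIPT_URL" -o "$SCRIPT_FILE"
}

# Report missing prerequisites on stderr; nothing is downloaded here.
missing="$(missing_tools curl aws)"
if [ -n "$missing" ]; then
    printf 'Missing required tool: %s\n' $missing >&2
fi

# Typical usage once the tools are available:
#   fetch_igenomes_script && less "$SCRIPT_FILE" && bash "$SCRIPT_FILE"
```

This check only confirms that the `aws` command exists. It does not verify that credentials are configured, so authentication problems would still surface when the script itself runs.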

Command Builder

If you’d prefer to just get a sync command for the files you need, you can use the web-based command builder that’s available at https://ewels.github.io/AWS-iGenomes/

Credits

The AWS S3 hosting charges are kindly paid for with a grant from AWS Programs for Research and Education. The iGenomes resource was created by Illumina. All credit for the collection and standardisation of this data should go to them!

This S3 resource was set up and documented by Phil Ewels (@ewels). The additional references not found in the base iGenomes resource were created with the help of Wesley Schaal (@wschaal) – a system administrator at UPPMAX (Uppsala Multidisciplinary Center for Advanced Computational Science).

The resource was initially developed for use at the National Genomics Infrastructure at SciLifeLab in Stockholm, Sweden.

Contributors

Phil Ewels

Licence

MIT

See the code for AWS-iGenomes here: https://github.com/ewels/AWS-iGenomes