Sarek currently uses GRCh38 by default.
The settings are in genomes.config and can be tailored to your needs.
The build.nf script builds the indexes for the test reference (smallGRCh37).
Use --genome GRCh37 to map against GRCh37.
If you are not on UPPMAX, first adjust the settings in genomes.config to your needs.
To get the needed files, download the GATK bundle for GRCh37.
The following files need to be downloaded:
- 242c0df2a698a76fc43bdd938ba57c62 - '1000G_phase1.indels.b37.vcf.gz'
- 00b0e74e4a13536dd6c0728c66db43f3 - 'dbsnp_138.b37.vcf.gz'
- dd05833f18c22cc501e3e31406d140b0 - 'human_g1k_v37_decoy.fasta.gz'
- a0764a80311aee369375c5c7dda7e266 - 'Mills_and_1000G_gold_standard.indels.b37.vcf.gz'
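Once the files are downloaded, the checksums listed above can be compared against the local copies. The following is a minimal sketch rather than part of Sarek: `check_md5` is a hypothetical helper, and it assumes GNU `md5sum` is installed and that the files sit in the current directory.

```shell
#!/usr/bin/env bash
# Sketch: verify downloaded GRCh37 bundle files against the MD5 sums listed above.
# Assumptions: GNU md5sum is available; files are in the current directory.

# check_md5 is a hypothetical helper, not part of Sarek.
# Returns 0 on match, 1 on mismatch, 2 if the file is missing.
check_md5() {
    local expected="$1" file="$2" actual
    if [ ! -f "$file" ]; then
        echo "MISSING  $file"
        return 2
    fi
    # Read from stdin so the output holds only the digest and "-".
    actual="$(md5sum < "$file" | cut -d' ' -f1)"
    if [ "$actual" = "$expected" ]; then
        echo "OK       $file"
    else
        echo "MISMATCH $file (got $actual)"
        return 1
    fi
}

failures=0
while read -r expected file; do
    check_md5 "$expected" "$file" || failures=$((failures + 1))
done <<'EOF'
242c0df2a698a76fc43bdd938ba57c62 1000G_phase1.indels.b37.vcf.gz
00b0e74e4a13536dd6c0728c66db43f3 dbsnp_138.b37.vcf.gz
dd05833f18c22cc501e3e31406d140b0 human_g1k_v37_decoy.fasta.gz
a0764a80311aee369375c5c7dda7e266 Mills_and_1000G_gold_standard.indels.b37.vcf.gz
EOF
echo "$failures file(s) missing or mismatched"
```

The same helper can be pointed at the GRCh38 file list by swapping the lines in the here-document.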
From our repo, get the intervals list file.
More information about this file is available in the intervals documentation.
How to generate the Loci file used in the ASCAT process is described here.
Use --genome GRCh38 to map against GRCh38.
If you are not on UPPMAX, first adjust the settings in genomes.config to your needs.
To get the needed files, download the GATK bundle for GRCh38 from ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/. You can also download the required files from the Google Cloud mirror link here.
The MD5SUM of Homo_sapiens_assembly38.fasta included in that file is 7ff134953dcca8c8997453bbb80b6b5e.
If you download the data from the FTP server's beta/ directory, which seems to hold an older version of the bundle, only Homo_sapiens_assembly38.known_indels.vcf is needed.
Also, you can omit dbsnp_138_ and dbsnp_144 files as we use dbsnp_146.
The old ones also use the wrong chromosome naming convention.
The Google Cloud mirror has all data in the v0 directory, but requires you to remove the resources_broad_hg38_v0_ prefixes from all files.
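Removing the prefix by hand is error-prone, so a small loop can do it. This is only a sketch: `strip_bundle_prefix` is a hypothetical helper, and it assumes the files from the mirror were downloaded into a single directory (here the current one).

```shell
#!/usr/bin/env bash
# Sketch: strip the resources_broad_hg38_v0_ prefix from files fetched
# from the Google Cloud mirror.

# strip_bundle_prefix is a hypothetical helper, not part of Sarek.
strip_bundle_prefix() {
    local dir="$1" prefix="resources_broad_hg38_v0_" path base target
    for path in "$dir"/"$prefix"*; do
        [ -e "$path" ] || continue   # the glob matched nothing
        base="${path##*/}"           # file name without the directory part
        target="${base#$prefix}"     # file name without the prefix
        mv -- "$path" "$dir/$target"
    done
}

# Assumption: the downloaded files are in the current directory.
strip_bundle_prefix .
```

Files without the prefix are left untouched, so the function is safe to run more than once.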
The following files need to be downloaded:
- 3884c62eb0e53fa92459ed9bff133ae6 - 'Homo_sapiens_assembly38.dict'
- 7ff134953dcca8c8997453bbb80b6b5e - 'Homo_sapiens_assembly38.fasta'
- b07e65aa4425bc365141756f5c98328c - 'Homo_sapiens_assembly38.fasta.64.alt'
- e4dc4fdb7358198e0847106599520aa9 - 'Homo_sapiens_assembly38.fasta.64.amb'
- af611ed0bb9487fb1ba4aa1a7e7ad21c - 'Homo_sapiens_assembly38.fasta.64.ann'
- d41d8cd98f00b204e9800998ecf8427e - 'Homo_sapiens_assembly38.fasta.64.bwt'
- 178862a79b043a2f974ef10e3877ef86 - 'Homo_sapiens_assembly38.fasta.64.pac'
- 91a5d5ed3986db8a74782e5f4519eb5f - 'Homo_sapiens_assembly38.fasta.64.sa'
- f76371b113734a56cde236bc0372de0a - 'Homo_sapiens_assembly38.fasta.fai'
- 14cc588a271951ac1806f9be895fb51f - 'Homo_sapiens_assembly38.known_indels.vcf.gz'
- 1a55fdfa6533ae5cbc70e8188e779229 - 'Homo_sapiens_assembly38.known_indels.vcf.gz.tbi'
- 2e02696032dcfe95ff0324f4a13508e3 - 'Mills_and_1000G_gold_standard.indels.hg38.vcf.gz'
- 4c807e2cbe0752c0c44ac82ff3b52025 - 'Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi'
If you downloaded only the Homo_sapiens_assembly38.fasta.gz file, decompress it and build the BWA index yourself:
gunzip Homo_sapiens_assembly38.fasta.gz
bwa index -6 Homo_sapiens_assembly38.fasta
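After indexing, it can be worth confirming that all `.64.*` index files exist and are not empty. The sketch below checks the five files that `bwa index -6` writes (`amb`, `ann`, `bwt`, `pac`, `sa`). `missing_bwa_index` is a hypothetical helper, and the check looks only at presence and size, not at content.

```shell
#!/usr/bin/env bash
# Sketch: report bwa index files that are absent or empty.

# missing_bwa_index is a hypothetical helper, not part of Sarek or bwa.
# Prints each missing or empty index file; returns 1 if any was found.
missing_bwa_index() {
    local fasta="$1" ext missing=0
    for ext in amb ann bwt pac sa; do
        if [ ! -s "$fasta.64.$ext" ]; then
            echo "$fasta.64.$ext"
            missing=1
        fi
    done
    return "$missing"
}

if missing_bwa_index Homo_sapiens_assembly38.fasta; then
    echo "bwa index looks complete"
else
    echo "the files listed above are missing or empty; rerun bwa index -6"
fi
```

The `.64.alt` file is not checked here because it comes with the bundle rather than from `bwa index`.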
How to generate the Loci file used in the ASCAT process is described here.
Use --genome smallGRCh37 to map against a small reference genome based on GRCh37.
smallGRCh37 is the default genome for the testing profile (-profile testing).
Sarek uses AWS iGenomes, which facilitates storing and sharing references.
Both GRCh37 and GRCh38 are available, with --genome GRCh37 or --genome GRCh38 respectively, under any profile that uses the conf/igenomes.config file (e.g. awsbatch or btb). With other profiles, you can load the file explicitly with -c conf/igenomes.config. It references all the data detailed above.
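The two ways of selecting iGenomes can be sketched as command lines. The helper below only prints the commands and does not run them. `igenomes_cmd` is hypothetical, the entry script name `main.nf` is an assumption, and other options a real run needs (such as sample input) are left out.

```shell
#!/usr/bin/env bash
# Sketch: print the two ways of selecting an iGenomes reference.

# igenomes_cmd is a hypothetical helper that only prints a command line.
# Assumption: main.nf is the pipeline entry script; other required
# options (such as sample input) are omitted.
igenomes_cmd() {
    local genome="$1" profile="${2:-}"
    if [ -n "$profile" ]; then
        # Profiles such as awsbatch or btb already use conf/igenomes.config.
        echo "nextflow run main.nf -profile $profile --genome $genome"
    else
        # Otherwise pass the config file explicitly.
        echo "nextflow run main.nf -c conf/igenomes.config --genome $genome"
    fi
}

igenomes_cmd GRCh38 awsbatch
igenomes_cmd GRCh37
```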
The build.nf script can build the files needed for smallGRCh37.
Use --refDir <path to references> to specify where the files to process are located.
nextflow run build.nf --refDir <path to references>