Example: download and search 66,000 GTDB genomes with a query genome

sourmash-bio/sourmash-examples#13


You'll need to build the genome signature file in Example: create a signature by downloading and sketching a genome sequence first.

Then, download the GTDB genomic representatives database:

curl -JLO [https://osf.io/3a6gn/download](https://osf.io/3a6gn/download)

This will create a 1.7 GB file, gtdb-rs207.genomic-reps.dna.k31.zip, which contains 66,000 genome sketches from the Genome Taxonomy Database, release 207.

Now search the genome against the GTDB database:

sourmash search GCF_000005845.2_ASM584v2_genomic.fna.gz.sig gtdb-rs207.genomic-reps.dna.k31.zip

This will take about 5 minutes.

The output will look like this:

8 matches; showing first 3:
similarity   match
----------   -----
 29.9%       GCF_003697165.2 Escherichia coli DSM 30083 = JCM 1649 = A...
 14.6%       GCF_002965065.1 Escherichia sp. MOD1-EC7003 strain=MOD1-E...
 14.2%       GCF_000026225.1 Escherichia fergusonii ATCC 35469 strain=...

showing that this genome is, indeed, an E. coli genome :).

The similarity in the left column is Jaccard similarity, calculated using the k-mers in the query genome sketch against the k-mers in each of the database genome sketches.

You can increase the number of output results with -n:

8 matches:
similarity   match
----------   -----
 29.9%       GCF_003697165.2 Escherichia coli DSM 30083 = JCM 1649 = A...
 14.6%       GCF_002965065.1 Escherichia sp. MOD1-EC7003 strain=MOD1-E...
 14.2%       GCF_000026225.1 Escherichia fergusonii ATCC 35469 strain=...
 14.1%       GCF_902498915.1 Escherichia ruysiae, OPT1704
 14.1%       GCF_004211955.1 Escherichia sp. E1V33 strain=E1V33, ASM42...
 13.5%       GCF_005843885.1 Escherichia sp. E4742 strain=E4742, ASM58...
 10.3%       GCF_001660175.1 Escherichia sp. B1147 strain=B1147, ASM16...
 10.1%       GCF_011881725.1 Escherichia coli strain=SCPM-O-B-8794, AS...

and you can record the results in a CSV file with -o <output.csv>.

Categories

This example belongs to the following categories: