Example: use picklists and manifests to work with a small subset of a large database
sourmash-bio/sourmash-examples#4
Suppose you want to search, compare, or otherwise work with only a small set of genomes from GTDB - perhaps only the ones with "Shewanella" in their name.
sourmash makes this easy!
Start by downloading the GTDB genomic representatives database from the prepared databases.
curl -JLO [https://osf.io/3a6gn/download](https://osf.io/3a6gn/download)
This will download a 1.7 GB file named gtdb-rs207.genomic-reps.dna.k31.zip
.
Extract a manifest
Now extract a manifest containing all of the metadata about signatures in this zip file:
sourmash sig manifest gtdb-rs207.genomic-reps.dna.k31.zip -o gtdb.mf --no-rebuild
(Here, the --no-rebuild
just uses the manifest included in the zip file, rather than regenerating it from scratch.)
Use csvtk to extract the name
column and then use grep
to select only those with Shewanella
in their name -
csvtk cut -f name gtdb.mf | grep -i Shewanella > shew.names.csv
You should see that you have 98 matches:
wc -l shew.names.csv
Make a picklist from the manifest row names
Turn this list into a picklist file by providing a column header:
echo name > shew-picklist.csv
cat shew.names.csv >> shew-picklist.csv
``
## Get a query signature
Run:
```shell
sourmash sig grep GCA_002341165 gtdb-rs207.genomic-reps.dna.k31.zip -o shew-query.sig
to pick out just one of the Shewanella signatures, to use as a search query.
Search using the picklist
Now you can search just the Shewanella genomes using the picklist:
sourmash search shew-query.sig gtdb-rs207.genomic-reps.dna.k31.zip --picklist shew-picklist.csv:name:name
This is much faster than searching the entire database (which contains 66k signatures) as long as you know you just want to search that specific list of Shewanella genomes.
Note: You could use just the space-delimited identifiers as a picklist, too, by using ident
as the column type in the argument to --picklist
, above. Please see the picklist docs for details.
Categories
This example belongs to the following categories: