Example: detect potential contaminants in unassembled and uncorrected PacBio sequencing data

sourmash-bio/sourmash-examples#22


A colleague was having some trouble mapping all their PacBio reads to the intended genome, so I decided to search it against the prepared Genbank databases with a k-mer size of 21:

# construct a signature
sourmash sketch dna -p k=21 reads.fastq -o reads.sig

# search against genbank using 'prefetch' with a low threshold to identify small overlaps
sourmash prefetch reads.sig genbank*-k21.zip -o reads.prefetch.csv --threshold-bp=0

I then analyzed the output CSV like so:

csvtk cut -f intersect_bp,match_name reads.prefetch.csv | sort -n | tail

to get the 10 largest matches (this uses csvtk).

The output was the following:

19000,"GCA_002018415.1 Enterobacter kobei strain=ICBECLAM6, ASM201841v1"
19000,"GCA_003383805.1 Staphylococcus pseudintermedius strain=ST68, ASM338380v1"
19000,"GCA_019057835.1 Candidatus Lokiarchaeota archaeon, ASM1905783v1"
19000,"GCA_020721325.1 Paludibacter propionicigenes, ASM2072132v1"
21000,"GCA_019389785.1 Streptomyces griseus subsp. griseus strain=51d, ASM1938978v1"
22000,"GCA_005787405.1 Rickettsiales bacterium, ASM578740v1"
23000,"GCA_001653815.2 Opitutaceae bacterium TAV3 strain=TAV3, ASM165381v2"
32000,"GCA_018274365.1 Escherichia coli strain=MTR_BAU01, ASM1827436v1"
36000,"GCA_913774445.1 uncultured Acinetobacter sp., maxbin_SP184_1.002_sub"
39000,"GCA_913774065.1 uncultured Acinetobacter sp., maxbin_SP424_1.012_sub_cleaned"

suggesting that there are substantial potential overlaps (~39kb and longer) with an Acinetobacter genome.

Categories

This example belongs to the following categories: