kmerHash
Project apply kmer and minHash method for genome comparions.
# Create a conda env with python=3.7 and install packages.
conda create -n sourmash python=3.7
conda activate sourmash
conda install -c bioconda sourmash
conda install -c bioconda ncbi-genome-download
conda install -c bioconda khmer
# Downlaod genome/fasta files from ncbi
ncbi-genome-download --format fasta --genus "Bacteroides fragilis" bacteria -p 8
Since this will create individaul folder per genome, we can pull them together into a single folder.
# make a symbolic link to the genbank file
mkdir data
## Create a symbolic link of genome files to the data folder
ln -s /Users/ravinpoudel/Documents/sourmash/refseq/bacteria/**/*.fna.gz data/
Now we have the input files/ genomes. We will apply kmer and minHash methods. Create MinHash sketches(signature) at k-mer sizes of 31
time sourmash compute --scaled 1000 -k 31 -p 16 *.fna.gz --outdir Bfrag_many_sig
Let say one of the signature represents a signature assocaited with reference genome. Then if we want to find the closet match of query sigature to the reference then,
reference signature = GCF_003471465.1_ASM347146v1_genomic.fna.gz.sig
mv Bfrag_many_sig/GCF_003471465.1_ASM347146v1_genomic.fna.gz.sig .
Now compare all the signature with the reference genome.
First, Let's turn all the singatures into an easily-searchable database with sourmash index –
sourmash index -k 31 bfragdb Bfrag_many_sig/*.sig
## output == bfragdb.sbt.json
Now we can search against reference.
sourmash search GCF_003471465.1_ASM347146v1_genomic.fna.gz.sig bfragdb.sbt.json -n 20
############## OUTPUT ######################
16 matches:
similarity match
---------- -----
59.9% GCF_000710375.2_ASM71037v3_genomic.fna.gz
59.2% GCF_902364655.1_MGYG-HGUT-00236_genomic.fna.gz
55.1% GCF_000710365.2_ASM71036v3_genomic.fna.gz
53.3% GCF_000297695.1_Bact_frag_HMW_610_V1_genomic.fna.gz
52.8% GCF_003439065.1_ASM343906v1_genomic.fna.gz
48.5% GCF_000724665.2_ASM72466v3_genomic.fna.gz
46.9% GCF_000724815.2_ASM72481v2_genomic.fna.gz
46.8% GCF_000724805.2_ASM72480v2_genomic.fna.gz
45.8% GCF_001695355.1_ASM169535v1_genomic.fna.gz
44.8% GCF_001693695.1_ASM169369v1_genomic.fna.gz
43.7% GCF_002849695.1_ASM284969v1_genomic.fna.gz
43.2% GCF_000157015.1_ASM15701v1_genomic.fna.gz
42.1% GCF_003463545.1_ASM346354v1_genomic.fna.gz
41.7% GCF_000297755.1_Bact_frag_HMW_616_V1_genomic.fna.gz
41.5% GCF_003465265.1_ASM346526v1_genomic.fna.gz
41.4% GCF_003464635.1_ASM346463v1_genomic.fna.gz
Just curious, how well will the match when search within the same genome.
sourmash search GCF_003471465.1_ASM347146v1_genomic.fna.gz.sig GCF_003471465.1_ASM347146v1_genomic.fna.gz.sig
##########OUTPUT ##########
1 matches:
similarity match
---------- -----
100.0% GCF_003471465.1_ASM347146v1_genomic.fna.gz
Compare many signatures and build a tree– application towards comparative genomics.
# Compare all the things, and create comparision plots:
sourmash compare -p 16 Bfrag_many_sig/*.sig -o bfrag_compare
sourmash plot --pdf --labels bfrag_compare
The maximum similarity in the matrix is 1, meaning the two signatures were from the same species at the k-mer size of 31. Some of signature have 0% similarity with other signatures, suggesting the presence of outlier group.
########## Download a database containing all of the GenBank microbial genomes:
curl -L -o genbank-k31.lca.json.gz https://osf.io/4f8n3/download
Next, run the ‘gather' command to see what's in your ecoli genome –
# comparing your fasta/genome across all the GenBank microbial genomes
sourmash gather -k 31 GCF_003471465.1_ASM347146v1_genomic.fna.gz.sig genbank-k31.lca.json.gz
##### output #######
electing specified query k=31
loaded query: GCF_003471465.1_ASM347146v1_ge... (k=31, DNA)
loaded 1 databases.
overlap p_query p_match
--------- ------- -------
4.6 Mbp 76.1% 74.3% JMZY02000001.1 Bacteroides fragilis s...
4.5 Mbp 7.3% 7.4% JMZX02000001.1 Bacteroides fragilis s...
3.9 Mbp 2.5% 2.4% EQ973245.1 Bacteroides fragilis 3_1_1...
4.0 Mbp 1.8% 1.9% JMZZ02000001.1 Bacteroides fragilis s...
4.0 Mbp 1.3% 1.4% JPHS01000001.1 Bacteroides fragilis s...
4.3 Mbp 1.0% 1.0% JH815482.1 Bacteroides fragilis HMW 6...
found less than 50.0 kbp in common. => exiting
found 6 matches total;
the recovered matches hit 90.0% of the query
In the case of metagenomics, you will have multiple metageomes that you want compare across GenBank microbial genomes:
For instanace, lets download metagenome / contig. Lets download a contigs file from HMP2 database(https://ibdmdb.org/tunnel/public/HMP2/WGS/1818/products) - eg- CSM79HLA_TR_contigs.fna
# crate a signatiure file for CSM79HLA_TR_contigs.fna
sourmash compute --scaled 1000 -k 31 -p 16 CSM79HLA_TR_contigs.fna -o CSM79HLA_TR_contigs.sig
# Now compare across GenBank microbial genomes
sourmash gather -k 31 CSM79HLA_TR_contigs.sig genbank-k31.lca.json.gz
## OUTPUT #############
selecting specified query k=31
loaded query: CSM79HLA_TR_contigs.fna... (k=31, DNA)
loaded 1 databases.
overlap p_query p_match
--------- ------- -------
4.4 Mbp 6.8% 73.6% MKXP01000012.1 Bacteroides vulgatus i...
4.1 Mbp 6.3% 73.8% JH724079.1 Bacteroides caccae CL03T12...
4.1 Mbp 6.2% 87.0% CZAD01000001.1 Bacteroides uniformis ...
3.8 Mbp 5.7% 46.8% JH724226.1 Bacteroides ovatus CL02T12...
3.3 Mbp 5.0% 72.3% DS264552.1 Parabacteroides merdae ATC...
2.4 Mbp 3.7% 68.0% KI271434.1 Oscillibacter sp. KLE 1728...
2.3 Mbp 3.6% 61.8% CP003274.1 Alistipes finegoldii DSM 1...
.
.
.
found less than 50.0 kbp in common. => exiting
found 90 matches total;
the recovered matches hit 81.1% of the query
### Export as csv file
sourmash gather -k 31 CSM79HLA_TR_contigs.sig genbank-k31.lca.json.gz -o CSM79HLA_TR_out.csv
Using sourmash LCA to do taxonomic classification. Sourmash LCA is using k-mers to do taxonomic classification, using the "lowest common ancestor" approach.
sourmash lca classify --db genbank-k31.lca.json.gz \
--query GCF_003471465.1_ASM347146v1_genomic.fna.gz.sig
# OUTPUT #########
loaded 1 LCA databases. ksize=31, scaled=10000
finding query signatures...
outputting classifications to -
ID,status,superkingdom,phylum,class,order,family,genus,species,strain
GCF_003471465.1_ASM347146v1_genomic.fna.gz,found,Bacteria,Bacteroidetes,Bacteroidia,Bacteroidales,Bacteroidaceae,Bacteroides,Bacteroides fragilis,
classified 1 signatures total
######
sourmash lca summarize --db genbank-k31.lca.json.gz \
--query GCF_003471465.1_ASM347146v1_genomic.fna.gz.sig
### OUTPUT ########
loaded 1 LCA databases. ksize=31, scaled=10000
finding query signatures...
loaded 1 signatures from 1 files total.
87.7% 529 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides
87.7% 529 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae
90.7% 547 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales
90.7% 547 Bacteria;Bacteroidetes;Bacteroidia
90.7% 547 Bacteria;Bacteroidetes
92.4% 557 Bacteria
8.0% 48 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides fragilis