Home / About Page

PopPUNK is a tool for clustering genomes. We refer to the clusters as variable-length-k-mer clusters, or VLKCs. Biologically, these clusters typically represent distinct strains. We refer to subclusters of strains as lineages.

The first version was targeted specifically at bacterial genomes, but the current version has also been used for viruses (e.g. enterovirus, influenza, SARS-CoV-2) and eukaryotes (e.g. Candida sp., P. falciparum).

Under the hood, PopPUNK uses pp-sketchlib to rapidly calculate core and accessory distances between genomes, and machine learning tools written in Python to cluster genomes based on these distances. A detailed description of the method can be found in the paper.
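
As a rough illustration of how pairwise distances can be turned into clusters, here is a minimal sketch. It is not PopPUNK's implementation: the function name, the input layout and the two fixed cutoffs (`core_max`, `accessory_max`) are hypothetical, with the cutoffs standing in for the boundary a fitted model would provide. Genome pairs below both cutoffs are linked, and each connected group of linked genomes is reported as one cluster.

```python
from itertools import combinations


def cluster_genomes(names, distances, core_max=0.01, accessory_max=0.1):
    """Group genomes whose pairwise distances fall below both cutoffs.

    distances maps a frozenset of two genome names to a
    (core, accessory) tuple. The cutoffs are hypothetical values
    chosen for this example, not PopPUNK defaults.
    """
    # Union-find: every genome starts in its own cluster.
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        core, accessory = distances[frozenset((a, b))]
        if core <= core_max and accessory <= accessory_max:
            # Treat this pair as the same strain: merge their clusters.
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy data: g1/g2 are close to each other, as are g3/g4.
names = ["g1", "g2", "g3", "g4"]
distances = {
    frozenset(("g1", "g2")): (0.001, 0.02),
    frozenset(("g1", "g3")): (0.030, 0.40),
    frozenset(("g1", "g4")): (0.031, 0.42),
    frozenset(("g2", "g3")): (0.029, 0.41),
    frozenset(("g2", "g4")): (0.030, 0.39),
    frozenset(("g3", "g4")): (0.002, 0.03),
}
clusters = cluster_genomes(names, distances)
print(clusters)
```

Because clusters are connected components, two genomes can end up in the same cluster without being directly linked, as long as a chain of close pairs joins them.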

If you are new to PopPUNK, we recommend starting with the best practices guide.

Related tools

As well as the 'core' PopPUNK tool for genomic epidemiology, the following related packages are available:

  • pp-sketchlib (installed as poppunk_sketch) is a high-performance genome sketching library developed specifically for PopPUNK. It can rapidly calculate pairwise core and accessory distances between long genomes, and makes use of multiple CPUs or even GPUs. It can also be used standalone as a drop-in replacement for mash, offering faster queries, smaller databases on disk, and support for a wider range of sequence types.
  • PopPIPE is a pipeline which automates some common downstream analyses on strains defined by PopPUNK. It produces subclusters, maximum likelihood trees and associated visualisations using a computationally efficient approach.


If you find PopPUNK useful, please cite as:

Lees JA, Harris SR, Tonkin-Hill G, Gladstone RA, Lo SW, Weiser JN, Corander J, Bentley SD, Croucher NJ. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Research 29:1-13 (2019). doi:10.1101/gr.241455.118