Databases come in two flavours: references only, or all genomes.

The references-only database is usually sufficient for the main use case: assigning new samples to PopPUNK clusters and updating the database with any new clusters that are found. It is also usually significantly smaller than the all-genomes database.

For more detailed analyses, you may wish to download the all-genomes database. Running poppunk_visualise, or any subclustering within strains, requires this full database.

In either case, only the reference genomes are actually used for query assignment. This does not change the results, but it noticeably reduces program runtime.

See the distributing models doc page for more details.
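
As a rough illustration of the assignment use case described above, the sketch below builds a poppunk_assign command line for a downloaded database. It only assembles and prints the command; it does not run PopPUNK. The directory name GPS_v4_references, the query list qfile.txt, and the output name poppunk_clusters are hypothetical placeholders, and build_assign_command is a helper invented for this example rather than part of PopPUNK.

```python
import shlex


def build_assign_command(db_dir, query_list, output_dir, threads=1):
    """Return the argument list for a poppunk_assign run.

    db_dir:     directory of the extracted database (either flavour)
    query_list: text file listing the query samples
    output_dir: prefix/directory for the assignment results
    """
    return [
        "poppunk_assign",
        "--db", str(db_dir),
        "--query", str(query_list),
        "--output", str(output_dir),
        "--threads", str(threads),
    ]


# Hypothetical paths: substitute your extracted database and query list.
cmd = build_assign_command(
    "GPS_v4_references", "qfile.txt", "poppunk_clusters", threads=8
)

# Print a shell-ready version of the command for inspection.
print(shlex.join(cmd))
```

To run the assignment directly from Python, the same list can be passed to `subprocess.run(cmd, check=True)`, assuming PopPUNK is installed and on your PATH.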

Database list:

Streptococcus pneumoniae

42,157 genomes

From the Global Pneumococcal Surveillance project, and other sequence collections. Used to assign global pneumococcal sequence clusters (GPSCs).

References only All genomes

Streptococcus pyogenes (group A Streptococcus)

2,084 genomes

From Davies et al.

References only All genomes

Escherichia coli

10,287 genomes

From Horesh et al.

References only All genomes

Streptococcus mitis

323 genomes

Contributed by Akuzike Kalizang'oma, based on publicly available data and carriage data.

All genomes