Population structure analysis
Principal component analysis (PCA) was performed using PLINK version
2.0. First, only biallelic SNPs were retained from the VCF file
encompassing all variants in the core genome, and linkage
disequilibrium (LD) pruning was performed using PLINK. PCA was then run
on the pruned dataset, retaining the first 20 principal components. PCA
results were plotted in R using the ggplot2 library. Starting from the
LD-pruned dataset,
admixture analysis was performed with ADMIXTURE version 1.3.0. The
optimal number of populations was determined by running ADMIXTURE for
K-values (i.e., the number of populations) from 2 to 50 with 10-fold
cross-validation and selecting the K-value with the lowest
cross-validation error. For phylogenetic analysis, the VCF file was
first converted to PHYLIP format using the vcf2phylip.py script.
Phylogenetic trees were then constructed using RAxML under the GTR+G
evolutionary model with 100 bootstrap replicates, with P. knowlesi
defined as the outgroup. The phylogenetic tree was visualized using the
ggtree library in R. Nucleotide diversity was
determined in 500-bp sliding windows across the genome over all
LD-pruned SNPs of the core genome using VCFtools. The multiplicity of
infection was calculated using the getFws function as implemented
in the moimix package in R.
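
The K-selection step can be illustrated with a minimal Python sketch. It assumes the ADMIXTURE logs for each K have been concatenated into one text and that each run was started with cross-validation enabled, so that the log contains lines of the form "CV error (K=3): 0.58817". The function names and the numbers in the example log are hypothetical and serve only as an illustration; they are not results from this study.

```python
import re

# ADMIXTURE run with cross-validation reports one line per K,
# e.g. "CV error (K=3): 0.58817".
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.eE+-]+)")


def parse_cv_errors(log_text):
    """Return {K: cv_error} for every CV error line found in the log text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def best_k(cv_errors):
    """Return the K with the lowest cross-validation error."""
    if not cv_errors:
        raise ValueError("no CV error lines found")
    return min(cv_errors, key=cv_errors.get)


# Illustrative values only (hypothetical, not from the study).
example_log = """\
CV error (K=2): 0.61520
CV error (K=3): 0.58817
CV error (K=4): 0.59703
"""

errors = parse_cv_errors(example_log)
print(best_k(errors))
```

In practice the same two functions would be applied to the logs of all runs from K = 2 to K = 50.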
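
As an illustration of windowed nucleotide diversity, the sketch below sums the standard unbiased per-site estimate of pairwise differences for biallelic SNPs within 500-bp windows and divides by the window length. It is a simplified sketch under stated assumptions, not the VCFtools implementation: windows are non-overlapping, missing data are ignored, and the input tuples (position, alternate allele count, number of called alleles) as well as the function names are hypothetical.

```python
from collections import defaultdict


def site_pi(alt_count, n_alleles):
    """Unbiased estimate of pairwise differences at one biallelic site."""
    if n_alleles < 2:
        return 0.0
    k = alt_count
    return 2.0 * k * (n_alleles - k) / (n_alleles * (n_alleles - 1))


def windowed_pi(sites, window=500):
    """Compute per-bp nucleotide diversity in non-overlapping windows.

    sites: iterable of (position, alt_count, n_alleles), 1-based positions
           on a single chromosome.
    Returns {window_start: pi_per_bp}, ordered by window start.
    """
    totals = defaultdict(float)
    for pos, alt_count, n_alleles in sites:
        start = ((pos - 1) // window) * window + 1
        totals[start] += site_pi(alt_count, n_alleles)
    return {start: total / window for start, total in sorted(totals.items())}


# Hypothetical example: three SNPs typed in four haploid genomes.
example_sites = [(10, 1, 4), (250, 2, 4), (700, 2, 4)]
pi_by_window = windowed_pi(example_sites)
for start, pi in pi_by_window.items():
    print(start, round(pi, 6))
```

Windows without any SNP are absent from the result here; a fuller version would report them with a diversity of zero.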