Metagenomic analysis of Mesolithic chewed pitch reveals poor oral health among stone age individuals

The specific environmental/history/collection context

The Huseby Klev materials were unearthed and collected by archaeologists (including two of the co-authors of this article) during the excavation of this coastal hunter-fisher-gatherer site in the 90s50.

The material assemblage was rich and well preserved: human bones, animal bones, plant remains and pieces of masticated pitch were found. The exceptional preservation was presumably caused by a layer of clay which sealed the site into a geological time capsule. Pitch or “tar” pieces are a rather common find at Stone Age sites in Eurasia, and particularly in Scandinavia, where well preserved pieces of masticated pitch are found dating well into the Iron Age57. The pitch pieces were produced from birch bark tar, thus making birch a valuable resource during the entire Stone Age. There are about 90 masticated pitch pieces from the site, while tooth imprints in six of them have been cast and osteologically analysed. All the analysed pieces had imprints suggesting individuals younger than 20 years: three aged between 5 and 11 and three teenagers50. The pieces we used for the study were directly sampled from the material collection, which is under the guardianship of Bengt Nordqvist.

Experimental procedure

The DNA extraction and library building was performed at Stockholm University, AFL, in ancient DNA dedicated facilities, all designed to have positive pressure, separate airlock, HEPA filters for the incoming air, UV lights at the ceiling, and constant sodium 3% hypochlorite disinfection (NaOCl). Sampling of the masticated pitch pieces was performed in a “drill laboratory”, a separate room to the “main laboratory” where exclusively sampling of the materials was undertaken to minimise contamination. The material was irradiated in a UV oven at about 6 J/cm2 at 254 nm. When the necessary amount of the samples (between 50 and 100 mg) were collected into 2 mL screw cap tubes, the tubes were taken to the main laboratory for further extraction and library building. A blank was added at this stage. We checked the DNA concentration after extraction using Qubit 3.0 high sensitive double strand DNA fluorometric assay (Thermo Fisher Scientific). We did not observe any trace of contaminating DNA in the blank extractions.

In the main lab extraction was performed using two methods: the Yang extraction method58 which is commonly used for extraction of bone tissue (incubation at 55 °C in 1000 μL of buffer per sample containing 0.5 M EDTA pH8, 1 M Urea, and 10 μL of Proteinase K 10 mg/mL, followed by concentration of the supernatant using Amicon membrane filters and purification with MinElute spin columns to obtain 110 μL of final product) and the QIAamp PowerFecal DNA Kit (Qiagen) with which we obtained 100 μL of final product, following the instructions of the kit with minor adjustments.

The libraries were also built in the main laboratory, following a modified protocol by Meyer and Kircher59 for double-stranded blunt-end-repair libraries and damage-repaired libraries as in56 pre-treating the extracts with USER enzyme. Libraries were amplified with 10 µM index primers, later double-indexed, using AmpliTaq Gold 1000 Units 5 U/μL (Applied Biosystems) for blunt-end libraries, and AccuPrime™ Pfx DNA Polymerase (2.5 U/μL) (Invitrogen) polymerase for damage repair libraries (damage repair applied to only ble004 sample). We determined the number of cycles using quantitative PCR, with reagents from Thermo Scientific and Biomers. Libraries were sequenced on the Illumina Hiseq X platform at the SciLife centre in Stockholm and SciLife centre in Uppsala. The damage repaired libraries of the ble004 sample are shown in Table S12.

DNA libraries and external datasets

After receiving the sequences, AdapterRemoval60 was used to trim adapters, nucleotides less than 20 base quality, and filtered shorter than 10 nucleotides and subsequently subjected to human demography pipeline.

In this study we focused on DNA sequences that do not map to the human reference genome. To do this we first filtered human DNA reads from fastq files and we used cutadapt61 to trim remaining adapters, and nucleotides less than 20 base quality, and 10 nucleotides were also removed. Final sequencing depth of the fastq files is shown in Table S12.

In addition, we used published ancient and modern datasets to understand the microbial composition of ancient mastics. First, we used pre-calculated MetaPhlAn3 relative abundances of modern human microbiome samples using the curatedMetagenomicData package25,26,27,28,29,30,62,63. Next, we downloaded published DNA data from ancient dental calculus and chewed pitch studies8,9,10,11,14,15,16,17,19,22,23,24. At last, we used published studies that focus on sediments to identify environmental microbiome contribution to pitch samples28,33,34,64 (Table S13).

For differential abundance analysis, we used a dataset containing salivary microbiome data of 12 healthy, 12 caries, and 10 periodontitis patients40,41.

We downloaded the fastq files from European Nucleotide archive and processed paired-end DNA sequences using AdapterRemoval tool61 tool by trimming bases that have less than 20 phred quality score and collapsed forward and reverse reads if they had overlapping regions with 11 or more nucleotides.

Metagenomic profiling of the ancient mastics

We used MetaPhlAn3 with default options to profile the fastq files of the ancient samples62. This tool uses a marker database (that contains approximately 13,500 bacterial and archaeal, 3,500 viral, and 110 eukaryotic marker DNA fragments from ~ 17,000 reference genomes) to profile a DNA library and produces a microbial relative abundance table. This describes the proportion of DNA fragments that belong to a particular taxon, and a sequence alignment map (sam) file that contains the alignment information. After profiling each DNA library, we removed optical PCR duplicates from sam files with picard-tools65, sam files that correspond to each library were merged for each sample, and relative abundance tables were calculated with MetaPhlAn3 for each sample.

To further analyse the dataset, we prepared a set of reference and representative genomes of bacterial, archaeal, fungal, and viral species from the NCBI RefSeq database66 (N = 6133) and indexed them using the malt alignment tool38 and we aligned each fastq file to this indexed reference collection with the same tool. We used the dust-masker67 tool to mask repetitive regions that could create false-positive classifications. Then, we aligned the sequenced libraries to the database using malt38 with default options, and we obtained binary rma6 files. These files contain the alignment and taxonomy information for each sequence. Afterwards, we used MEGAN68 to extract the species absolute abundance values for each fastq file.

Comparing microbial profiles from ancient samples with modern human microbiome

To understand the metagenomic composition in ancient mastic material, we compared the relative microbial abundances from the ancient mastic samples against a dataset consists of modern human oral cavity (N  = 725), modern human nasal cavity (N  = 93), modern human skin (N = 42), modern human stool (N = 500), modern human vagina (N = 86), ancient human dental calculus (N  = 395), and chewed pitch (N  = 1). The site composition of the oral microbiome samples are anterior nares (N = 93), buccal mucosa (N  = 119), hard palate (N  = 1), keratinized gingiva (N  = 6), palatine tonsils (N = 6), saliva (N  = 5), subgingival plaque (N  = 72), supragingival plaque (N  = 127), and tongue dorsum (N  = 198). Some of the oral microbiome samples do not have any site information and therefore were not included in the list.

We calculated pairwise Bray–Curtis distances between each compositional vector using the vegdist function in the vegan package of the R statistical platform32,69,70. The distance matrix was visualised using a non-metric multidimensional scaling (NMDS) implemented in metaMDS in the vegan library with a K value of 7. The first three axes were selected for visualisation.

At last, we used SourceTracker43 to quantify the oral microbiome content in chewed pitch materials30. To do this, we selected species taxonomic nodes from the MetaPhlAn3 abundance matrix, and we marked the comparison dataset from modern human oral microbiome samples and ancient dental calculus samples as source materials. Additionally, we included sediment (N = 23), and shallow marine sediment (N = 21) as environmental sources. Then we used the SourceTracker tool with the default options, to quantify the source contribution to the ancient samples (Table S14 – S22).

Generating full genome ancient bacterial sequences

We collected all reads that were assigned to species’ taxonomy nodes and combined them into one fastq file for each species, with all reads that align to at least one reference genome of that species, and nothing else.

Based on the taxonomic assignment results, we selected the corresponding reference genomes (n = 281). And we aligned each fastq file to the respective reference genome, separately.

To produce full genome sequences, we extracted aDNA reads for each bacteria using malt-extract38.

Thereafter, DNA reads for each sample were merged and aligned to their respective bacterial full genome sequences using bwa71.

Sequence alignment files were filtered using SAMtools72 with alignment quality (q30) parameter to remove bad alignments, and DNA sequences with length less than 30 nucleotides were discarded from the alignment file. Optical PCR duplicates were removed using Picard-tools with the mark-duplicates option. Genome coverage was calculated using SAMtools72. Deamination patterns were obtained by the mapDamage tool73 with default options. User enzyme treated (damage repaired, dr) ble004 libraries were excluded from the analysis to calculate reliable deamination patterns. Results from these libraries are denoted with the suffix nondr. Finally, we generated genome authentication plots that contain edit distance distribution, length distribution, deamination patterns, and genome coverage plots. Damage repaired ble004 libraries were added in the analysis after we obtained reliable deamination patterns in non-damage repaired libraries.

Also, we co-assembled DNA reads using the megahit tool with the k-list parameter 21,29,39,59,79,99,11974. In essence, we de novo assembled all DNA reads that belong to one particular sample and we classified the contigs using krakenuniq tool with the full NT database that contains Eukaryotic full length genomes75,76. To calculate the authenticity of the contigs, we used pyDamage tool77. To assess the assembly quality, we used the Quast tool78 and specified reference genomes of each oral microbe that were used in the reference-based genome assembly step. With this approach we aimed to estimate the assembled proportion of each microbe (Table S2).

Beta-binomial modelling to find overabundant microbes in ancient pitch materials

A beta-binomial statistical model was used to detect which commensal oral microbes are overabundant in ancient mastics. We first used modern salivary microbiome datasets that contain healthy and dysbiosis cases (periodontitis and dental caries), restricted to the species that are commonly found in human oral cavities, as described in the Human Oral Microbiome Database. Then we used the beta-binomial method implemented in the corncob79,80 R package to build differential abundance models for each bacterial species. The cases where the differences between healthy and dysbiosis are statistically significant (multiple tests corrected using Benjamini–Hochberg method with a cut-off value of 0.05) suggests that these microbes could be markers of dysbiosis.

After selecting the marker species for dysbiosis in the modern salivary microbiome, we applied the same methodology to ancient mastics. First, the sequencing depth of ancient mastic libraries was scaled so the average depth of each library matched the mean value of the healthy salivary datasets. This way all distributions have the same order of magnitude and are therefore easier to compare.

Then, a beta-binomial test was applied to each sample, and significant taxa and models were extracted. Afterwards, we searched for marker species in the list of significant species, and we restricted the species list to the species in the Human Oral Microbiome Database.

Moreover, the same methodology was used to identify the differences between pitch materials and ancient dental calculus samples that had been identified with periodontitis17.

Machine learning to find predictors of dysbiosis

To estimate the overall health of the people who chewed the pitch material, we used a machine-learning classification method known as random forests42. Random forests aggregate predictions from different classification trees to reduce variance and to improve robustness75. In essence, the method creates several simple classification trees, each with a randomly chosen subset of features, and combines their predictions through voting.

In our case, the features correspond to the different species considered, and the data is the relative abundance of these species found in each sample. The labels used for training are the dysbiosis conditions: caries, periodontitis, or healthy. In this part, we calculated the relative abundance restricted to oral specific microbial species.

Each random forest is trained using only part of the data (i.e., bagging), and uses the remaining data to evaluate an out-of-bag error rate, which is numerically close to the N-fold classification error rate76. In other words, a low out-of-bagging error indicates that the random tree will generalise well for new data.

By evaluating the mean decrease in Gini impurity (MDG), random forests can rank each feature (i.e., species) sorting out which ones are more relevant to characterise the health condition. We used the MDG ranking to iteratively remove the least important feature, until five features remain, following Darst et al.81. In each iteration we created ten independent random forests, we averaged the MDG for each feature, and we discarded the feature with the lowest average MDG. This procedure was repeated until five features remained. With this approach, we kept only the features that contribute to the overall accuracy of the model.

This whole feature selection process was repeated three times. Finally, we sorted all the trees built in the feature selection process by their out-of-bagging error rate, and kept the best two models in each major iteration to make a model pool.

To have a simpler interpretation, we handled the periodontitis and caries cases separately, building two independent model pools.

We used the randomForest function of the R statistical package with default parameters to build random forests models70,82.

Multiple models trained with the same feature set were combined using the combine function in the randomForest library. This model pool was used to estimate the probability of each health condition for the ancient material, using the predict function over the relative abundance tables from the Mesolithic pitch materials. These probability values were then plotted using the ggplot2 and ggpubr libraries on the R statistical package83,83.

Eukaryotic taxonomic classification

To study the presence of eukaryotic species, we first removed the bacterial reads that we identified from the previous step. Then, we used Kraken284 to classify these reads using the full NCBI NT database that contains the all available reference sequences at that time.

We also used several confidence thresholds to filter the possible false positive classification (0.1—0.5). We manually restricted the resulting abundance table to keep only eukaryotic taxa with at least 10,000 DNA reads.

To authenticate these eukaryotic DNA reads, we extracted reference sequences for each eukaryote from the Kraken2 database. Then we used bwa with custom parameters (L = 16,500, N = 0.01, O = 2) to align the reads assigned to each taxa against their corresponding genome71.

After alignment, we removed nucleotides shorter than 30 bp, and validated the antiquity of the eukaryotic DNA reads using the same aDNA authentication protocol that we used for bacteria. We plotted the edit distance distribution, length distribution, and deamination plots for each set of DNA reads aligned to a eukaryotic species. Each plot was manually scored between 0 and 1 (1 for ideal and 0 for not ideal) based on the visual confirmation criteria: edit distribution should peak at low values and decline rapidly, there should be deamination patterns close to the reads’ extremes, and the distribution of DNA length should be concentrated in small values. Then, the entries were sorted according to the scores and to the number of DNA reads aligned to the reference genome (Table S7—S11).

Source link

Leave a Reply

Your email address will not be published. Required fields are marked *