# Cancer risk across mammals – Nature

Documenting cancer in wild animals is extremely challenging in most cases owing to the lack of information on the age of individuals, the difficulty retrieving the bodies for necropsy and the likelihood of cancer negatively influencing survival before cancer itself could be detected. Although data on cancer incidence from wild populations would be indispensable to describe natural incidences of malignancies, such data, especially with corresponding ages and demographic histories, are unfortunately still far from our reach. Therefore, to estimate cancer mortality risk, we used data provided by Species360 and the Zoological Information Management System (ZIMS, Data Use Approval Number 73836), an international non-profit organization that maintains a real-time and centralized database of animals under human care (regrouping information from over 1,200 zoos worldwide). Although we recognize that the interpretation of data gathered on zoo animals requires caution, owing to strong human control on the diet, health, mortality factors, environment or standard biological functions of the animals, zoos provide exceptionally high data resolution on the demography and cause of death for a wide range of species. Here we rely on the high probability of body retrieval of deceased zoo animals and the necropsy routinely performed on most of them (unless found in an advanced stage of decomposition), aiming to identify the most likely pathology causing the death of the animal. These examinations are likely to reveal most solid tumours, but (although possible) benign tumours, liquid tumours (for example, leukaemia) or early-stage cancers are unlikely to be recorded here, either owing to their diagnostic difficulty or their perceived limited role in contributing to the death of the animal.

Specifically, here we use the husbandry module of ZIMS, providing information on birth, death, sex and pre-defined categories of pathological findings, including neoplasia (by definition tumours that contributed to the death, albeit with no option to specify the cancer type or other details). No statistical methods were used to predetermine sample size, but to minimize bias caused by potential temporal heterogeneity in data management practices and necropsy record-keeping15, here we focused on individuals alive or born after 1 January 2010 (data extraction: 30 May 2020). This sample was then used to characterize species-specific life expectancies and cancer incidence, but only after the exclusion of data that did not fulfil a series of criteria, to ensure the highest and most homogeneous data quality possible. First, cancer is an age-related disease that rarely occurs in juveniles, and pediatric cancers are usually medically distinct from adults’ cancers. As such, infant mortality differences observed across species can significantly confound cancer incidence estimates. Therefore, we gathered sex-specific or species-specific (wherever the former was not available) ages at sexual maturity and we considered individuals for analyses only if they reached maturity before or during our sampling period. For individuals of unknown sex (about 12% of all individuals in the raw data extract), the maximum of ages at sexual maturity of males and females was used as an inclusion age threshold. Sex-specific age at sexual maturity for each species was obtained from Conde et al.19 or from published literature resources (see data sources in https://github.com/OrsolyaVincze/VinczeEtal2021Nature/blob/main/SupplementaryData.xls). Second, given that age is a key predictor of cancer emergence, we considered only individuals for which date of birth was recorded precisely or within a narrow (maximum 30 days) time interval. Third, we considered only species in which postmortem pathological records were available for at least 20 adult individuals, irrespective of the cause of death (for example, infection, accident, geriatric disease and so on). Nonetheless, models presented were performed with increased thresholds of 40, 60, 80 and 100 individuals to check for consistency in the results (Supplementary Table 2). Fourth, given that the process of domestication is widely regarded as a major contributing factor to inbreeding depression and higher incidence of cancer39, we excluded all species that were subject to domestication as well as their wild ancestors (taxa excluded owing to being subject to domestication are listed in Supplementary Table 1). Following these restrictions, data extraction on age and cause of death resulted in information for 110,148 (62,556 live and 47,592 dead) individuals (n = 191 species). For calculation of ICM, we included only species in which survival is correctly estimated until old ages (that is, data allowing the estimation of age-specific survival until the age at which only 10% of individuals are surviving, n = 172 species). While these restrictions removed multiple sources of bias in our cancer mortality risk estimates, we cannot exclude the possibility that some species (for example, more charismatic ones) are subject to more frequent or more detailed necropsies. Nonetheless, our statistical approach, especially the complete case analysis, is largely insensitive to such biases, as individuals not having available postmortem diagnostic records are considered censored (see below). Also, while the depth of necropsy might vary slightly among species, neoplasia that had a significant contribution to the death of the animals (the focus of our study) are generally detected even at gross necropsies. Additionally, large species are considered of key importance for zoos, also reflected by the fact that the proportion of dead individuals with postmortem pathological records is larger in larger species (Pearson’s correlation: r = 0.24, t = 3.35, df = 189, P = 0.001). Accordingly, if charisma had a role in cancer detection, we would expect a larger cancer risk in large mammals, opposite to the (nonsignificant) negative body mass effect in our models. Consequently, we believe that charisma is unlikely to represent a major source of bias in our analysis.

### Estimation of adult life expectancy

As we have no reason to believe that censored individuals would not have the same prospect of survival as those who continue to be followed, we estimate adult life expectancy from age-specific survival estimated using the Kaplan–Meier procedure (using the survfit function in the R package survival40). Individuals older than their age at sexual maturity on 1 January 2010 were left-truncated at their age at this date; individuals reaching sexual maturity after this date were left-truncated at their age at sexual maturity. Individuals still alive at the time of data extraction were considered right-censored (samples per species varied from 42 to 5,816 individuals), while known fate individuals were assigned as dead (n = 47,592), irrespective of whether their cause of death was specified or not.

### Estimation of ICM

ICM was calculated using a competing risk approach, based on the cumulative hazard of cancer-related deaths and survival probability of the species under human care. First, age-specific survival Sx, at age x, was estimated from KM analysis as above. However, here we performed a complete-case analysis, using only 11,840 individuals for which the cause of death was specified together with right-censored survivors. Complete-case analysis assumes that missingness in the cause of failure is random, but we had no reason to believe that this was not the case in our dataset. Postmortem examinations are routinely carried out on most recovered bodies in zoos, and once examinations are performed the results are equally likely to be entered in the database irrespective of the pathologies identified. ICM estimates were thus based on n = 74,396 individuals, n = 179 species. Second, the cancer mortality hazard hcx was estimated using a KMx1 analysis where only deaths by cancer were incorporated as a death event. ICM is then such that

$${rm{I}}{rm{C}}{rm{M}}=mathop{sum }limits_{x=alpha }^{{rm{infty }}}{S}_{x}{h}_{x}^{{rm{c}}}$$

where α is the age at sexual maturity. The only difference with classic estimation is that we extracted hcx (and Sx) for each time unit with discrete jumps (and falls) at event times at age t and with hct≤x<t+1 = 0 (Sx constant) between these events; instead of estimating discrete hazard ({h}_{t}^{{rm{c}}}={d}_{t}^{{rm{c}}}/{n}_{t}) on these time intervals (where dct is the number of deaths by cancer within the interval and nt is the number of survivors at the beginning of the interval). We chose this method to reflect true variation in the data for the interspecific comparison where species differ greatly in the number of events and time interval between these (sometimes a third of the organism’s adult lifespan), a situation rarely met when comparing human groups.

### Covariates and statistical analyses

For each species, we obtained sex-specific adult body mass data from Species360’s ZIMS (see https://github.com/OrsolyaVincze/VinczeEtal2021Nature/blob/main/SupplementaryData.xls). Species-specific body mass was calculated as the average of all body mass measurements recorded in the ZIMS database in adults, while species-specific values were obtained by averaging the body masses of males and females. These were calculated only in species for which there were at least 100 adult body mass records; otherwise, body mass was taken from the literature and database review by Conde et al.19. We verified that there was a one-to-one correspondence in the body mass information for species with records in both datasets.

Diet information was obtained from a global diet dataset for terrestrial mammals32, providing information on diet composition at four hierarchical levels of food items (never consumed, occasionally consumed, secondary food item, primary food item). We collected information on animal content in diet, as well as subcategories of this diet class, namely invertebrate or vertebrate consumption, as well as specifically fish, reptile, bird and mammal consumption. Given that most food items had few species in the intermediate levels (occasional consumption and secondary food item), we re-categorized the diet variables at two levels: never/rarely consumed or representing the primary/secondary food item of the species. The effect of diet was tested in PGLS regressions using only species with non-zero cancer mortality risks. Models were run separately for each food item that were entered to a base model including body mass and adult life expectancy as covariates. Results are shown in Extended Data Table 2.

To account for statistical non-independence due to phylogenetic relationships, we obtained a sample of 1,000, equally plausible phylogenetic trees, from the posterior distribution published by Upham et al.41, covering 5,911 species. We then obtained a rooted consensus tree using the sumtrees Python library42. Two species recently raised to species level were manually added to the tree as sister taxa of the species it was recently separated from (that is, Cervus canadensis to Cervus elaphus and Gazella marica to Gazella subgutturosa). Phylogenetic signal of cancer risk was assessed using the function phylosig from the R package phytools22. Partial coefficients of determinations were calculated using the function R2.pred from the R package rr2 (ref. 43), based on models presented in Extended Data Table 3.

Models of cancer risk testing Peto’s paradox were performed using zero-inflated logistic models, which allow us to make inferences on the probability of detecting at least one cancer case in the species and, given that cancer was detected, inferences on the CMR or ICM. Therefore, the first part of this consisted of a phylogenetic binomial regression (using the function binaryPGLMM, in the R package ape44), where the dependent variable explained the presence of zeros and non-zeros in CMR or ICM. This model contained the log number of deceased individuals with available postmortem pathological records as an explanatory variable, due to the higher probability of detecting cancer with a higher number of dead individuals inspected. Additionally, the model contained body mass and adult life expectancy as covariates. The second part of the model consisted of a PGLS regression that investigated variance only in non-zero cancer risks. ICM and CMR were logit-transformed in all PGLS models as recommended when analysing proportions45. These models were weighted by log number of deceased individuals with available postmortem pathological records, as the precision of cancer mortality risk estimates is expected to increase with the number of dead individuals subject to postmortem examination, but it is not expected to explain bias in the estimation of the dependent variable in any particular direction (as in the case of the binomial models). These models also contained body mass and adult life expectancy as explanatory variables. Given the expected additive effect of body mass and longevity, the interaction between body mass and longevity metrics was also tested in all four models (binomial and logistic regressions for CMR and ICM), but these interactions did not increase model fit in any case and are therefore not presented. Both models were controlled for phylogenetic relatedness among species, where the scaling parameter of phylogenetic dependence (that is, s2/Pagel’s λ in PGLMMs and PGLSs respectively) was set to the most appropriate values assessed by likelihood ratio statics in each model separately. PGLS models in which Pagel’s λ converged to negative values were refitted with Pagel’s λ fixed at 0. Three species (Lagurus lagurus, Cricetus cricetus and Dasyuroides byrnei) had been removed from the latter models, due to their high leverage caused by their very low adult life expectancy compared to the rest of the species and therefore concerns of strong influence of these points over model fit. Nonetheless, all models were performed using the entire dataset, and the results were highly consistent with and without the exclusions (Supplementary Table 4 and Extended Data Fig. 7).

Order differences in cancer incidence were tested using standard linear regressions, built using only taxonomic orders in which at least two species had their cancer incidence estimated. The model contained CMR or ICM (non-transformed) as dependent variables and order as the sole explanatory factor. Pairwise order differences were assessed using the R package emmeans46. All analysis were performed in the R Statistical and Programming Environment, version 4.0.4 (ref. 47). Cancer mortality risks were transformed to percentages in figures and in the analysis performed on order differences (Extended Data Table 1), for easier interpretation. Models presented in Extended Data Tables 2 and 3 and Supplementary Tables 2–4 are based on probabilities.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.