Abstract
Objective Polygenic risk scores (PRS) for diverticular disease must be evaluated in diverse cohorts. We sought to explore shared genetic predisposition across the phenome and to assess risk stratification in individuals genetically similar to European, African and Admixed-American reference samples.
Methods A 44-variant PRS was applied to the All of Us Research Program. Phenome-wide association studies (PheWAS) identified conditions linked with heightened genetic susceptibility to diverticular disease. To evaluate the PRS in risk stratification, logistic regression models for symptomatic and for severe diverticulitis were compared with base models with covariates of age, sex, body mass index, smoking and principal components. Performance was assessed using area under the receiver operating characteristic curves (AUROC) and Nagelkerke’s R2.
Results The cohort comprised 181 719 individuals for PheWAS and 50 037 for risk modelling. PheWAS identified associations with diverticular disease, connective tissue disease and hernias. Across ancestry groups, one SD PRS increase was consistently associated with greater odds of severe (range of ORs (95% CI) 1.60 (1.27 to 2.02) to 1.86 (1.42 to 2.42)) and of symptomatic diverticulitis (range of ORs (95% CI) 1.27 (1.10 to 1.46) to 1.66 (1.55 to 1.79)) relative to controls. European models achieved the highest AUROC and Nagelkerke’s R2 (AUROC (95% CI) 0.78 (0.75 to 0.81); R2 0.25). The PRS provided a maximum R2 increase of 0.034 and modest AUROC improvement.
Conclusion Associations between a diverticular disease PRS and severe presentations persisted in diverse cohorts when controlling for known risk factors. Relative improvements in model performance were observed, but absolute change magnitudes were modest.
- DIVERTICULAR DISEASE
- GENETIC POLYMORPHISMS
- SURGICAL RESECTION
Data availability statement
No data are available. Individual-level data are not available to protect the privacy of biobank participants. Analysis workspaces and code can be shared upon request to approved All of Us Researcher Workbench users.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
WHAT IS ALREADY KNOWN ON THIS TOPIC
Polygenic risk scores may assist in individualising approaches to risk stratification for heritable diseases. Loci associated with diverticular disease have been established, but investigations of diverticulitis polygenic risk scores have been performed in participants genetically similar to European reference samples.
WHAT THIS STUDY ADDS
This multibiobank study found that a polygenic risk score for diverticulitis was associated with up to 86% greater odds of severe disease, with evidence of transferability to a diverse cohort.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
The use of polygenic risk scores in counselling may inform decisions about escalation of care, dietary counselling or selection for endoscopic evaluation. This work provides direction for future functional investigations with a recommendation to focus on connective tissue biology.
Introduction
The diverticular disease spectrum spans from asymptomatic diverticulosis to complicated diverticulitis, with more severe disease conferring a substantial burden on quality of life. While it was previously thought that the chance of experiencing a severe episode was related to episode frequency, the first diverticulitis episode likely poses the greatest risk of complications.1 The lack of tools for personalised stratification is especially important in light of recent guideline changes emphasising case-by-case circumstances when considering colectomy.2 Lifestyle factors are important modifiers,3 but genetic variation has been shown to be a risk factor for severe presentations.4
Genome-wide association studies (GWAS) have identified loci linked with diverticular disease.5–9 One promising approach to translate these findings into a clinical tool is through a polygenic risk score (PRS), which combines effects of genetic variants across the genome into a single score for an individual. PRSs have been explored as an adjunct for personalised screening recommendations10–12 and to inform discussions about risk of disease complications.4 While conclusions regarding clinical utility have been mixed,13 PRSs may bring value for diseases where existing stratification approaches are limited. Towards this end, PRSs have been developed for diverticular disease using participants genetically similar to European reference samples.4 6 14 However, European-derived PRSs perform variably when applied to other populations due to differences in linkage disequilibrium patterns, distinct risk loci resulting from varying selection pressures in independent populations and sparse availability of genetic data in non-European populations.15–17 Evidence of transferability across populations must be established prior to considering clinical implementation.
This study aimed to evaluate a diverticular disease PRS in a diverse cohort. Using phenome-wide association studies (PheWAS) of the PRS, we identified conditions associated with heightened genetic risk for diverticular disease. To assess performance in risk stratification, the PRS was included as a covariate alongside demographic and clinical risk factors in models predicting diverticulitis outcomes.
Methods
Study population and phenotyping
Our primary analysis used the All of Us Research Program while a secondary analysis used Vanderbilt University Medical Center’s biobank (BioVU). The All of Us Research Program is a longitudinal prospective cohort study combining electronic medical records with genetic information for 413 000 participants.18 BioVU is a genomic repository associated with a deidentified database of clinical records from a tertiary care centre.19 Adult participants aged 18–90 years with available short read whole-genome sequencing data and electronic medical records were eligible for inclusion.
Phenotyping for the PheWAS analysis used phecodes for diverticular disease, which were mapped from International Classification of Diseases, 10th Revision (ICD-10) codes. Since positive predictive values range from 0.67 to 0.92 when identifying complications via ICD-10 codes alone,20 21 we adopted a validated rule-based phenotyping algorithm for diverticulitis complications in the risk stratification portion of the analysis. This algorithm combined diagnostic codes, procedural codes, settings of care and temporal relationships between codes (online supplemental material).21 A person was defined as having severe diverticulitis when they had either two inpatient admissions for diverticulitis or a procedure performed for diverticulitis (colectomy or percutaneous drain). Mild diverticulitis was defined as receiving a diagnostic code for diverticulitis in the outpatient setting or with no more than one inpatient admission, while cases of asymptomatic diverticulosis were exclusively assigned diagnostic codes for diverticulosis. Participants with second-degree relatedness or greater were excluded via the auxiliary relatedness files from the All of Us Research Program and via PLINK2’s implementation of KING’s robust estimator for BioVU.22 Participants with a diagnostic code for colorectal cancer or an inflammatory bowel disease were also excluded. Covariates were extracted from structured clinical data elements including age, sex, body mass index (BMI) and ever-smoking status.
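The severity definitions above can be sketched as a simple rule cascade. This is a minimal illustration only: the field names are hypothetical, "two inpatient admissions" is assumed to mean at least two, and the temporal relationships between codes used by the validated algorithm are not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Encounters:
    # All field names are hypothetical and chosen for illustration only.
    inpatient_diverticulitis_admissions: int = 0
    outpatient_diverticulitis_codes: int = 0
    diverticulosis_codes: int = 0
    had_colectomy_or_drain: bool = False


def classify(e: Encounters) -> str:
    """Assign a severity label following the definitions in the text."""
    # Severe: a procedure for diverticulitis, or (assumed) two or more admissions.
    if e.had_colectomy_or_drain or e.inpatient_diverticulitis_admissions >= 2:
        return "severe"
    # Mild: an outpatient diverticulitis code or no more than one admission.
    if e.inpatient_diverticulitis_admissions == 1 or e.outpatient_diverticulitis_codes > 0:
        return "mild"
    # Asymptomatic: diverticulosis codes only.
    if e.diverticulosis_codes > 0:
        return "asymptomatic diverticulosis"
    return "control"
```

Checking the most severe rule first means that a participant meeting several definitions is assigned the highest applicable category.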
Genotyping, imputation and quality control
The All of Us Research Program Controlled Tier V7 short read whole-genome samples were sequenced on the Illumina NovaSeq 6000 system with a standardised variant and sample quality control pipeline.23 24 In BioVU, 91 449 individuals were directly genotyped with the Illumina Expanded Multi-Ethnic Genotyping Array followed by imputation of autosomal single nucleotide polymorphisms (SNPs) to the TOPMed version R2 reference panel using the TOPMed imputation server.25 Preimputation and postimputation quality control was performed to limit analyses to high-quality SNPs and samples (online supplemental material).26 27 We identified samples genetically similar to gnomAD descriptors of 1000 Genomes reference samples (1 KG) using precomputed data in the All of Us Research Program and principal component analysis in BioVU.28 These do not reflect distinct biologic groups but represent artificial thresholds when considering genetic similarity to reference panels.29 Throughout the manuscript, we adopt EUR, AFR and AMR abbreviations to refer to individuals with genetic similarity to the reference European (EUR), African (AFR) and Admixed American (AMR) 1 KG samples, respectively. Alternative methodologies for assigning genetic similarity were explored with minimal differences in the distribution of groups (online supplemental material).
Polygenic risk score
The 44-SNP PRS used in this study was previously derived from a UK Biobank GWAS of European participants through conditional and joint analysis with validation in the Michigan Genomics Initiative.4 9 In this GWAS, pathway enrichment analysis suggested immunity, cell adhesion, membrane transport/signalling and intestinal motility as biological processes important to the onset of diverticular disease. In the PRS, the five variants with the largest effect sizes (rs763969618, rs115490395, rs56044859, rs4333882 and rs1802575) were associated with genes implicated in membrane transport (SLC35F3, SLC6A17) and cell adhesion (ELN, EFEMP1, PCDH10).
For each individual in our cohort, a PRS was calculated with PLINK2 by multiplying the dosage of risk-increasing alleles by their reported effect sizes, and summing across all PRS loci.4 Scores were standardised to a mean of zero and SD of one within each group.30 As a sensitivity analysis, we calculated a genome-wide PRS comprising 1 615 623 SNPs with the Bayesian PRS-CSx, which is optimised for incorporation of multiple ancestral backgrounds.31 For this, we used summary statistics from the largest available EUR GWAS for diverticular disease6 and the pan-UK Biobank AFR GWAS.32
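The weighted-sum calculation and within-group standardisation described above can be sketched as follows. The study itself used PLINK2; this standalone version is for illustration, the variant identifiers and weights are placeholders rather than the published 44-SNP weights, and the use of the population SD is an assumption.

```python
import statistics


def raw_prs(dosages, weights):
    """Sum of risk-allele dosage multiplied by effect size across PRS loci.

    dosages: dict of variant id -> risk-allele dosage (0 to 2) for one person.
    weights: dict of variant id -> reported effect size (log odds).
    """
    return sum(dosages[variant] * beta for variant, beta in weights.items())


def standardise(scores):
    """Rescale scores to mean 0 and SD 1 within a single ancestry group."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)  # population SD, an assumption of this sketch
    return [(s - mean) / sd for s in scores]


# Placeholder weights for three hypothetical variants.
example_weights = {"variant_a": 0.10, "variant_b": 0.20, "variant_c": 0.30}
example_person = {"variant_a": 2, "variant_b": 1, "variant_c": 0}
example_score = raw_prs(example_person, example_weights)
```

Standardising separately within each group means that a one SD increase refers to that group's own score distribution, which is how the per-SD odds ratios reported later should be read.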
Statistical analysis
PheWAS
To explore clinical diagnoses associated with genetic risk for diverticular disease, we performed PheWAS of the PRS in the All of Us Research Program. We included AFR, AMR and EUR given that diverticular disease code counts were greater than 200,33 and case assignment required a minimum of two code instances on distinct days. Each ancestry group was analysed independently in addition to a combined group with all participants included. Logistic regression models were fit across the phenome, including the PRS as a covariate alongside the first 10 genetic principal components, age and sex. In sensitivity analyses, we adjusted for BMI and excluded diverticular disease cases to evaluate whether associations were contingent on a simultaneous diverticular disease diagnosis. Analysis was performed using the PheWAS R package with a two-sided p value significance threshold after Bonferroni correction for multiple comparisons (EUR: p<2.83×10−5; AFR: p<3.23×10−5; AMR: p<3.33×10−5).34 Next, we evaluated whether phecodes found to have association with elevated genetic risk for diverticular disease were also associated with assignment of diverticular disease diagnoses in the medical record. For this, we included significant phecodes from PRS-PheWAS as covariates in a logistic regression model with an outcome of a diverticular disease phecode when controlling for age and sex. The most conservative Bonferroni-adjusted significance threshold among ancestry groups was adopted (p<2.83×10–5).
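Two of the rules in this paragraph, case assignment from repeated codes and the Bonferroni threshold, can be illustrated with a short sketch. The family-wise alpha of 0.05 is an assumption, since the text reports only the resulting thresholds, and the function names are illustrative.

```python
from datetime import date


def is_case(code_dates, min_distinct_days=2):
    """A phecode case requires code instances on at least two distinct days."""
    return len(set(code_dates)) >= min_distinct_days


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold; alpha=0.05 is assumed, not reported."""
    return alpha / n_tests


def implied_tests(threshold, alpha=0.05):
    """Approximate number of tests implied by a reported threshold."""
    return round(alpha / threshold)


# Two codes on the same day do not satisfy the distinct-day requirement.
same_day = [date(2020, 1, 5), date(2020, 1, 5)]
different_days = [date(2020, 1, 5), date(2020, 3, 9)]
```

Under the assumed alpha of 0.05, the reported EUR threshold of 2.83×10−5 would correspond to roughly 1767 tested phecodes; the number of phecodes tested is not stated in the text.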
Clinical modelling
The primary outcome was severe diverticulitis evaluated relative to a reference group of controls with no diverticular disease. We also considered supplemental comparisons of symptomatic (mild or severe) diverticulitis as well as changing the reference group to asymptomatic diverticulosis. Model calibration was assessed through le Cessie-van Houwelingen-Copas-Hosmer (LVC) goodness-of-fit tests and calibration plots.35 Logistic regression models were fit for outcome counts exceeding 100, a criterion met for BioVU (AFR, EUR) and the All of Us Research Program (AFR, AMR and EUR) individuals. Covariates included the PRS, age at inclusion date, sex, BMI and ever-smoking status. A base model excluding the PRS served as a comparison from which to assess the incremental value of the PRS. Metrics for full and base models included area under the receiver operating characteristic curve (AUROC) and Nagelkerke’s R2 adjusted for the liability scale.30 36 All statistical analyses were performed using R V.4.3.2, and an overview of study design is shown in figure 1.
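The two performance metrics can be computed from observed outcomes and fitted probabilities as in the sketch below. It is written in Python rather than the R used in the study, it assumes the probabilities come from an already fitted logistic model, and it reports Nagelkerke’s R2 on the observed scale only; the liability-scale adjustment is not implemented.

```python
import math


def log_likelihood(y, p):
    """Bernoulli log-likelihood of outcomes y (0/1) given probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))


def nagelkerke_r2(y, p):
    """Cox-Snell R2 rescaled by its maximum attainable value."""
    n = len(y)
    prevalence = sum(y) / n
    ll_null = log_likelihood(y, [prevalence] * n)  # intercept-only model
    ll_model = log_likelihood(y, p)
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


def auroc(y, p):
    """Probability that a random case is scored above a random control."""
    cases = [pi for yi, pi in zip(y, p) if yi == 1]
    controls = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in cases for b in controls)
    return wins / (len(cases) * len(controls))
```

The incremental value of the PRS is then the difference in each metric between the full model and the base model evaluated on the same participants.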
Results
PheWAS
There were 181 719 individuals meeting inclusion criteria for PheWAS. For each group, the PRS was associated with phecodes for diverticular disease (EUR: p=1.22×10−78; AMR: p=6.56×10−14; AFR: p=1.55×10−8) with ORs (95% CI) ranging from 1.18 (1.12 to 1.25) to 1.28 (1.20 to 1.37) (figure 2). Other PRS-phecode associations included abdominal hernia (OR (95% CI) 1.06 (1.03 to 1.08); p=1.76×10−6) and diaphragmatic hernia (OR (95% CI) 1.08 (1.05 to 1.12); p=2.51×10−6) for EUR models as well as diffuse diseases of connective tissue (OR (95% CI) 0.78 (0.70 to 0.87); p=1.07×10−5) for the AMR models. When removing cases of diverticular disease, associations with connective tissue diseases but not hernias remained (online supplemental material). The inclusion of BMI as a covariate did not alter the significance of any findings. Among the phecodes associated with heightened genetic risk for diverticular disease, those also associated with assignment of diverticular disease codes in the medical record were inguinal hernia, diaphragmatic hernia, femoral hernia, sicca syndrome and other unspecified connective tissue disease (online supplemental material).
Clinical modelling
In the All of Us Research Program, the clinical modelling cohorts included EUR (n=23 127), AFR (n=6520) and AMR (n=3699) individuals. Corresponding BioVU cohorts included EUR (n=13 767) and AFR (n=1824) individuals. Median age was 57 (IQR 50–65) with 59% women in the All of Us Research Program, while median age was 56 (IQR 49–66) with 55% women in BioVU (table 1). In BioVU, severe diverticulitis comparisons were not considered for AFR due to insufficient sample size. Among all cases of diverticular disease in the All of Us Research Program, the prevalence of severe diverticulitis was highest in the top quintile of polygenic risk, as compared with the lower quintiles (absolute prevalence difference, EUR: 1.3%; AMR: 2.5%; AFR: 1.6%) (online supplemental material). Models were adequately calibrated in the All of Us Research Program, but poorly calibrated models were found for BioVU symptomatic versus asymptomatic disease (LVC test p=0.02) and symptomatic disease versus control (LVC test p<0.01) comparisons (online supplemental material).
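The quintile comparison reported above can be illustrated with the following sketch. It assumes that the top quintile is compared against the four lower quintiles pooled together, which the text does not specify, and it ignores ties at the quintile boundary.

```python
def top_quintile_prevalence_difference(scores, severe):
    """Prevalence of severe disease in the top PRS quintile minus the rest.

    scores: PRS values for participants with diverticular disease.
    severe: matching 0/1 indicators of severe diverticulitis.
    Pooling the lower four quintiles is an assumption of this sketch.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_top = len(scores) // 5  # size of the top fifth, rounded down
    top, rest = order[-n_top:], order[:-n_top]
    prevalence_top = sum(severe[i] for i in top) / len(top)
    prevalence_rest = sum(severe[i] for i in rest) / len(rest)
    return prevalence_top - prevalence_rest
```

A positive value indicates that severe disease is more common among those with the highest polygenic risk; the differences reported above correspond to values of 0.013 to 0.025 on this scale.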
Covariate associations
The PRS was associated with greater odds of severe diverticulitis across groups (All of Us Research Program OR per SD increase (95% CI), EUR 1.65 (1.45 to 1.88); AFR 1.60 (1.27 to 2.02); AMR 1.86 (1.42 to 2.42)). This pattern persisted when broadening inclusion to symptomatic diverticulitis (All of Us Research Program OR (95% CI), EUR 1.66 (1.55 to 1.79); AMR 1.63 (1.39 to 1.90); AFR 1.27 (1.10 to 1.46)) (table 2). When considering a reference group of asymptomatic diverticulosis, the PRS persisted as a positive predictor in EUR and AMR but not AFR models (online supplemental material).
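Because the PRS was standardised, each OR above is the exponentiated logistic regression coefficient for a one SD increase in the score. The sketch below shows the conversion, assuming Wald-type intervals (an assumption; the interval method is not stated) and using the reported EUR estimate for severe diverticulitis as a worked example.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def odds_ratio_with_ci(beta, se):
    """Convert a log-odds coefficient and its SE to an OR with a Wald 95% CI."""
    return (math.exp(beta),
            math.exp(beta - Z_95 * se),
            math.exp(beta + Z_95 * se))


def se_from_ci(lower, upper):
    """Recover the SE of the log odds from a reported 95% CI (Wald assumed)."""
    return (math.log(upper) - math.log(lower)) / (2.0 * Z_95)


# Reported EUR estimate for severe diverticulitis: 1.65 (1.45 to 1.88).
beta_eur = math.log(1.65)
se_eur = se_from_ci(1.45, 1.88)
or_eur, lower_eur, upper_eur = odds_ratio_with_ci(beta_eur, se_eur)
```

On this reading, an OR of 1.86 in the AMR group corresponds to 86% greater odds of severe diverticulitis per SD increase in the PRS.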
Performance metrics
In the All of Us Research Program, the AUROC (95% CI) in the EUR full model for severe diverticulitis was 0.72 (0.70 to 0.75) (base 0.70 (0.68 to 0.72), difference=0.02 (0.01 to 0.03)). The AUROCs (95% CI) for models of severe diverticulitis were comparatively smaller in AFR (base 0.68 (0.63 to 0.73); full 0.70 (0.66 to 0.74), difference=0.02 (0.00 to 0.04)) but not AMR models (base 0.69 (0.65 to 0.74); full 0.73 (0.68 to 0.78), difference=0.03 (0.01 to 0.06)). The trend of performance decline from EUR to AFR models was mirrored in the BioVU models for symptomatic diverticulitis (EUR base 0.75 (0.73 to 0.76); full 0.77 (0.75 to 0.78), difference=0.02 (0.01 to 0.03); AFR base 0.70 (0.65 to 0.75); full 0.72 (0.67 to 0.77), difference=0.01 (0.00 to 0.03)). The largest AUROC was observed in the EUR full model for severe diverticulitis in BioVU (0.78 (0.75 to 0.81)) (online supplemental material).
In models for severe diverticulitis, the largest Nagelkerke’s pseudo-R2 was achieved in the EUR models (base 0.086; full 0.106) relative to corresponding AMR (base 0.077; full 0.105) or AFR (base 0.066; full 0.085) models (figure 3, online supplemental material). Similar trends were observed with the PRS-CSx score (online supplemental material).
Discussion
This multibiobank study found that positive associations between a PRS and diverticulitis persisted in AMR and AFR models with both PheWAS and risk stratification for severe disease. There were relative increases in model performance attributable to the PRS, but the magnitude of absolute change in discrimination was modest. Given this encouraging evidence of transferability, the findings of this study suggest that genomic associations with diverticulitis severity subtypes may bring value to diverse clinical cohorts.
Diverticular disease is an encouraging candidate for clinical applications of genomics given a paucity of stratification approaches and estimates of heritability as high as 53%.37 Three PRSs have been systematically investigated.4 6 14 De Roo et al4 derived a 44-SNP PRS from a European UK Biobank discovery population using conditional and joint association analysis.4 Wu et al6 generated a PRS using SBayesR from their meta-analysis of 724 372 European participants in the UK Biobank, FinnGen and BioVU with subsequent validation in CARTaGENE participants.6 Schaeffer et al14’s 373-SNP and 851-SNP PRSs originated from application of PRSice-2 to summary statistics from UK Biobank European participants followed by validation in Geisinger MyCode participants.14 While methodologies and cohorts varied, all reported positive findings with respect to PRS validation in European individuals.
Our study’s PRS-PheWAS provided initial support for validity of a diverticular disease PRS across diverse ancestral backgrounds. Diverticular phecodes were consistently associated with the PRS, with the magnitude of association largest in AMR models. We identified conditions associated with heightened genetic risk for diverticular disease, including abdominal hernias and connective tissue diseases. The disappearance of hernia phecode significance when excluding diverticular cases suggests that this association was more likely driven by a concomitant diagnosis of diverticular disease rather than true pleiotropy. Plausible explanations include incidental diverticulosis identified after an abdominal imaging study obtained for hernias, or incisional hernias occurring after an abdominal operation for diverticulitis. Time-censored PheWAS would allow for further evaluation of these hypotheses.12 We also found that genetic susceptibility to diverticular disease was associated with lower odds of the parent phecode for diffuse connective tissue diseases, an unexpected result given that functional investigations implicate connective tissue biology in the onset of diverticular disease.5–7 9 It is also notable that this association was present in one subgroup (AMR) but not others (EUR, AFR). There may be differential contributions from the diagnosis codes encompassed by the parent phecode for diffuse diseases of connective tissue (709). Overall, our findings support continued focus on connective tissue biology for future investigations of pathogenesis.
Differential performance of PRSs across ancestral backgrounds is an important consideration given the diversity encountered in many clinical practices. In our study’s adjusted models, a positive association between the PRS and severe diverticulitis was found in EUR models, which persisted for AMR and AFR models. This trend of transferability suggests that diverticular disease loci from EUR discovery populations may retain predictive ability in other populations. However, absolute improvements in AUROC and Nagelkerke’s R2 were modest in all models. The performance of European PRSs often degrades when applied to other ancestries due to differences in allele frequencies, correlation patterns between SNPs or selection pressures.38–40 In contrast, some have reported similar levels of performance in other diseases,41 while a prior study of diverticular disease identified correlation between effect sizes of individual SNPs across ancestry groups.6 Improving multiancestry applications of PRSs requires attention to practices of data collection, data analysis and return of information to participants. First, it is critical to improve the availability of genomic data in non-European populations, recently modelled through efforts from the All of Us Research Program and the Global Biobank Meta-analysis Initiative.18 42 For PRSs, the importance of adequately powered discovery GWASs in diverse populations is illustrated by our PRS-CSx sensitivity analysis. Even when adopting this PRS calculation method that is designed specifically for multiancestry cohorts, drastic differences between discovery GWAS sample sizes (EUR n=56 355 cases; AFR n=298 cases) likely contributed to the observed minimal improvement in performance. Second, continued work is needed in methods development specifically for cross-ancestry derivation and evaluation of PRSs.31 43 44 Finally, collaboration between clinicians, patients and genomic researchers is required to optimise the ways in which genomic risk information should be returned to participants. Recent examples from the eMERGE Network provide practical guidance towards this end.45 46
While it is unlikely that a diverticular disease PRS would dramatically alter existing stratification approaches, the future potential for such a tool is promising. The degree of variance attributable to the PRS in our study did not approach existing estimates of SNP-based heritability (11%), potentially due to methodology in SNP selection and effect size estimation for the PRS as well as phenotypic differences between discovery and validation populations.6 For example, existing GWASs have identified cohorts with any diverticular disease using diagnostic codes, but validation studies that evaluate clinical impact narrow the focus to symptomatic, severe or recurrent diverticulitis.4 14 Future GWASs of severe or recurrent diverticulitis might shed light on variants specifically linked with disease progression. Furthermore, estimates of broad-sense heritability (41%) suggest possible roles for rare variants or non-additive genetic effects in diverticular disease.6 Additional investigation is needed to uncover the underlying biological mechanisms responsible for observed genetic associations.
A well-performing diverticular disease PRS would bring clinical value. For one, it may inform discussions about elective colectomy for patients with prior severe or recurrent diverticulitis. A discrete choice analysis of colorectal surgeons has already demonstrated substantial receptivity to a genetic risk tool in this scenario.4 In addition, with the onset of endoscopy-based predictive models,47 48 a diverticulitis PRS may be considered during initial patient evaluation to assist with targeted selection for further workup. Finally, there is evidence that awareness from the return of personalised genomic profiles motivates positive behavioural change, which would be an important addition to diverticulitis care where lifestyle factors are strong risk modifiers.49 50
There are a number of limitations to this study. For both the All of Us Research Program and BioVU, we were unable to investigate participants similar to South Asian, East Asian or Middle Eastern reference samples due to sample size. The PRS used in this study originated from European discovery populations, and future matching of ancestry groups between discovery and validation populations may lead to different conclusions. We did not include diet and exercise in adjusted clinical models due to limited availability in the electronic medical records, and there are known limitations associated with using BMI as a proxy for central adiposity as well as ICD codes for smoking status. In addition, we did not include rare variants or gene-by-environment interactions, though these are critical areas for future study.
Conclusions
Associations between a diverticular disease PRS and severe presentations persisted in diverse cohorts when controlling for known demographic and clinical risk factors. Relative improvements in model performance were observed, but magnitudes of absolute change were modest.
Ethics statements
Patient consent for publication
Ethics approval
This study was approved by the Vanderbilt University Medical Center Institutional Review Board (IRB #230138). Work in All of Us was approved through a dedicated workspace using the Controlled Tier Dataset version 7, available to authorised users on the Researcher Workbench.
Acknowledgments
We would like to acknowledge Hannah Polikowsky, Lauren Petty and the Below Lab for expertise in genetic ancestry methodologies and imputation of microarray data. We also gratefully acknowledge All of Us and BioVU participants for their contributions, without whom this research would not have been possible. We thank the National Institutes of Health’s All of Us Research Program for making available the participant data examined in this study.
References
Footnotes
Contributors TEU: conceptualisation, data curation, formal analysis, funding acquisition, investigation, methodology, software, validation, visualisation, writing—original; JDM: conceptualisation, funding acquisition, methodology, resources, visualisation, writing—review; CN: methodology, supervision, writing—review; JPS: methodology, resources, software, supervision, writing—review; JR, ERG, LM: conceptualisation, funding acquisition, methodology, resources, writing—review; RP: funding acquisition, project administration, resources, writing—review; ATH: conceptualisation, funding acquisition, project administration, resources, supervision, visualisation, writing—review, guarantor. All authors have reviewed and approved the final draft submitted.
Funding The work of TEU and RP was supported by National Institutes of Health award number T32DK007673. ATH’s work on this manuscript was supported by the National Institute of Diabetes and Digestive and Kidney Diseases of the National Institutes of Health under award number K23DK118192. JDM’s work was supported by the National Institute of General Medical Sciences under award number R01GM130791.
Disclaimer The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Competing interests None declared.
Provenance and peer review Not commissioned; internally peer-reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.