Contribution of Ambient Air Pollution on Risk Assessment of Type 2 Diabetes Mellitus via Explainable Machine Learning

DING Zhong Ao ZHANG Li Ying LI Rui Ying NIU Miao Miao ZHAO Bo DONG Xiao Kang LIU Xiao Tian HOU Jian MAO Zhen Xing WANG Chong Jian

DING Zhong Ao, ZHANG Li Ying, LI Rui Ying, NIU Miao Miao, ZHAO Bo, DONG Xiao Kang, LIU Xiao Tian, HOU Jian, MAO Zhen Xing, WANG Chong Jian. Contribution of Ambient Air Pollution on Risk Assessment of Type 2 Diabetes Mellitus via Explainable Machine Learning[J]. Biomedical and Environmental Sciences, 2023, 36(6): 557-560. doi: 10.3967/bes2023.069

doi: 10.3967/bes2023.069


Funds: This work was supported by the Science and Technology Innovation Team Support Plan of Colleges and Universities in Henan Province [Grant NO:21IRTSTHN029]; Foundation of National Key Program of Research and Development of China [Grant NO: 2016YFC0900803]; Key Research Program of Colleges and Universities in Henan Province [Grant NO: 21A330007]; and Discipline Key Research and Development Program of Zhengzhou University [Grant NO: XKZDQY202008, XKZDQY202002]
    Author Bio:

    DING Zhong Ao, male, born in 1999, Postgraduate, majoring in epidemiology and biostatistics

    ZHANG Li Ying, female, born in 1988, PhD, Lecturer, majoring in machine learning and medical data mining

    Corresponding author: WANG Chong Jian, E-mail: tjwcj2008@zzu.edu.cn Tel: 86-371-67781452.
  • &These authors contributed equally to this work.
Publication history
  • Received:  2022-11-03
  • Accepted:  2023-04-06
  • Published:  2023-06-20

    Type 2 diabetes mellitus (T2DM) is a heterogeneous and complex disease that can affect individuals at various life stages[1]. Beyond traditional predictors such as age, family history of diabetes, and body mass index, previous studies have shown that ambient air pollution also increases the risk of T2DM. However, previous T2DM risk assessment models have rarely included air pollution features as predictors. Machine learning algorithms are widely used to construct disease prediction models and have demonstrated better discrimination and greater effectiveness than traditional statistical methods[2]. However, the “black box” nature of machine learning greatly hinders the interpretability of such models, especially for medical decisions[3]. SHapley Additive exPlanations (SHAP), a method based on game theory, was proposed by Lundberg et al. to make machine learning explainable; SHAP can display feature contributions as well as interaction effects in a model[4, 5]. This study aims to reveal the contribution of air pollutant exposure to a T2DM risk assessment model, as well as the effects of air pollutants on traditional predictors, via SHAP.

    Participants in this study were drawn from the Henan Rural Cohort. A detailed description of this cohort has been published previously[6], and a brief introduction is provided in the supplementary material. A total of 38,258 individuals were included in this analysis; the flow chart of the data processing procedure is shown in Supplementary Figure S1 (available in www.besjournal.com). Individual air pollutant exposure was evaluated as the 3-year annual mean concentration of four ambient air pollutants: nitrogen dioxide (NO2) and particulate matter with an aerodynamic diameter ≤ 1.0 μm, ≤ 2.5 μm, and ≤ 10.0 μm (PM1, PM2.5, and PM10)[7]. T2DM was defined by the following criteria: (1) fasting blood glucose (FBG) ≥ 7.0 mmol/L; (2) a previous diagnosis of T2DM by a doctor together with use of anti-glycemic drugs or insulin in the past two weeks. Detailed descriptions of the exposure, outcome, and covariate assessment methods are provided in the supplementary material.
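    The outcome and exposure definitions above can be written as two small functions. The following is a minimal Python sketch, not the authors' code: the function and parameter names are hypothetical, and it assumes that the two T2DM criteria are alternatives (meeting either one is sufficient) and that exposure is the plain arithmetic mean of three annual concentrations.

```python
from statistics import mean


def has_t2dm(fbg_mmol_l, prior_diagnosis, treated_past_two_weeks):
    """Classify T2DM status from the two criteria in the text.

    Assumption: the criteria are alternatives, so either one suffices.
    """
    meets_glucose = fbg_mmol_l >= 7.0  # criterion (1): FBG >= 7.0 mmol/L
    meets_history = prior_diagnosis and treated_past_two_weeks  # criterion (2)
    return bool(meets_glucose or meets_history)


def three_year_mean(annual_means):
    """Average three annual mean concentrations (e.g. PM2.5 in ug/m3)."""
    if len(annual_means) != 3:
        raise ValueError("expected exactly three annual mean values")
    return mean(annual_means)


if __name__ == "__main__":
    print(has_t2dm(6.2, prior_diagnosis=True, treated_past_two_weeks=True))
    print(three_year_mean([72.0, 73.0, 74.0]))
```

    Keeping the outcome rule in one function makes the case definition easy to audit when the threshold or the medication window changes.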

    Figure S1.  The data processing flow chart of this study.

    In this study, 20 traditional variables and the air pollutant exposure variable were taken as candidate variables[2]. After variable selection, a Gradient Boosting Machine (GBM) was used to construct the model from the selected variables. To explain the effect of air pollutants in the T2DM risk assessment models, SHAP, an additive feature attribution method, was employed to show the contribution of each predictor. A detailed description of the model development is provided in the supplementary material.

    To quantify exposure to the mixture of air pollutants, quantile g-computation was employed in this analysis. The resulting score (QGS) is calculated as shown in Equation (1); a detailed description of the formula is provided in the supplementary material.

    $$ \begin{aligned} QGS=&\left({\beta }_{PM_{2.5}}\times {PM}_{2.5}+{\beta }_{PM_{10}}\times {PM}_{10}\right)\times {S}_{1}+\\ &\left({\beta }_{PM_1}\times {PM}_{1}+{\beta }_{NO_2}\times {NO}_{2}\right)\times {S}_{2} \end{aligned}$$ (1)
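    Equation (1) can be evaluated directly once the coefficients are known. The sketch below is a minimal Python illustration that plugs in the scaled effect sizes and the sums S1 and S2 listed in Supplementary Table S3. The function name `qgs`, the dictionary keys, and the assumption that each pollutant enters as a quantile-coded value (0, 1, 2, ...) rather than a raw concentration are ours; the article does not state how the inputs are scaled.

```python
# Scaled effect sizes and sums as listed in Supplementary Table S3.
BETA = {"PM2.5": 0.709, "PM10": 0.291, "NO2": 0.501, "PM1": 0.439}
S1 = 0.407   # sum of positive coefficients
S2 = -0.194  # sum of negative coefficients


def qgs(pm25, pm10, pm1, no2, beta=BETA, s1=S1, s2=S2):
    """Evaluate Equation (1) for one individual.

    Assumption: inputs are quantile-coded exposures, not raw concentrations.
    """
    positive_part = beta["PM2.5"] * pm25 + beta["PM10"] * pm10
    negative_part = beta["PM1"] * pm1 + beta["NO2"] * no2
    return positive_part * s1 + negative_part * s2


if __name__ == "__main__":
    # Hypothetical individual in quantile 2 for particulates and 1 for NO2.
    print(round(qgs(pm25=2, pm10=2, pm1=2, no2=1), 4))
```

    Passing the coefficients as parameters keeps the function usable with a different coefficient set, for example one estimated under another covariate adjustment.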

    To describe the characteristics of the predictors, counts (percentages) were used for categorical variables and mean ± standard deviation for continuous variables. The chi-square test (or Fisher’s exact test) was used to compare categorical variables, and the t-test was used for continuous variables. The area under the receiver operating characteristic (ROC) curve (AUC) was used to evaluate discriminative performance, and the Brier score (BS) was used to evaluate calibration. AUCs were compared with the DeLong test. A two-tailed P value of less than 0.05 was considered statistically significant. Statistical tests were performed using R 3.6.2 and SPSS 21.0 (IBM, Chicago, USA).
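    For readers unfamiliar with the two performance metrics, the following minimal Python sketch shows how each is defined. It is an illustration only and not the R code used in the study: the function names are hypothetical, the AUC is computed by the pairwise (Mann-Whitney) definition, which is simple but quadratic in sample size, and the four data points are made up.

```python
def auc(y_true, y_score):
    """AUC as the probability that a random case outscores a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    cases = [s for y, s in zip(y_true, y_score) if y == 1]
    controls = [s for y, s in zip(y_true, y_score) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


if __name__ == "__main__":
    outcomes = [0, 0, 1, 1]               # made-up labels
    predicted = [0.10, 0.40, 0.35, 0.80]  # made-up predicted risks
    print(auc(outcomes, predicted), brier_score(outcomes, predicted))
```

    A higher AUC indicates better discrimination, whereas a lower Brier score indicates better calibrated probabilities, so the two metrics move in opposite directions when a model improves.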

    A total of 38,258 individuals were included in the analysis, of whom 3,564 had T2DM. Compared with individuals without T2DM, those with T2DM tended to be older and to have a higher body mass index, heart rate, and pulse pressure (P < 0.05). Detailed characteristics are shown in Supplementary Table S1 and Supplementary Table S2 (available in www.besjournal.com). Coefficients of the quantile g-computation are shown in Supplementary Table S3 (available in www.besjournal.com). After adjusting for covariates, the air pollutant mixture was associated with T2DM risk (odds ratio, OR 1.22, 95% CI 1.16–1.27). When the QGS was stratified by tertiles, the second and third tertiles both showed this association relative to the first [OR 1.30 (1.18, 1.43) and 1.44 (1.31, 1.59), respectively; P < 0.001], suggesting that higher exposure to air pollutants increased the prevalence risk of T2DM. Detailed information is shown in Table 1. Principal component analysis and the air pollution score showed the same tendency; detailed information can be found in Supplementary Table S4 (available in www.besjournal.com). Although previous research has confirmed the effects of long-term exposure to ambient air pollution on T2DM, the association of a mixture of air pollutants with T2DM prevalence was still unknown. Using three mixing approaches, we found that higher air pollutant exposure increased the risk of T2DM, consistent with the results of previous studies[8].

    Table 1.  Associations (ORs and 95% CI) of the mixture of ambient air pollutants with T2DM

    Variables            Model 1              Model 2              Model 3
    T1*                  1.00 (Ref.)          1.00 (Ref.)          1.00 (Ref.)
    T2                   1.53 (1.34, 1.68)    1.36 (1.24, 1.50)    1.30 (1.18, 1.43)
    T3                   1.83 (1.68, 2.00)    1.54 (1.41, 1.69)    1.44 (1.31, 1.59)
    Each score change    1.34 (1.29, 1.41)    1.25 (1.20, 1.31)    1.22 (1.16, 1.27)
      Note. Model 1 was the crude model; Model 2 was adjusted for age, gender, BMI, waist-to-hip ratio, pulse pressure, and heart rate; Model 3 was further adjusted for more vegetable and fruit intake, physical activity, and family history of diabetes. *Groups were divided by the tertiles of the QGS. OR, odds ratio; CI, confidence interval; T1–T3, tertiles of the QGS.
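    The tertile grouping that defines T1–T3 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' procedure: the function name `assign_tertiles` is hypothetical, the cut points come from the default ("exclusive") method of `statistics.quantiles`, and a score equal to a cut point is placed in the lower group; the article does not specify either convention.

```python
from statistics import quantiles


def assign_tertiles(scores):
    """Label each score T1, T2, or T3 using the sample tertile cut points."""
    low_cut, high_cut = quantiles(scores, n=3)  # two cut points for three groups
    labels = []
    for score in scores:
        if score <= low_cut:
            labels.append("T1")
        elif score <= high_cut:
            labels.append("T2")
        else:
            labels.append("T3")
    return labels


if __name__ == "__main__":
    example_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # made-up QGS values
    print(assign_tertiles(example_scores))
```

    With T1 as the reference category, the ORs in the T2 and T3 rows compare each higher-exposure group against the lowest third of the score.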

    Table S1.  Characteristics of the traditional T2DM predictors

    Characteristics                                Non-T2DM (n = 34,694)    T2DM (n = 3,564)    P value*
    Age, mean ± SD, years                          55.13 ± 12.34            60.47 ± 9.25        < 0.001
    Men (n, %)                                     13,691 (39.46)           1,340 (37.60)       0.031
    Body mass index, mean ± SD, kg/m²              24.67 ± 3.50             26.13 ± 3.66        < 0.001
    Waist-to-hip ratio, mean ± SD                  0.88 ± 0.07              0.93 ± 0.07         < 0.001
    Heart rate, mean ± SD, beats/minute            75.34 ± 10.98            79.62 ± 12.21       < 0.001
    Pulse pressure, mean ± SD, mmHg                47.69 ± 12.83            53.36 ± 14.23       < 0.001
    More vegetable and fruit intake (yes, n, %)    14,787 (42.62)           1,269 (35.61)       < 0.001
    Physical activity (n, %)
     Low                                           10,915 (31.46)           1,389 (38.97)       < 0.001
     Moderate                                      13,296 (38.32)           1,268 (35.58)       < 0.001
     High                                          10,483 (30.22)           907 (25.45)         < 0.001
    Family history of T2DM (yes, n, %)             1,229 (3.54)             352 (9.88)          < 0.001
      Note. SD, standard deviation. *Student’s t-test was used to compare the mean difference of continuous variables; the chi-square test was used to compare the distributions of categorical variables.
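    As a worked illustration of how an odds ratio and its confidence interval are formed, the family-history counts in Table S1 can be arranged as a 2 × 2 table. The minimal Python sketch below computes a crude OR with a Wald 95% CI. This crude figure is our own illustration and is not a result reported in the article, whose ORs come from regression models; the function and variable names are hypothetical.

```python
from math import exp, log, sqrt


def odds_ratio_wald_ci(exposed_cases, unexposed_cases,
                       exposed_controls, unexposed_controls, z=1.96):
    """Crude odds ratio with a Wald confidence interval from a 2 x 2 table."""
    odds_ratio = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls)
    # Standard error of log(OR): square root of the summed reciprocal counts.
    se_log_or = sqrt(1 / exposed_cases + 1 / unexposed_cases
                     + 1 / exposed_controls + 1 / unexposed_controls)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Family history of T2DM, counts taken from Table S1.
    t2dm_with_history = 352
    t2dm_without_history = 3564 - 352
    non_t2dm_with_history = 1229
    non_t2dm_without_history = 34694 - 1229
    print(odds_ratio_wald_ci(t2dm_with_history, t2dm_without_history,
                             non_t2dm_with_history, non_t2dm_without_history))
```

    Adjusted ORs such as those in Table 1 differ from this crude calculation because they condition on the other covariates in the model.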

    Table S2.  Characteristics of the ambient air pollutants

    Characteristics                Non-T2DM (n = 34,694)    T2DM (n = 3,564)    P value*
    NO2, mean ± SD, μg/m³          39.80 ± 3.61             40.66 ± 3.53        < 0.001
    PM1, mean ± SD, μg/m³          57.41 ± 2.67             57.81 ± 2.67        < 0.001
    PM2.5, mean ± SD, μg/m³        73.36 ± 2.58             73.95 ± 2.50        < 0.001
    PM10, mean ± SD, μg/m³         132.32 ± 5.85            133.76 ± 5.61       < 0.001
      Note. *Student’s t-test was used to compare the mean difference of continuous variables.

    Table S3.  Coefficients of the quantile g-computation in this study

    Items                                    Scaled effect size (β)*
    Positive direction
     PM2.5                                   0.709
     PM10                                    0.291
     Sum of positive coefficients (S1)       0.407
    Negative direction
     NO2                                     0.501
     PM1                                     0.439
     Sum of negative coefficients (S2)       −0.194
      Note. *Adjusted for age, gender, and physical activity.

    After univariate logistic regression and collinearity diagnosis, nine variables (age, gender, family history of diabetes, more vegetable and fruit intake, physical activity, body mass index, waist-to-hip ratio, pulse pressure, and heart rate) were chosen as traditional predictors. The GBM model that included air pollutant exposure achieved good discrimination (AUC 0.787) and acceptable calibration (Brier score, BS 0.076), outperforming the traditional model (AUC 0.764, BS 0.079). Detailed information can be found in Table 2 and Supplementary Table S5 (available in www.besjournal.com). These results showed that air pollution is a risk factor for T2DM and that including ambient air pollution can improve the prediction performance of traditional models to some extent.

    Table 2.  Comparison of the performance metrics with and without air pollutants

    Variables               Traditional & air pollutants model    Traditional model
    Sensitivity (95% CI)    0.805 (0.774, 0.834)                  0.823 (0.793, 0.851)
    Specificity (95% CI)    0.631 (0.619, 0.642)                  0.578 (0.566, 0.590)
    AUC (95% CI)            0.787 (0.777, 0.796)                  0.764 (0.754, 0.774)
    BS                      0.076                                 0.079
      Note. AUC, area under the curve; BS, Brier score; CI, confidence interval.
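    Sensitivity and specificity, unlike the AUC, depend on the probability threshold used to call a case. The minimal Python sketch below shows the calculation; the function name, the threshold of 0.5, and the seven data points are hypothetical, and the article does not state which cut-off was used for Table 2.

```python
def sensitivity_specificity(y_true, y_prob, threshold):
    """Sensitivity and specificity when risk >= threshold is called a case."""
    tp = fn = tn = fp = 0
    for outcome, prob in zip(y_true, y_prob):
        predicted_case = prob >= threshold
        if outcome == 1 and predicted_case:
            tp += 1  # true positive
        elif outcome == 1:
            fn += 1  # false negative
        elif predicted_case:
            fp += 1  # false positive
        else:
            tn += 1  # true negative
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    outcomes = [1, 1, 1, 0, 0, 0, 0]                       # made-up labels
    predicted = [0.90, 0.60, 0.20, 0.70, 0.30, 0.10, 0.05]  # made-up risks
    print(sensitivity_specificity(outcomes, predicted, threshold=0.5))
```

    Lowering the threshold raises sensitivity at the cost of specificity, which is why two models with similar AUCs can show different trade-offs between the two, as in Table 2.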

    The output of SHAP provided a way to explain the complex relationships in the GBM model. In Supplementary Figure S2 (available in www.besjournal.com), waist-to-hip ratio (WHR) ranked first in the SHAP value ranking (mean SHAP value 0.509). When the air pollutant variable was added to the model, air pollutant exposure ranked fifth (mean SHAP value 0.238) and the order of the traditional predictors changed (Supplementary Figure S3, available in www.besjournal.com). The summary plot was used to indicate the direction of the effect of each predictor on T2DM (Figure 1). Air pollutant exposure showed a long right tail in the plot, indicating that a high concentration of ambient air pollution led to an increased prevalence risk of T2DM. Additionally, the asymmetric distribution of the effect magnitudes of air pollutant exposure on predicted T2DM cases demonstrated non-linear associations between air pollutant exposure and the risk of T2DM[9]. The SHAP summary plot thus provided important evidence of the hazardous effect of air pollution, consistent with previous statistical analyses[8]. SHAP provided a rich, individual-level visualization of feature contributions, which indicated that air pollution elevated the risk of T2DM in an intricate way along with other features.

    The interaction plot was also used to present the complex effects in the model. An interesting interaction effect was found between age and air pollutants. In Supplementary Figure S4 (available in www.besjournal.com), SHAP values increased stepwise in individuals aged 40 to 60 years. However, when air pollutant exposure was considered across ages, older individuals (age > 60) with higher air pollutant exposure appeared to be at higher risk, while younger individuals (age < 40) with higher air pollutant exposure had lower SHAP values (Supplementary Figure S4). Among participants aged 27–30 years, the SHAP value was lowered by approximately 0.2–0.3 points. Similar interaction effects were also observed for other variables (Supplementary Figure S5 and Supplementary Figure S6, available in www.besjournal.com). Wang et al. also employed deep learning neural networks with SHAP to explain predictions for mental disorders[10]. Consistent with that study, our SHAP analysis visualized the complex interaction effects.
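    The additive attribution idea behind SHAP can be illustrated with an exact Shapley value calculation on a toy model. The minimal Python sketch below is not the authors' pipeline and does not use the SHAP library: `toy_risk` is a made-up scoring function with an age × exposure interaction, absent features are replaced by a single hypothetical baseline individual, and all permutations are enumerated, which is feasible only for a handful of features.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline).

    Each feature's value is its marginal contribution averaged over all
    orderings in which features are switched from baseline to x.
    """
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = predict(current)
        for i in order:
            current[i] = x[i]  # reveal feature i
            output = predict(current)
            totals[i] += output - previous_output
            previous_output = output
    return [t / len(orderings) for t in totals]


def toy_risk(features):
    """Made-up risk score with an age x exposure interaction term."""
    age, exposure, whr = features
    return 0.02 * age + 0.5 * exposure + 1.0 * whr + 0.01 * age * exposure


if __name__ == "__main__":
    individual = [60, 1.0, 0.90]  # hypothetical age, exposure score, WHR
    reference = [50, 0.0, 0.85]   # hypothetical baseline individual
    phi = shapley_values(toy_risk, individual, reference)
    print(phi, sum(phi), toy_risk(individual) - toy_risk(reference))
```

    The attributions sum to the difference between the individual's prediction and the baseline prediction, and the interaction term is split between age and exposure; this is the property that lets SHAP plots show both main and interaction effects.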

    Figure 1.  Feature importance ranking of 9 variables in the model. This summary plot illustrated the entire distribution of impacts each feature has on the model output. WHR, waist-to-hip ratio

    Figure S2.  Feature importance of GBM obtained by summing SHAP value magnitudes (bar plot of the traditional predictors). SHAP, SHapley Additive exPlanations; GBM, Gradient Boosting Machine.

    Figure S3.  Feature importance of GBM obtained by summing SHAP value magnitudes (bar plot of the predictors including air pollutants).

    Previous studies have indicated the hazardous effect of air pollutants. However, to the best of our knowledge, no research has explored the role of air pollution in T2DM risk assessment. Moreover, although SHAP with machine learning models has already been applied in air pollution research, the impacts of air pollution on T2DM were still unclear. To our knowledge, this is the first study to examine the effects of ambient air pollutants on T2DM using SHAP. The GBM algorithm accounts for non-linear interactions that cannot be adequately modeled in traditional statistical models, and SHAP richly visualizes the interactions and feature contributions. However, this study also has limitations. The analysis was cross-sectional, with no follow-up data. Moreover, the biological mechanism needs to be further investigated. Future studies can focus on the etiological pathway of air pollutant-related T2DM.

    In summary, considering personal air pollution exposure improved the identification of T2DM cases in the T2DM risk assessment model. Additionally, the explainable machine learning method (SHAP) revealed the contribution of the mixture of ambient air pollutants as well as its interaction effects with traditional predictors such as age. The study demonstrates the importance of considering environmental pollution exposure as a risk factor, which facilitates the prevention and management of T2DM. Human health is influenced by the interaction between the environment and the individual’s condition; it is therefore important to further investigate the contribution of incorporating personal environmental exposures into risk assessment models, which may support primary care physicians' ability to assess the risk of developing chronic diseases.

    No potential conflicts of interest were disclosed.

    The authors thank all of the participants, coordinators, and administrators for their support and help during the research.

    DING Zhong Ao took part in the investigation, methodology and writing of the original draft. ZHANG Li Ying took part in the investigation, data curation, formal analysis and writing of the code. LI Rui Ying, NIU Miao Miao, ZHAO Bo, DONG Xiao Kang, LIU Xiao Tian, HOU Jian and MAO Zhen Xing reviewed the manuscript. WANG Chong Jian took part in the conceptualization, methodology, investigation, validation, supervision, funding acquisition, project administration and review of the manuscript.

    Table S4.  Associations (ORs and 95% CI) of the mixture of ambient air pollutants with T2DM

    Variables            Model 1              Model 2              Model 3
    AAP score
     Each unit change    1.03 (1.02, 1.03)    1.02 (1.02, 1.02)    1.02 (1.01, 1.02)
     T1*                 1.00 (Ref.)          1.00 (Ref.)          1.00 (Ref.)
     T2                  1.54 (1.41, 1.68)    1.36 (1.23, 1.49)    1.29 (1.17, 1.42)
     T3                  1.81 (1.66, 1.98)    1.55 (1.41, 1.70)    1.45 (1.31, 1.60)
    APS
     Each unit change    1.02 (1.02, 1.02)    1.01 (1.01, 1.02)    1.01 (1.01, 1.01)
     T1*                 1.00 (Ref.)          1.00 (Ref.)          1.00 (Ref.)
     T2                  1.52 (1.39, 1.66)    1.33 (1.21, 1.46)    1.26 (1.15, 1.39)
     T3                  1.83 (1.68, 2.00)    1.57 (1.43, 1.72)    1.47 (1.33, 1.62)
      Note. AAP score, ambient air pollution score; APS, air pollution score; T1–T3, tertiles of the AAP score and APS; Model 1 was the crude model; Model 2 was adjusted for age, gender, BMI, waist-to-hip ratio, pulse pressure, and heart rate; Model 3 was further adjusted for more vegetable and fruit intake, physical activity, and family history of diabetes. *Groups were divided by the tertiles of the AAP score and APS.

    Table S5.  Performance metrics of different machine learning classifiers

    Classifier    AUC (traditional)    BS (traditional)    AUC (traditional & air pollutants exposure*)    BS (traditional & air pollutants exposure*)
    GBM           0.764                0.079               0.787                                           0.076
    RF            0.761                0.082               0.762                                           0.082
    ANN           0.712                0.086               0.732                                           0.081
      Note. *Calculated by the quantile g-computation. AUC, area under the curve; BS, Brier score.

    Figure S4.  Main and interaction effects of important features of GBM. (A) SHAP dependence plot for age on T2DM. (B) SHAP dependence plot of age with the interaction of air pollutant exposure. SHAP, SHapley Additive exPlanations; GBM, Gradient Boosting Machine.

    Figure S5.  Main and interaction effects of pulse pressure (PP) with QGS. (A) SHAP dependence plot for PP on T2DM. (B) SHAP dependence plot of PP with the interaction of QGS. SHAP, SHapley Additive exPlanations.

    Figure S6.  Interaction effects of all features of GBM in the traditional & APE model. FamTD indicates family history of diabetes. Colors indicate the level of each variable: red means a higher level and blue means a lower level. GBM, Gradient Boosting Machine.
