Response to Comment on “Censoring Trace-Level Environmental Data: Statistical Analysis Considerations to Limit Bias”
Environmental Science & Technology (IF 9.028), Pub Date: 2021-10-27, DOI: 10.1021/acs.est.1c06431
Barbara Jane George, Kent W. Thomas, Jane Ellen Simmons

We appreciate the thoughtful commentary and the opportunity to continue the discussion on the important topic of dealing with left-censored data. We thank Prof. Hites for reflecting on our introduction and want to reinforce that the challenges of measuring trace-level environmental data and analyzing them statistically are long-standing, complex, and continually evolving. While there is extensive literature that addresses estimation of the mean in the presence of nondetects, there are relatively few recent papers describing advances in statistical approaches and software capability. We chose to build on the literature by assessing bias in means for Type I left-censored data, using freely available modern statistical software.
A limitation of our dibenzo[a,h]anthracene (DBA) case study is that it is based on a single data set (n = 47) in which eight concentration measurements were below the MDL and 26 were below the CCLV. These data illustrate the need to understand and limit bias introduced by the handling of nondetects, and our work turned to simulation to assess moderately and highly skewed log-normal data that, by design, complemented the case study. Each estimated distribution mean and standard deviation (SD) from our simulation study, shown in Figure 2 (1), represents the average from 1000 data sets with sample size n = 50. The estimates are for uncensored samples and for samples where the lowest 30%, 50%, and 80% of the data were censored. Similarly, each estimate for sample size n = 20 in Figures S6–S9 (1) represents the average from 1000 data sets. After censoring each simulated data set, the best-fitting of the normal, log-normal, and gamma distributions was selected for use in maximum likelihood estimation (MLE) and robust regression on order statistics, both of which rely on a distributional assumption. The simulation study is abstract in that its data are not actual analytical measurements, but it has key advantages over a single data set as used in our case study.
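The censoring step of a simulation of this kind can be sketched as follows. This is an illustrative Python sketch, not the simulation code used in the study: the log-scale parameters (mu = 0, sigma = 1), the seed, the function names, and the choice to place the censoring limit at the population quantile (so that the limit is fixed before sampling, as in Type I censoring) are all assumptions made for the example.

```python
import math
import random
from statistics import NormalDist, fmean


def simulate_censored_lognormal(n, mu, sigma, censor_fraction, rng):
    """Draw n log-normal values and left-censor them at a fixed limit.

    The limit is the population quantile at `censor_fraction`, so it is set
    before sampling and the number of nondetects varies between samples.
    """
    limit = math.exp(NormalDist(mu, sigma).inv_cdf(censor_fraction))
    values = [rng.lognormvariate(mu, sigma) for _ in range(n)]
    detected = [v for v in values if v >= limit]
    n_censored = n - len(detected)
    return detected, n_censored, limit


def average_censored_fraction(n_datasets, n, mu, sigma, censor_fraction, seed=1):
    """Average share of nondetects over many simulated data sets."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_datasets):
        _, n_censored, _ = simulate_censored_lognormal(
            n, mu, sigma, censor_fraction, rng
        )
        fractions.append(n_censored / n)
    return fmean(fractions)


# 1000 data sets of size 50 at each of the three censoring levels.
for target in (0.3, 0.5, 0.8):
    observed = average_censored_fraction(1000, 50, 0.0, 1.0, target)
    print(f"target {target:.0%}: average fraction censored {observed:.3f}")
```

Because the limit is fixed rather than chosen from each sample, an individual data set of 50 values will rarely have exactly 30%, 50%, or 80% nondetects; only the average across data sets approaches the target.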
We noted in the discussion that important distinctions of the simulation study relative to the case study are “(1) the true mean and variance of the underlying population distributions are known, (2) bias estimation based on many samples is, in expectation, more accurate, and (3) the bias estimates are tied only to the distributional characteristics, enhancing generalizability beyond DBA” (1). The commentary suggests that the case study means did not distinguish the approaches. We propose using confidence intervals from the simulation study to compare the approaches. Confidence intervals characterize uncertainty in estimates of the mean (2). We assessed confidence interval coverage, that is, the percentage of simulated samples for which the confidence interval contained the true mean, and found that the statistical approaches varied in their estimation performance (Figures S10 and S11 (1)). The takeaway is that the performance of statistical approaches differs across distributional assumptions, distribution skewness, sample sizes, and amounts of censoring. In our simulation study, MLE provided confidence intervals closest to their nominal 95% coverage probability for all three censoring levels. Modern statistical approaches may be especially appropriate when a majority of measurements have been censored, as illustrated for 80% censoring, for example, in our paper and in Shumway et al. (2002) (3). In sum, the biasing effects of censoring depend on both the statistical approach and the data. Whether or not bias is considered substantive is relative to the research question(s) being addressed.
The commentary suggests that medians are often used for left-censored data. MLE is a modern statistical approach that can be used to estimate distribution quantiles such as medians. For continuous Type I censored data, the likelihood function combines the joint density function of the uncensored (measured) values and the cumulative distribution function of the censored values (2).
MLE is a good choice in that its estimates have the intuitive appeal of being the most likely for the observed data (4). The R package EnvStats has functions to estimate distribution medians and their confidence intervals, for example, the eqlnormCensored function with p = 0.5, assuming log-normally distributed data (2) (example code in Supporting Information). Sample geometric means and standard deviations for uncensored data may be estimated using the geoMean and geoSD functions; it is noteworthy that the sample geometric mean is the maximum likelihood estimate of the distribution median for log-normally distributed data (2). A straightforward approach for establishing the functional form of a probability distribution uses distribution parameters; the mean and SD estimates in Figure 2 (1) may be used to estimate the normal, log-normal, and gamma probability distributions corresponding to the simulated samples (2). Alternatively, an empirical cumulative distribution function (ecdf) may be used in plotting observed quantiles to yield a sample-based distribution function. The EnvStats cdfCompareCensored function plots both the ecdf for a sample containing nondetects and, for comparison, its estimated cumulative distribution function. The EnvStats gofTestCensored function also assesses distributional goodness-of-fit. Other R packages for censored data additionally offer functions to estimate distributions and assess goodness-of-fit. For those skilled in the use of Excel, there are R functions that read in data stored in CSV and XLSX files (Example S1 uses the built-in read.csv function (1)).
From Cressie (1994) (5), “what the measurement process takes away, the statistical analysis process tries to recreate!” For the vast number of existing data sets containing censored trace-level environmental data, modern statistical approaches and methods offer powerful tools for estimation that limits bias. We agree with Prof. Hites’ guidance on the importance of designing studies to use analytical methods with sufficient sensitivity to prevent excess nondetects when such methods are available or can be readily developed. We believe the greatest benefit will derive from coupling his advice with this advice: collaborate with well-qualified statisticians beginning at experimental design, and report, along with the analytical methods used in calculations, the limits of detection, any censoring thresholds, all data quality indicators, and all individual measurement data values, appropriately flagged, so that modern statistical methods in readily available software can be used, limiting bias introduced by censored nondetects.
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.est.1c06431.
  • Example R code (PDF)
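One way to store individual measurements with nondetects flagged, and to read them back for analysis, is sketched below in Python. The column names (sample_id, value, nondetect), the convention of recording the censoring limit in the value column for a nondetect, and the numbers themselves are hypothetical and are not taken from the paper or its Supporting Information.

```python
import csv
import io

# Hypothetical file contents; for a nondetect, `value` holds the censoring limit.
EXAMPLE_CSV = """sample_id,value,nondetect
S01,2.10,FALSE
S02,1.50,TRUE
S03,3.40,FALSE
S04,1.50,TRUE
S05,5.60,FALSE
"""


def read_flagged_measurements(handle):
    """Split rows into measured values and the censoring limits of nondetects."""
    detected, censor_limits = [], []
    for row in csv.DictReader(handle):
        value = float(row["value"])
        if row["nondetect"].strip().upper() == "TRUE":
            censor_limits.append(value)
        else:
            detected.append(value)
    return detected, censor_limits


detected, censor_limits = read_flagged_measurements(io.StringIO(EXAMPLE_CSV))
n_total = len(detected) + len(censor_limits)
print(f"{len(censor_limits)} of {n_total} values are nondetects")
```

Keeping the flag in its own column, rather than substituting a value for each nondetect, preserves the information that censored-data methods need. To read from disk instead of the in-memory example, the same function can be passed a file opened with open(path, newline=""), where path is whatever file name is in use.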
This response to comment has been subjected to Agency review and approved for publication. Approval does not signify that contents necessarily reflect the views and policies of the Agency and no official endorsement should be inferred. The mention of trade names or commercial products does not constitute endorsement or recommendation for use.
We thank Drs. J. Zambrana, Jr., and A. R. Olsen for thoughtful review of this response to comment.
This article references 5 other publications.