Invited Sessions

Double/Debiased Machine Learning
In problems involving missing data or causal inference, biometricians may wish to estimate one or a small number of quantities of interest but find that this requires estimating more complex 'nuisance' functions. For example, we might wish to estimate the average causal effect (ACE) of an exposure, defined as the expected difference between an individual's outcome if exposed and outcome if unexposed, from a sample where exposure is not randomly assigned. Commonly used techniques for this are inverse probability weighting (IPW) and regression imputation (RI). Both involve estimating a nuisance function: the conditional probability of exposure for IPW and the conditional expectation of the outcome for RI. Parametric models could be specified for these, but it is tempting to use flexible machine-learning techniques, to reduce the risk of model misspecification. There is, however, a problem associated with naive use of such methods for this purpose: machine-learning estimators of nuisance functions typically converge slowly, and this slow convergence may affect the convergence rate of the estimator of the quantity of ultimate interest, e.g. the ACE. Such slow convergence greatly complicates the construction of valid confidence intervals. Debiased machine learning is a group of techniques designed to address this problem. I shall provide an introduction to these techniques.
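To make the cross-fitting idea concrete, here is a minimal sketch (my illustration, not the speaker's code) of a debiased machine-learning estimate of the ACE that combines the two nuisance functions mentioned above in an augmented IPW score. It assumes a numeric covariate matrix X, a binary exposure vector A and an outcome vector Y as NumPy arrays; the random-forest learners are placeholders for any machine-learning method.

```python
# Illustrative sketch: cross-fitted AIPW/DML estimate of the average causal effect.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

def dml_ace(X, A, Y, n_splits=5, random_state=0):
    """Cross-fitted augmented IPW estimate of E[Y(1) - Y(0)]."""
    n = len(Y)
    psi = np.zeros(n)  # per-observation values of the orthogonal score
    for train, test in KFold(n_splits, shuffle=True, random_state=random_state).split(X):
        # Nuisance 1: propensity score e(x) = P(A = 1 | X = x)
        e_hat = RandomForestClassifier(random_state=random_state).fit(
            X[train], A[train]).predict_proba(X[test])[:, 1]
        e_hat = np.clip(e_hat, 0.01, 0.99)  # avoid extreme weights
        # Nuisance 2: outcome regressions m_a(x) = E[Y | A = a, X = x]
        m1 = RandomForestRegressor(random_state=random_state).fit(
            X[train][A[train] == 1], Y[train][A[train] == 1]).predict(X[test])
        m0 = RandomForestRegressor(random_state=random_state).fit(
            X[train][A[train] == 0], Y[train][A[train] == 0]).predict(X[test])
        # Neyman-orthogonal (AIPW) score evaluated on the held-out fold
        psi[test] = (m1 - m0
                     + A[test] * (Y[test] - m1) / e_hat
                     - (1 - A[test]) * (Y[test] - m0) / (1 - e_hat))
    ace = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(n)  # standard error remains valid despite slow nuisance rates
    return ace, se
```

The orthogonal score and sample splitting are what allow root-n inference on the ACE even when each nuisance estimator converges more slowly.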
Quantifying causal effects in the presence of complex and multivariate outcomes is a key challenge in evaluating treatment effects. For hierarchical multivariate outcomes, the FDA recommends the Win Ratio and Generalized Pairwise Comparisons approaches. However, as far as we know, these empirical methods lack causal or statistical foundations to justify their broader use in recent studies. To address this gap, we establish causal foundations for hierarchical comparison methods. We define related causal effect measures, and highlight that depending on the methodology used to compute Win Ratios or Net Benefits of treatments, the targeted causal estimand can differ, as proved by our consistency results. Quite dramatically, it appears that the causal estimand related to the historical estimation approach can yield reversed and incorrect treatment recommendations in heterogeneous populations, as we illustrate through striking examples. To compensate for this fallacy, we introduce a novel, individual-level yet identifiable causal effect measure that better approximates the ideal, non-identifiable individual-level estimand. We prove that computing Win Ratios or Net Benefits using a Nearest Neighbor pairing approach between treated and control patients, an approach that can be seen as an extreme form of stratification, leads to estimating this new causal estimand. We extend our methods to observational settings via propensity weighting, distributional regression to address the curse of dimensionality, and a doubly robust framework. We prove the consistency of our methods, and the double robustness of our augmented estimator. These methods are straightforward to implement, making them accessible to practitioners. Finally, we validate our approach using synthetic data and an observational oncology dataset.
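As a rough sketch of the nearest-neighbour pairing idea described above (my illustration under simplifying assumptions, not the authors' implementation), each treated patient is matched to the closest control in covariate space, the pair is compared hierarchically on the prioritised outcomes, and wins and losses are aggregated into a Win Ratio and a Net Benefit. Array names and the margin parameters are assumptions.

```python
# Illustrative sketch: Win Ratio and Net Benefit from nearest-neighbour pairing.
import numpy as np
from scipy.spatial import cKDTree

def hierarchical_compare(y_t, y_c, margins):
    """Compare two outcome vectors in priority order (higher is better).
    Returns +1 (treated wins), -1 (control wins) or 0 (tie on all levels)."""
    for k, margin in enumerate(margins):
        if y_t[k] - y_c[k] > margin:
            return 1
        if y_c[k] - y_t[k] > margin:
            return -1
    return 0

def nn_win_ratio(X_t, Y_t, X_c, Y_c, margins):
    """Pair each treated patient with their nearest control in covariate space,
    then aggregate hierarchical wins and losses over the pairs."""
    tree = cKDTree(X_c)
    _, idx = tree.query(X_t)  # index of the nearest control for each treated patient
    results = np.array([hierarchical_compare(Y_t[i], Y_c[j], margins)
                        for i, j in enumerate(idx)])
    wins, losses = (results == 1).sum(), (results == -1).sum()
    win_ratio = wins / losses if losses > 0 else np.inf
    net_benefit = (wins - losses) / len(results)
    return win_ratio, net_benefit
```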
Recent work has focused on nonparametric estimation of conditional treatment effects, but inference has remained relatively unexplored. We propose a class of nonparametric tests for both quantitative and qualitative treatment effect heterogeneity. The tests can incorporate a variety of structured assumptions on the conditional average treatment effect, allow for both continuous and discrete covariates, and do not require sample splitting. Furthermore, we show how the tests are tailored to detect alternatives where the population impact of adopting a personalised decision rule differs from using a rule that discards covariates. The proposal is thus relevant for guiding treatment policies. The utility of the proposal is borne out in simulation studies and a re-analysis of an AIDS clinical trial. This is joint work with Mats Stensrud, Riccardo Brioschi and Aaron Hudson.
Statistical Methods for sustainable management of natural resources
Increased availability of genetic data has made it possible to attain sufficient sample sizes for population-level applications in fisheries science. In this presentation, I will talk about the statistical methods underlying the close-kin mark-recapture approach, which is increasingly used for abundance estimation in support of fisheries management or conservation. The approach consists of recapturing individuals via their relatives, identifying them using genetic information such as single-nucleotide polymorphism (SNP) markers. Various simplifying assumptions can be made to overcome data limitations, such as uncertain or absent age information for sampled individuals. Two marine fish species with contrasting biology, thornback ray and meagre, will serve as case studies to evaluate the sensitivity of results to different modelling decisions and illustrate the additional biological insights that can be gained.
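The core of close-kin mark-recapture can be conveyed with a back-of-envelope version (my sketch, not the speaker's model): every juvenile-adult comparison is a parent-offspring pair with probability roughly 2/N, where N is the adult abundance, so the number of pairs found carries information about N. Real applications replace this constant kinship probability with age-, growth- and fecundity-dependent probabilities; the function and variable names below are illustrative.

```python
# Illustrative sketch of the basic close-kin mark-recapture abundance estimator.
from scipy.stats import poisson

def ckmr_naive_estimate(n_juveniles, n_adults, n_pops):
    """Method-of-moments estimate from E[#POPs] = n_J * n_A * 2 / N."""
    n_comparisons = n_juveniles * n_adults
    return 2 * n_comparisons / n_pops

def ckmr_loglik(N, n_juveniles, n_adults, n_pops):
    """Poisson pseudo-likelihood for the observed parent-offspring pair count."""
    expected = n_juveniles * n_adults * 2.0 / N
    return poisson.logpmf(n_pops, expected)

# Example: 600 juveniles and 800 adults genotyped, 12 parent-offspring pairs found
print(ckmr_naive_estimate(600, 800, 12))  # about 80,000 adults
```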
Understanding the impact of climate change on tropical rainforest ecosystems is crucial to promote efficient conservation strategies. The classical approach remains the use of species-specific distribution models. However, in species-rich ecosystems with many rare species, such an approach is doomed to failure. Moreover, univariate approaches ignore species dependencies, yet biodiversity is not merely the sum of species but the result of multiple interactions. Modeling multivariate count data allowing for flexible dependencies as well as zero inflation and overdispersion is challenging. In this presentation, we develop a new family of models called the zero-inflated binary tree Pólya-splitting models. This family allows the decomposition of multivariate count data into successive sub-models along a known binary partition tree. In the first part, I will present the general form of the zero-inflated binary tree Pólya-splitting model and study its marginal and conditional properties (distributions and moments). The second part presents the extension to the regression context. Finally, I will present results from a real case study based on an impressive dataset consisting of the abundance of more than 180 tree taxa sampled on 1,571 plots covering more than 6 million hectares of the Congo Basin tropical rainforests.
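To illustrate the general construction (my own simplified sketch, not the exact model of the talk), the total count at the root of a known binary partition tree is split recursively between the two children of each internal node using a zero-inflated Pólya (beta-binomial) split, until each leaf, i.e. each taxon, receives its count. All parameter names and the placement of the zero inflation are assumptions for illustration.

```python
# Illustrative simulation from a zero-inflated binary-tree Polya-splitting construction.
import numpy as np

rng = np.random.default_rng(0)

def zi_polya_split(total, node, alpha, pi_zero):
    """Recursively split a total count along a known binary partition tree.
    node: nested tuple, e.g. (("sp1", "sp2"), "sp3"); leaves are taxon names.
    alpha[node]: beta-binomial (Polya) parameters of the split at each internal node.
    pi_zero[node]: probability of a structural zero on the left branch of that node."""
    if isinstance(node, str):          # leaf = one taxon
        return {node: total}
    left, right = node
    a_l, a_r = alpha[node]
    if rng.random() < pi_zero[node]:   # zero inflation: structural zero on the left branch
        n_left = 0
    else:
        p = rng.beta(a_l, a_r)         # Polya (beta-binomial) split of the count
        n_left = rng.binomial(total, p)
    counts = zi_polya_split(n_left, left, alpha, pi_zero)
    counts.update(zi_polya_split(total - n_left, right, alpha, pi_zero))
    return counts

# Example: three taxa on a known binary partition tree
tree = (("sp1", "sp2"), "sp3")
alpha = {tree: (2.0, 5.0), ("sp1", "sp2"): (1.0, 1.0)}
pi_zero = {tree: 0.2, ("sp1", "sp2"): 0.4}
print(zi_polya_split(100, tree, alpha, pi_zero))
```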
Citizen science data have become a key source for understanding ecological systems and informing conservation, but working with these data brings many statistical challenges. Key challenges to address analytically include spatial and temporal biases, observer preferences, and observer experience. I'll outline how we've approached these challenges to analyse eBird data and learn more about avian ecology. eBird is the largest biodiversity citizen science project in the world and contains over 1 billion bird observations contributed by over a million participants. We can use these data to estimate bird species distributions, migratory movements, demographics, and population trends. Each of these targets of ecological inference requires different consideration of the datasets and different analytical methods. I'll outline how we use a spatio-temporal ensemble of machine learning models to estimate bird distributions and migratory movements, and use double machine learning methods to account for confounding factors in estimating bird population trends. I'll also outline how we can for the first time use large-scale observational data to estimate demographic parameters. All of these methods allow us to create new ecological knowledge from unstructured citizen science data.

Knockoff filtering for high-dimensional feature selection