# 2017 Rao Prize Conference - Abstracts

## Main Content

### Essential Concepts of Causal Inference in randomized experiments and observational studies: a remarkable history

Donald Rubin

This title may appear provocative because of its “post-colon” tag. But I think that it is an accurate characterization for a few reasons, which will be developed. First, I believe that the essential ideas underlying valid statistical inference for causal effects are less than a century old, and really are related to deeper ideas maturing in 20th-century science at the same time. Second, I believe that these essential ideas in causal inference evolved from ones that were first mathematically formalized within the relatively narrow context of ideal randomized experiments, first proposed by Fisher in 1925, and then grew from being somewhat naively applied to more complex situations with humans. But these classical insights did not enter the world of causal inference in non-randomized studies for half a century, and are still only partially understood by many practitioners who rely on non-randomized studies to draw causal inferences.

### A clustering problem arising in psychiatry

Satish Iyengar

Current psychiatric diagnoses are based primarily on self-reported experiences. Unfortunately, treatments for the diagnoses are not effective for all patients. One hypothesized reason is the “artificial grouping of heterogeneous syndromes with different pathophysiological mechanisms into one disorder.” To address this problem, the US National Institute of Mental Health instituted the Research Domain Criteria framework in 2009. This research framework calls for integrating data from many levels of information: genes, cells, molecules, circuits, physiology, behavior, and self-report. Clustering comes to the forefront as a key tool in this effort. In this talk, I present a case study of the use of mixture models to cluster older adults based on measures of sleep from three domains: diary, actigraphy, and polysomnography. Challenges in this effort include the use of mixtures of skewed distributions, a large number of potential clustering variables, and seeking clinically meaningful solutions. We present novel variable selection algorithms, study them by simulation, and demonstrate our methods on the sleep data. This work is joint with Meredith Wallace.

### Addressing bias from unmeasured dispositions in observational studies

Paul Rosenbaum

There are two treatments, each of which may be applied or withheld, yielding a 2×2 factorial arrangement with three degrees of freedom between groups. The differential effect of the two treatments is the effect of applying one treatment in lieu of the other. In randomized experiments, the differential effect is of no more or less interest than other treatment contrasts. Differential effects play a special role in certain observational studies in which treatments are not assigned to subjects at random, where differing outcomes may reflect biased assignments rather than effects caused by the treatments. Differential effects are immune to certain types of unobserved bias, called generic biases, which are associated with both treatments in a similar way. This is exemplified using three familiar models, a Rasch model, a symmetric multivariate logit model and a preference tree model. Differential effects are not immune to differential biases, whose possible consequences are examined by sensitivity analysis. Under certain conditions, the differential comparison of two treatments balances other treatments, including unmeasured treatments, that are governed by the same unmeasured disposition. Three scientific examples are presented.
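As a notational sketch of the setup above (the symbol $r_{ab}$ is an assumption introduced here for illustration and does not appear in the abstract), write $r_{ab}$ for the response a unit would exhibit when the first treatment is applied ($a = 1$) or withheld ($a = 0$) and the second treatment is applied ($b = 1$) or withheld ($b = 0$). The four cells of the 2×2 arrangement and one natural reading of the differential effect are then:

$$
\{\, r_{00},\; r_{10},\; r_{01},\; r_{11} \,\}, \qquad \text{differential effect} = r_{10} - r_{01}.
$$

Under this reading, the comparison involves only units that received exactly one of the two treatments.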

### Big data, Google and disease detection: the statistical story

Samuel Kou

Big data collected from the internet have generated significant interest not only in the academic community but also in industry and government agencies. They bring great potential for tracking and predicting massive social activities. We focus on tracking disease epidemics in this talk. We will discuss the applications, in particular Google Flu Trends, some of the fallacies, and the statistical implications. We will propose a new model that utilizes publicly available online data to estimate disease epidemics. Our model outperforms all previous real-time tracking models for influenza epidemics at the national level of the US. We will also draw some lessons for big data applications.

### Combining Experimental and Non-Experimental Design in Causal Inference

Kari Lock Morgan

“Design Trumps Analysis” is a hallmark phrase of Don Rubin. In this spirit, this talk will focus on both experimental and non-experimental design for causal inference. The study involved is designed to assess the impact of Knowledge in Action (KIA), a form of project-based learning, in Advanced Placement (AP) courses. The study contains both a randomized experiment and an observational study. Rerandomization was used in the design of the experiment, and this method will be described, as well as its impact on covariate balance, precision, power, and inference. Propensity score matching is used for the observational study. This talk will discuss the design of each study component separately, experimental and non-experimental, and then compare the two for estimating the two-year effect of KIA, resulting in a new variation on an old theme: a bias-variance trade-off.
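The general idea of rerandomization can be illustrated with a minimal sketch. It is not the procedure used in the KIA study: the single covariate, the balance criterion (absolute standardized difference in means), the threshold, and the function names are all assumptions chosen for illustration. The sketch redraws a complete randomization until the chosen balance criterion is met.

```python
import random
import statistics


def mean_difference(covariate, assignment):
    """Difference in covariate means, treated minus control."""
    treated = [x for x, w in zip(covariate, assignment) if w == 1]
    control = [x for x, w in zip(covariate, assignment) if w == 0]
    return statistics.mean(treated) - statistics.mean(control)


def rerandomize(covariate, n_treated, threshold, rng, max_draws=10_000):
    """Redraw the assignment until the standardized imbalance is acceptable.

    Returns the accepted assignment, its imbalance, and the number of draws.
    """
    n = len(covariate)
    sd = statistics.stdev(covariate)
    base = [1] * n_treated + [0] * (n - n_treated)
    for draw in range(1, max_draws + 1):
        assignment = base[:]
        rng.shuffle(assignment)  # one complete randomization
        imbalance = abs(mean_difference(covariate, assignment)) / sd
        if imbalance <= threshold:  # hypothetical acceptance rule
            return assignment, imbalance, draw
    raise RuntimeError("no acceptable randomization found")


rng = random.Random(0)
covariate = [rng.gauss(0, 1) for _ in range(40)]  # hypothetical baseline covariate
assignment, imbalance, draws = rerandomize(covariate, 20, 0.1, rng)
print(f"accepted after {draws} draw(s); standardized imbalance = {imbalance:.3f}")
```

Because only balanced assignments are accepted, any later analysis should account for the restricted set of possible randomizations, which is why the abstract lists inference among the aspects affected.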

### Causal Inference with a Continuous Treatment and Outcome: Alternative Estimators for Parametric Dose-Response Functions

Joseph Schafer

Causal inference with a continuous treatment is a relatively under-explored problem. In this dissertation, we adopt the potential outcomes framework. Potential outcomes are responses that would be seen for a unit under all possible treatments. In an observational study where the treatment is continuous, the potential outcomes are an uncountably infinite set indexed by treatment dose. We parameterize this unobservable set as a linear combination of a finite number of basis functions whose coefficients vary across units. This leads to new techniques for estimating the population average dose-response function (ADRF). Some techniques require a model for the treatment assignment given covariates, some require a model for predicting the potential outcomes from covariates, and some require both. We develop these techniques using a framework of estimating functions, compare them to existing methods for continuous treatments, and simulate their performance in a population where the ADRF is linear and the models for the treatment and/or outcomes may be misspecified. We also extend the comparisons to a data set of lottery winners in Massachusetts. Next, we describe the methods and functions in the R package causaldrf using data from the National Medical Expenditure Survey (NMES) and Infant Health and Development Program (IHDP) as examples. Additionally, we analyze the National Growth and Health Study (NGHS) data set and deal with the issue of missing data. Lastly, we discuss future research goals and possible extensions.
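The parameterization described above can be written out as a short sketch. The symbols $Y_i(t)$, $b_k$, $\theta_{ik}$, $K$, and $\mu(t)$ are notation assumed here for illustration and are not taken from the abstract. If the potential outcome of unit $i$ at dose $t$ is a linear combination of $K$ basis functions with unit-specific coefficients, the ADRF follows by averaging over units:

$$
Y_i(t) = \sum_{k=1}^{K} \theta_{ik}\, b_k(t), \qquad \mu(t) = \mathbb{E}\left[ Y_i(t) \right] = \sum_{k=1}^{K} \mathbb{E}\left[ \theta_{ik} \right] b_k(t).
$$

For example, the basis $b_1(t) = 1$, $b_2(t) = t$ yields a linear ADRF, $\mu(t) = \mathbb{E}[\theta_{i1}] + \mathbb{E}[\theta_{i2}]\, t$, so estimating the ADRF reduces to estimating the mean coefficients.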

### Estimating the Malaria Attributable Fever Fraction Accounting for Parasites Being Killed by Fever and Measurement Error

Dylan Small

Malaria is a parasitic disease that is a major health problem in many tropical regions. The most characteristic symptom of malaria is fever. The fraction of fevers that are attributable to malaria, the malaria attributable fever fraction (MAFF), is an important public health measure for assessing the effect of malaria control programs and for other purposes. Estimating the MAFF is not straightforward because there is no gold standard diagnosis of a malaria attributable fever; an individual can have malaria parasites in her blood and a fever, yet she may have developed partial immunity that allows her to tolerate the parasites, with the fever being caused by another infection. We define the MAFF using the potential outcome framework for causal inference and show what assumptions underlie current estimation methods. Current estimation methods rely on an assumption that the parasite density is correctly measured. However, this assumption does not generally hold because (i) fever kills some parasites and (ii) the measurement of parasite density has measurement error. In the presence of these problems, we show that current estimation methods do not perform well. We propose a novel maximum likelihood estimation method based on exponential family g-modeling. Under the assumption that the measurement error mechanism and the magnitude of the fever killing effect are known, we show that our proposed method provides approximately unbiased estimates of the MAFF in simulation studies. A sensitivity analysis can be used to assess the impact of different magnitudes of fever killing and different measurement error mechanisms. We apply our proposed method to estimate the MAFF in Kilombero, Tanzania. This is joint work with Kwonsang Lee.