Assistant Professor Irina Gaynanova, Department of Statistics, Texas A&M University
Structural Learning and Integrative Decomposition of Multi-View Data
The increased availability of multi-view data (data on the same samples from multiple sources) has led to strong interest in models based on low-rank matrix factorizations. These models represent each data view via shared and individual components, and have been successfully applied to exploratory dimension reduction, association analysis between the views, and further learning tasks such as consensus clustering. Despite these advances, significant challenges remain in modeling partially-shared components and in identifying the number of components of each type (shared, partially-shared, individual). In this work, we formulate a novel linked component model that directly incorporates partially-shared structures. We call this model SLIDE for Structural Learning and Integrative DEcomposition of multi-view data. We prove the existence of the SLIDE decomposition and explicitly characterize the identifiability conditions. The proposed model fitting and selection techniques allow for joint identification of the number of components of each type, in contrast to existing sequential approaches. In our empirical studies, SLIDE demonstrates excellent performance in both signal estimation and component selection. We further illustrate the methodology on breast cancer data from The Cancer Genome Atlas repository. This is joint work with Gen Li.
Assistant Professor Stefan Wager, Stanford Graduate School of Business
Quasi-Oracle Estimation of Heterogeneous Causal Effects
Many scientific and engineering challenges, ranging from personalized medicine to customized marketing recommendations, require an understanding of treatment effect heterogeneity. In this paper, we develop a class of two-step algorithms for heterogeneous treatment effect estimation in observational studies. We first estimate marginal effects and treatment propensities to form an objective function that isolates the heterogeneous treatment effects, and then optimize the learned objective. This approach has several advantages over existing methods. From a practical perspective, our method is very flexible and easy to use: in both steps, we can use any method of our choice, e.g., penalized regression, a deep net, or boosting; moreover, these methods can be fine-tuned by cross-validating on the learned objective. Meanwhile, in the case of penalized kernel regression, we show that our method has a quasi-oracle property, whereby even if our pilot estimates for marginal effects and treatment propensities are not particularly accurate, we achieve the same regret bounds as an oracle who has a priori knowledge of these nuisance components. We implement variants of our method based on both penalized regression and convolutional neural networks, and find promising performance relative to existing baselines.
Joint work with Xinkun Nie.
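The two-step recipe in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes plain least squares for both the pilot estimates and the effect function (the abstract allows any learner), and the function names, the two-fold cross-fitting, and the simulated randomized design are all hypothetical choices made for the example.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def two_step_effects(X, w, y, n_folds=2, seed=0):
    """Step 1: cross-fitted pilot estimates of E[y|x] and E[w|x].
    Step 2: minimize sum(((y - m_hat) - (w - e_hat) * tau(x))**2) over linear tau."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = rng.permutation(n) % n_folds
    m_hat = np.empty(n)
    e_hat = np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        m_hat[test] = predict_linear(fit_linear(X[train], y[train]), X[test])
        e_hat[test] = predict_linear(fit_linear(X[train], w[train]), X[test])
    y_res, w_res = y - m_hat, w - e_hat
    # tau(x) = c0 + c'x, so the learned objective is least squares on w_res * [1, x].
    A = np.column_stack([np.ones(n), X])
    tau_coef, *_ = np.linalg.lstsq(A * w_res[:, None], y_res, rcond=None)
    return tau_coef

def simulate(n=4000, seed=1):
    """Assumed toy design: randomized treatment, true effect tau(x) = 1 + 2 * x0."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    w = rng.binomial(1, 0.5, size=n).astype(float)
    tau = 1.0 + 2.0 * X[:, 0]
    y = X[:, 0] - X[:, 1] + w * tau + rng.normal(size=n)
    return X, w, y

X, w, y = simulate()
print(two_step_effects(X, w, y))  # coefficients of tau; the truth is (1, 2, 0)
```

Swapping `fit_linear` for a more flexible learner in either step leaves the structure unchanged, which is the flexibility the abstract emphasizes.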
Professor Michael Sobel, Department of Statistics, Columbia University
Estimating Causal Effects in Studies of Human Brain Function: New Models, Methods and Estimands
Professor Martin Lindquist, Department of Biostatistics, Johns Hopkins University
Brain Signatures and Models in Translational Neuroimaging
Despite the great promise in using functional neuroimaging to map brain to mind and understand human health and brain disorders, thus far it has had minimal impact on clinical practice and public health. Here, we review emerging techniques that, if used judiciously, could enable a quantum leap forward in developing translational applications. First, we review the state of translational neuroimaging and emerging techniques. Then, we outline an approach that uses these new techniques in specific ways to develop brain signatures that can be shared, tested in multiple contexts, and used in translational settings. The approach brings together ideas from statistics, ‘big data,’ replicability, and open science: ideas that, together, constitute a cultural shift in science that is bringing translational goals within reach.
Professor Nicole Lazar, Department of Statistics, University of Georgia
Semiparametric estimation under shape invariance for fMRI data
Functional magnetic resonance imaging (fMRI) data pose many statistical challenges, owing to their size, noisiness, and complicated correlation structure. In this talk, I will give an introduction to fMRI data collection and analysis. Then, motivated by a study of practice effects, I will introduce a semiparametric functional data analysis approach under shape invariance for group comparisons. The components of this analysis suite include function estimation using local polynomial regression, a shape invariant model for the relevant function estimates, and evolutionary algorithms for parameter estimation. Taken together, these steps admit a principled comparison of practice effects within and across study groups of interest.
Associate Professor Ziyue Liu, Department of Biostatistics, Indiana University School of Public Health and School of Medicine
Fast Data Driven Adaptive Spline Smoothing
We propose a computationally efficient method for data-driven adaptive spline smoothing. Finite differences are adopted to characterize the roughness pattern of the underlying function. These differences are then clustered into several groups, each of which is modeled with a separate penalty parameter. A state space representation is developed for implementation. Simulations show that the proposed method is computationally fast, with a median computation time of 5% to 30% of that of existing methods. They also show that the proposed method performs well, in terms of mean squared error, on several typical test functions for adaptive smoothing. Application to a shock wave lithotripsy data example shows that the proposed method generates function estimates that agree with the corresponding physical properties.
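The idea of grouping finite differences and giving each group its own penalty can be sketched with a discrete second-difference smoother. This is only an illustration under stated assumptions: it solves a penalized least-squares problem directly rather than through the state space representation, it uses quantile binning of a pilot fit as a stand-in for the clustering step, and the penalty values are supplied by hand rather than estimated. All function names are hypothetical.

```python
import numpy as np

def penalized_smooth(y, lam):
    """Minimize ||y - f||^2 + sum_j lam[j] * (second difference of f at j)^2.
    `lam` may be a scalar or a vector of length len(y) - 2."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)  # (n - 2) x n second-difference matrix
    lam = np.broadcast_to(np.asarray(lam, dtype=float), (n - 2,))
    return np.linalg.solve(np.eye(n) + D.T @ (lam[:, None] * D), y)

def roughness_groups(y, n_groups=2, pilot_lam=10.0):
    """Label each second difference by the roughness of a pilot fit.
    Label 0 is the smoothest group. Quantile binning stands in for clustering."""
    pilot = penalized_smooth(y, pilot_lam)
    rough = np.abs(np.diff(pilot, n=2))
    cuts = np.quantile(rough, np.linspace(0, 1, n_groups + 1)[1:-1])
    return np.searchsorted(cuts, rough)

def adaptive_smooth(y, group_lams, pilot_lam=10.0):
    """Smooth with one penalty per roughness group (smoothest group first)."""
    labels = roughness_groups(y, n_groups=len(group_lams), pilot_lam=pilot_lam)
    return penalized_smooth(y, np.asarray(group_lams, dtype=float)[labels])

# Toy signal: a slow trend with one sharp bump, plus noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)
signal = np.sin(2 * np.pi * t) + 2.0 * np.exp(-((t - 0.5) / 0.02) ** 2)
y = signal + 0.1 * rng.normal(size=t.size)
fit = adaptive_smooth(y, group_lams=[100.0, 1.0])  # heavy penalty where smooth
print(np.mean((fit - signal) ** 2))
```

Giving the smooth regions a large penalty and the rough regions a small one is what lets a single fit follow the bump without undersmoothing the rest of the curve.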
Tianxi Li, Department of Statistics, University of Michigan
Statistical tools for analyzing network-linked data
While classic statistical tools such as regression and graphical models have been well studied, they are no longer applicable when the observations are connected by a network, an increasingly common situation in modern complex datasets. We develop the analogue of loss-based prediction models and graphical models for such network-linked data, via a network-based penalty that can be combined with any number of existing techniques. We show, both empirically and theoretically, that incorporating network information improves performance on a variety of tasks under the assumption of network cohesion, the empirically observed phenomenon of linked nodes acting similarly. Computationally efficient algorithms are also developed for implementing our proposal. We also consider the general question of how to perform cross-validation and bootstrapping on networks, a long-standing open problem in network analysis. Model selection and tuning for many tasks can be performed through cross-validation, but splitting network data is non-trivial, since removing links leads to a potential change in network structure. We propose a new general cross-validation strategy for networks, based on repeatedly removing edge values at random and then applying matrix completion to reconstruct the full network. We obtain theoretical guarantees for this method under a low rank assumption on the underlying edge probability matrix, and show that the method is computationally efficient and performs well for a wide range of network tasks, in contrast to previously developed approaches that only apply under specific models. Several real-world examples will be discussed throughout the talk, including the effect of friendship networks on adolescent marijuana usage, phrases that can be learned with the help of a collaboration network of statisticians, and statistician communities extracted from a citation network.
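A network-based penalty of the kind described above can be illustrated in its simplest form: node-level effects fitted to responses, with a graph Laplacian penalty that pulls linked nodes together. The sketch below assumes squared-error loss and no covariates, which is a special case chosen for brevity; the function names and the toy path graph are hypothetical.

```python
import numpy as np

def laplacian(adj):
    """Graph Laplacian L = D - A of a symmetric adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    return np.diag(adj.sum(axis=1)) - adj

def cohesion_fit(y, adj, lam):
    """Minimize ||y - alpha||^2 + lam * alpha' L alpha over node effects alpha.
    For an unweighted graph, alpha' L alpha is the sum over edges (u, v) of
    (alpha_u - alpha_v)^2, so the penalty shrinks linked nodes toward each other."""
    y = np.asarray(y, dtype=float)
    return np.linalg.solve(np.eye(len(y)) + lam * laplacian(adj), y)

# Toy example: five nodes on a path 0 - 1 - 2 - 3 - 4.
adj = np.diag(np.ones(4), 1)
adj = adj + adj.T
y = np.array([1.0, 3.0, 2.0, 8.0, 6.0])
for lam in (0.0, 1.0, 100.0):
    print(lam, cohesion_fit(y, adj, lam))
```

With `lam = 0` the fit reproduces the responses; as `lam` grows on a connected graph, the fitted values move toward their common mean. Choosing `lam` is where a network cross-validation scheme would come in.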
Jason Klusowski, Department of Statistics and Data Science, Yale University
Counting motifs and connected components of large graphs via subgraph sampling
Learning properties of large graphs from samples is an important problem in statistical network analysis. We revisit the problem of estimating the number of connected components in a graph of $N$ vertices based on the subgraph sampling model, where we observe the subgraph induced by $n$ vertices drawn uniformly at random. The key question is whether it is possible to achieve accurate estimation, i.e., vanishing normalized mean-square error, by sampling a vanishing fraction of the vertices. We show that it is possible using only a sublinear number of samples if the graph does not contain high-degree vertices or long induced cycles; otherwise it is impossible. We obtain optimal sample complexity bounds for several classes of graphs including forests, cliques, and more generally, chordal graphs, achieved by linear-time estimators. The methodology relies on topological identities of graph homomorphism numbers. They, in turn, also play a key role in proving minimax lower bounds based on constructing random instances of graphs with matching structures of small subgraphs. We will also discuss results for the neighborhood sampling model, where we additionally observe the edges between the sampled vertices and their neighbors. In this setting, we will show how to construct optimal estimators of motif counts that are adaptive to certain unknown graph parameters.
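For the forest case, the subgraph sampling model admits a very short worked example. A forest with $N$ vertices and $E$ edges has $N - E$ connected components, and each edge survives in an induced subgraph on $n$ uniformly sampled vertices with probability $n(n-1)/(N(N-1))$, so rescaling the observed edge count gives an unbiased estimate. The sketch below assumes $N$ is known and the graph is a forest; it is an illustration of the sampling model, not necessarily the estimator analyzed in the talk, and the function name is hypothetical.

```python
from itertools import combinations

def estimate_components_forest(edges, sampled, n_total):
    """Unbiased estimate of the number of components of a forest from the
    subgraph induced by `sampled`, assuming the vertex count n_total is known."""
    sampled = set(sampled)
    n = len(sampled)
    if n < 2:
        raise ValueError("need at least two sampled vertices")
    # Only edges with both endpoints in the sample are observed.
    observed = sum(1 for u, v in edges if u in sampled and v in sampled)
    scale = n_total * (n_total - 1) / (n * (n - 1))  # 1 / P(edge is observed)
    return n_total - scale * observed

# Forest on vertices 0..5 with components {0, 1, 2}, {3, 4}, {5}.
edges = [(0, 1), (1, 2), (3, 4)]
estimates = [estimate_components_forest(edges, s, 6)
             for s in combinations(range(6), 3)]
print(sum(estimates) / len(estimates))  # 3.0, the true number of components
```

Averaging over every possible sample of three vertices recovers the true count exactly, which is what unbiasedness means here. Individual estimates can be far off, and controlling that variance is the hard part.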
Min Xu, Postdoctoral Researcher, Department of Statistics, Wharton School of the University of Pennsylvania
Community Estimation on Weighted Networks
Community identification in a network is an important problem in fields such as social science, neuroscience, and genetics. Over the past decade, stochastic block models (SBMs) have emerged as a popular statistical framework for this problem. However, SBMs have an important limitation in that they are suited only for networks with unweighted edges; disregarding the edge weights may result in a loss of valuable information in various scientific applications. We propose a weighted generalization of the SBM where we model the probability distribution of the edge weights as a mixture whose latent components reflect the latent community structure of the network. In this model, observations consist of a weighted adjacency matrix where the weight of each edge is generated independently from one of two unknown probability densities depending on whether the edge is within-community or between-community. We characterize the optimal rate of mis-clustering error of the weighted SBM in terms of the Rényi divergence of order 1/2 between the weight distributions of within-community and between-community edges, substantially generalizing existing results for unweighted SBMs. Furthermore, we present a computationally tractable algorithm that is adaptive to the unknown edge weight densities in the sense that it achieves the same optimal error rate as if it had perfect knowledge of the edge weight densities.
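The Rényi divergence of order 1/2 that governs the error rate is easy to compute for discrete distributions: it equals $-2 \log \sum_i \sqrt{p_i q_i}$. The sketch below is a small illustration, assuming the edge weight distributions have been discretized onto a common finite support; the function name and the example probabilities are hypothetical.

```python
import numpy as np

def renyi_half(p, q):
    """Renyi divergence of order 1/2 between discrete distributions p and q
    on the same support: -2 * log(sum_i sqrt(p_i * q_i))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -2.0 * np.log(np.sum(np.sqrt(p * q)))

# Unweighted case: an edge is present with probability 0.3 within communities
# and 0.1 between communities (probabilities listed as [present, absent]).
print(renyi_half([0.3, 0.7], [0.1, 0.9]))

# Weighted case on three weight levels: same chance of a zero weight, but
# within-community edges favor the larger nonzero weight.
within = [0.7, 0.1, 0.2]
between = [0.7, 0.2, 0.1]
print(renyi_half(within, between))
```

In the second example the two distributions put the same mass on zero, so ignoring the weights would make within-community and between-community edges indistinguishable, while the divergence on the weights is still positive.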
Julia Fukuyama, Postdoctoral Research Fellow, Department of Computational Biology, Fred Hutchinson Cancer Research Institute
Dimension Reduction for Structured Variables
Studies of the microbiome, the complex communities of bacteria that live in and around us, present interesting statistical problems. In particular, bacteria are best understood as the result of a continuous evolutionary process, and methods to analyze data from microbiome studies should use the evolutionary history. Motivated by this example, I describe adaptive gPCA, a method for dimensionality reduction that uses the evolutionary structure both as a regularizer and to improve the interpretability of the low-dimensional space. I also discuss how adaptive gPCA applies to general variable structures, including variables structured according to a network, as well as implications for supervised learning and structure estimation.