Spring 2015 Colloquium Series

Tuesday, January 13, 2015

4:00 PM

***

4/27/15

Professor Jeff Gill, Department of Political Science, the Division of Biostatistics, and Department of Surgery (Public Health Sciences) at Washington University.

Dynamic Elicited Priors for Updating Covert Networks

The study of covert networks is plagued by the fact that individuals conceal their attributes and associations. To address this problem, we develop a technology for eliciting this information from qualitative subject matter experts to inform statistical social network analysis. We show how the information from the subjective probability distributions (SPDs) can be used as input to Bayesian hierarchical models for network data. In the spirit of "proof of concept," the results of a test of the technology are reported. Our findings show that human subjects can use the elicitation tool effectively, supplying attribute and edge information to update a network indicative of a covert one.

Jeff Gill is Professor of Statistics in the Department of Political Science, the Division of Biostatistics, and Department of Surgery (Public Health Sciences) at Washington University. His research applies Bayesian modeling and data analysis (decision theory, testing, model selection, elicited priors) to questions in general social science quantitative methodology, political behavior and institutions, medical/health data analysis especially cancers related to obesity, pediatric traumatic injury, and epidemiological measurement/data issues, using computationally intensive tools (Monte Carlo methods, MCMC, stochastic optimization, non-parametrics).

***

4/20/15

Doug Nychka, National Center for Atmospheric Research

Spatial methods that combine ideas from wavelets and lattices

Kriging is a non-parametric regression method used in geostatistics for estimating curves and surfaces and forms the core of most statistical methods for spatial data. In climate science these methods are very useful for estimating how climate varies over a geographic region when the observational data is sparse or the computer model runs are limited. A statistical challenge is to implement spatial methods for large sample sizes, a common feature of many geophysical problems. Here a new family of covariance models is proposed that expands the field in a set of basis functions and places a Gaussian Markov random field (GMRF) latent model on the basis coefficients. The idea, in contrast to fixed rank Kriging, is to use many basis functions organized on lattices. In addition, the basis functions add more smoothness and larger scale spatial dependence than a GMRF alone. A practical example is also presented for a subset of the North American Regional Climate Change Assessment Program model data. Here fields on the order of 10^4 observations are compared within the R data analysis environment.

***

4/13/15

Assistant Professor Qin Zhang, School of Informatics and Computing, Indiana University

Computational Models for Big Data

There has been a spectacular increase in the amount of data being collected and processed in many modern applications, and traditional algorithm theory developed in the classical random-access memory (RAM) model is no longer suitable for massive data analysis.  In this talk, I will try to address this issue by introducing several successful computational models for handling massive data, including the data stream model, the multiparty computation model and the distributed monitoring model.  I will highlight the primary features they capture and the central issues we need to explore.

Qin Zhang obtained his Ph.D. degree in Computer Science and Engineering from Hong Kong University of Science and Technology in 2010, and his B.Sc. degree in Computer Science from Fudan University in 2006. He was a postdoc at MADALGO, Aarhus University, Denmark, during 2010-2012, and at IBM Research Almaden during 2012-2013. His current research interests include algorithms for massive data, data streams, algorithms on distributed data, data structures, and database algorithms.
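
As a small illustration of the data stream model named in the abstract above, the sketch below processes items one at a time in a single pass while holding only a fixed number of counters. It uses the Misra-Gries frequent-items summary, a standard textbook streaming algorithm chosen here for illustration; the abstract does not say which algorithms the talk covered, and the function name and toy stream are hypothetical.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """One pass over `stream`, keeping at most k - 1 counters.

    Any item occurring more than n / k times in a stream of length n is
    guaranteed to remain as a key; the stored counts are underestimates.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Toy stream (illustrative data only): "a" appears 5 times out of 9.
stream = ["a", "b", "a", "c", "a", "d", "a", "b", "a"]
summary = misra_gries(stream, k=3)
print(summary)
```

Memory use depends only on k, not on the length of the stream, which is the kind of constraint the RAM model does not capture.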

***

3/30/15

Associate Professor Chunfeng Huang, Department of Statistics, Indiana University

Intrinsic Random Functions and Universal Kriging on the Circle

Intrinsic random functions (IRF) provide a versatile approach when the assumption of second-order stationarity is not met. Here, we develop the IRF theory on the circle with its universal kriging application.  Unlike IRF in Euclidean spaces, where differential operations are used to achieve stationarity, our result shows that low-frequency truncation of the Fourier series representation of the IRF is required for such processes on the circle. All of these features and developments are presented through the theory of reproducing kernel Hilbert space. In addition, the connection between kriging and splines is also established, demonstrating their equivalence on the circle.

***

3/24/15

Chester Ismay, Assistant Professor of Mathematics and Computer Science, Ripon College.

New Ideas in Teaching and Assessment in Introductory Statistics

The traditional introductory statistics curriculum is laden with formula after formula and often demands difficult leaps from students: from descriptive statistics to probability and then to inference. I will describe my attempts at avoiding this traditional style of the course in favor of a more student-centered course, largely based on computer simulation, to better understand inference. My course has focused on using classroom voting technologies such as i>Clicker and Google Forms in addition to the brand new Google Classroom environment. I use Google Forms for student homework to help get ideas for incorrect answers and then discuss these results in the following class period using clickers in the flipped classroom style. In addition, I will describe my focus on cumulative and frequent quizzing following the cognitive science literature on improved techniques for successful learning and retention of ideas.

***

2/23/15

Professor David Donoho, Department of Statistics, Stanford University

Optimal Shrinkage of Singular Values and Eigenvalues under 'Spiked' Big-Data Asymptotics

In the 1950s Charles Stein had the revolutionary insight that in estimating high-dimensional covariance matrices, the empirical eigenvalues ought to be dramatically outperformed by nonlinear shrinkage of the eigenvalues. This insight inspired dozens of papers in mathematical statistics over the next six decades.

In the last decade, mathematical analysts working in Random Matrix Theory made a great deal of progress on the so-called "spiked covariance model" introduced by Johnstone (2001).

This talk will show how this recent progress makes it now possible to elegantly derive the unique asymptotically admissible shrinkage rules in many problems of matrix de-noising and covariance matrix estimation.

The new rules are very concrete and simple, and they dramatically outperform heuristics such as scree plot truncation and bulk edge truncation which have been around for decades and are used in thousands of papers across all of science.

Joint work with Matan Gavish and Iain Johnstone.
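
To make the contrast between shrinkage and truncation concrete, here is a minimal sketch of one nonlinear singular-value shrinker. The formula is the Frobenius-loss rule as we understand it from the published Gavish-Donoho work on matrix de-noising; the abstract itself gives no formulas, so the rule, the noise normalization (entries with variance 1/columns), and the function name are all assumptions to check against the paper.

```python
import math


def frobenius_shrinker(y: float, beta: float) -> float:
    """Shrink one observed singular value y of a noisy matrix.

    beta is the aspect ratio rows / columns, assumed to lie in (0, 1].
    Assumed normalization: noise entries have variance 1 / columns, so
    pure-noise singular values concentrate below 1 + sqrt(beta).
    """
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must lie in (0, 1]")
    bulk_edge = 1.0 + math.sqrt(beta)
    if y <= bulk_edge:
        # Indistinguishable from noise: set to zero.
        return 0.0
    inside = (y * y - beta - 1.0) ** 2 - 4.0 * beta
    # max() guards against tiny negative values from rounding near the edge.
    return math.sqrt(max(inside, 0.0)) / y


# Bulk edge truncation would keep y = 3.0 unchanged; the shrinker reduces it.
shrunk = frobenius_shrinker(3.0, beta=1.0)
print(shrunk)
```

Unlike scree plot or bulk edge truncation, which keep each retained singular value at its observed size, this rule pulls every retained value toward zero by an amount that depends on the aspect ratio.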

***

2/12/15

Sanvesh Srivastava, SAMSI and Department of Statistical Science, Duke University

Expandable Factor Analysis

Modern data are characterized by their large sample size and complex dependence structure. Bayesian nonparametric methods provide a general probabilistic approach for flexible modeling of such patterns, but they are computationally expensive. In particular, sampling based approaches used for posterior computation (e.g., MCMC methods) scale poorly in the sample size and parameter dimension. This severely limits the applicability of Bayesian methods in massive data settings. Due to these limitations, Bayesian sparse factor models (a rich class of models that has received much attention recently) face problems in estimation of high-dimensional loadings matrices and adaptive selection of factors. To address both these issues, we introduce the expandable factor analysis (xFA) framework. Using a novel multiscale generalized double Pareto prior, the xFA framework adaptively selects the required number of factors and enables efficient estimation of low-rank and sparse loadings matrices through weighted L1-regularized regression. Integrated nested Laplace approximations are used for model averaging to accommodate uncertainty in the number of factors and hyperparameters. Theoretical support for the computational algorithm and estimated parameters is discussed, and xFA's performance is demonstrated on both simulated and genomic data.

This talk is based on joint work with David B. Dunson (Duke University) and Barbara E. Engelhardt (Princeton University).

***

2/4/15

Yuekai Sun, Institute for Computational and Mathematical Engineering, Stanford University

A one-shot approach to distributed sparse regression

Modern massive datasets are usually not stored centrally, but distributed across machines connected by a network. The main computational challenge in a distributed setting is harnessing the computational capabilities of all the machines while keeping communication costs low. We focus on the high-dimensional regression problem and devise an approach that requires only a single round of communication among the machines. The main idea is to average "debiased" lasso estimates. We show the approach recovers the convergence rate of the lasso as long as each machine has access to an adequate number of samples.

***

1/29/15

Irina Gaynanova, Department of Statistics, Cornell University

Multi-group classification via sparse discriminant analysis

It has been observed that classical multivariate analysis tools perform poorly on modern datasets due to the presence of spurious correlations and over-selection of relevant features. In the literature these problems have been addressed separately; however, their joint consideration can lead to significant improvements in terms of empirical performance and computational speed. In this talk I focus on multi-group discriminant analysis with motivating examples coming from genetics and metabolomics studies. The estimation problem is formulated using a convex optimization framework, which allows the use of a computationally efficient block-coordinate descent algorithm. In addition to the computational aspects, I will discuss the theoretical guarantees on the variable selection and classification consistency. Finally, the proposed methodology is used to aid drug discovery in the study of tuberculosis.

***

1/26/15

James Sharpnack, Mathematics Department, University of California, San Diego

Testing for Structured Normal Means

We will discuss the detection of patterns in images and graphs from a high-dimensional Gaussian measurement. This problem is relevant to many applications including detecting anomalies in sensor and computer networks, large-scale surveillance, co-expressions in gene networks, disease outbreaks, etc. Beyond its wide applicability, structured normal means detection serves as a case study in the difficulty of balancing computational complexity with statistical power. We will begin by discussing the detection of active rectangles in images and sensor grids. We will develop an adaptive scan test and determine its asymptotic distribution. We propose an approximate algorithm that runs in nearly linear time but achieves the same asymptotic distribution as the naive, quadratic run-time algorithm.

We will move on to the more general problem of detecting a well-connected active subgraph within a graph in the normal means context. Because the generalized likelihood ratio test is computationally infeasible, we propose approximate algorithms and study their statistical efficiency. One such algorithm that we develop is the graph Fourier scan statistic, whose statistical performance is characterized by the spectrum of the graph Laplacian. Another relaxation that we have developed is the Lovasz extended scan statistic (LESS), which is based on submodular optimization and whose performance is described using electrical network theory. We also introduce the spanning tree wavelet basis over graphs, a localized basis that reflects the topology of the graph. For each of these tests we compare their statistical guarantees to an information-theoretic lower bound.
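
The naive, quadratic run-time scan mentioned in the first paragraph can be sketched in one dimension. The code below is a simplified illustration, not the adaptive test from the talk: it scans intervals of a sequence rather than rectangles in an image, and it uses the plain normalized sum as the statistic. The function name and the toy signal are hypothetical.

```python
import math
from typing import List, Tuple


def naive_interval_scan(x: List[float]) -> Tuple[float, int, int]:
    """Return (statistic, start, end) maximizing sum(x[start:end]) / sqrt(end - start).

    Under independent N(0, 1) noise each normalized interval sum is itself
    standard normal, so an unusually large maximum suggests an active interval.
    All O(n^2) intervals are checked, using prefix sums for O(1) interval sums.
    """
    if not x:
        raise ValueError("x must be non-empty")
    prefix = [0.0]
    for value in x:
        prefix.append(prefix[-1] + value)
    best = (-math.inf, 0, 0)
    for start in range(len(x)):
        for end in range(start + 1, len(x) + 1):
            stat = (prefix[end] - prefix[start]) / math.sqrt(end - start)
            if stat > best[0]:
                best = (stat, start, end)
    return best


# Noise-free toy signal (illustrative data only) with an active block at indices 2..4.
signal = [0.0, 0.0, 2.0, 2.0, 2.0, 0.0, 0.0]
stat, start, end = naive_interval_scan(signal)
print(stat, start, end)
```

The double loop is what makes this quadratic in the sequence length; the abstract's contribution is an approximation that avoids it while keeping the same asymptotic distribution.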

***

1/13/15

Staci White, Department of Statistics, Ohio State University

A Monte Carlo Approach to Quantifying Model Error in Intractable Bayesian Hierarchical Models

In intractable Bayesian hierarchical models, the posterior distribution of interest is often replaced by a computationally efficient approximation that is instead used for inference. In this work, we propose a methodology that allows one to study the impact of such approximations by quantifying the model error, which we define to be the discrepancy between posterior distributions. This work provides a structure that can be used to analyze model approximations with regard to the reliability of inference and computational efficiency. We illustrate our approach through a spatial analysis of global sea surface temperature where covariance tapering is used to alleviate the computational demand associated with inverting a large, dense covariance matrix.

***

11/17/14

Assistant Professor David Crandall, School of Informatics and Computing, Indiana University

3d Reconstruction from Large Unstructured Photo Collections

Social photo-sharing websites like Flickr and Facebook now host hundreds of billions of images, and helping people navigate these vast social photo collections efficiently is a significant challenge. One approach is to use computer vision to estimate 3d scene models which can then be used both to navigate the photo collection and to visualize the scene itself. Recent work in reconstruction has successfully built 3d models from large unstructured collections of images from the web, but many techniques scale very poorly as the number of images grows. I'll present our approach that poses this problem as an inference task on a Markov Random Field (MRF) model. The formulation naturally incorporates various sources of (very noisy) evidence from photo matching and camera metadata, and yields better reconstruction results on many scenes while also scaling to larger photo collections. I'll show results on several large-scale datasets from the web and other sources, and discuss applications of the work in different domains.