Statistical Tools for Network Data: Prediction and Resampling

Li, Tianxi

2018

tianxili_1.pdf (4MB PDF)

Abstract

Advances in data collection and social media have produced a growing volume of network data in diverse areas, such as the social sciences, the internet, transportation, and biology. This thesis develops new principled statistical tools for network analysis, with emphasis on both appealing statistical properties and computational efficiency.

Our first project focuses on building prediction models for network-linked data. Prediction algorithms typically assume the training data are independent samples, but in many modern applications samples come from individuals connected by a network. For example, in adolescent health studies of risk-taking behaviors, information on the subjects' social network is often available and plays an important role through network cohesion, the empirically observed phenomenon of friends behaving similarly. Taking cohesion into account in prediction models should improve their performance. We propose a network-based penalty on individual node effects to encourage similarity between predictions for linked nodes, and show that incorporating it into prediction leads to improvement over traditional models, both theoretically and empirically, when network cohesion is present. The penalty can be used with many loss-based prediction methods, such as regression, generalized linear models, and Cox's proportional hazards model. Applications to predicting levels of recreational activity and marijuana usage among teenagers from the AddHealth study, based on both demographic covariates and friendship networks, are discussed in detail. Our approach to taking friendships into account can significantly improve predictions of behavior while providing interpretable estimates of covariate effects.

Resampling, data splitting, and cross-validation are powerful general strategies in statistical inference, but resampling from a network remains a challenging problem. Many statistical models and methods for networks need model selection and tuning parameters, which could be done by cross-validation if we had a good method for splitting network data; however, splitting network nodes into groups requires deleting edges and destroys some of the structure. Here we propose a new network cross-validation strategy based on splitting edges rather than nodes, which avoids losing information and is applicable to a wide range of network models. We provide a theoretical justification for our method in a general setting and demonstrate how it can be used in a number of specific model selection and parameter tuning tasks, with extensive numerical results on simulated networks. We also apply the method to the analysis of a citation network of statisticians and obtain meaningful research communities.

Finally, we consider the problem of community detection on partially observed networks. In practice, network data are often collected through sampling mechanisms, such as survey questionnaires, instead of direct observation. The noise and bias introduced by such sampling mechanisms can obscure the community structure and invalidate the assumptions of standard community detection methods. We propose a model that incorporates neighborhood sampling, in a form reflective of survey designs, into community detection for directed networks, since friendship networks obtained from surveys are naturally directed. We model the edge sampling probabilities as a function of both individual preferences and community parameters, and fit the model by a combination of spectral clustering and the method of moments. The algorithm is computationally efficient and comes with a theoretical guarantee of consistency. We evaluate the proposed model in extensive simulation studies and apply it to a faculty hiring dataset, discovering a meaningful hierarchy of communities among US business schools.
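
As a rough illustration of the network cohesion penalty described in the abstract, the sketch below fits a linear regression with one individual effect per node and a quadratic penalty that shrinks the effects of linked nodes toward each other. The abstract does not give the form of the penalty; the graph Laplacian form used here, the function name `fit_cohesion_regression`, and the parameters `lam` and `ridge` are assumptions made for illustration and may differ from the method in the thesis.

```python
import numpy as np

def fit_cohesion_regression(X, y, A, lam=1.0, ridge=1e-8):
    """Least squares with node effects alpha and a Laplacian penalty (illustrative).

    Minimizes ||y - alpha - X @ beta||^2 + alpha^T (lam * L + ridge * I) alpha,
    where L = D - A is the graph Laplacian of the symmetric adjacency matrix A.
    alpha^T L alpha is the sum of (alpha_i - alpha_j)^2 over edges, so linked
    nodes are pushed toward similar effects. X should not contain an
    intercept column, because alpha already plays that role.
    """
    n, p = X.shape
    L = np.diag(A.sum(axis=1)) - A            # graph Laplacian
    Z = np.hstack([np.eye(n), X])             # joint design for (alpha, beta)
    P = np.zeros((n + p, n + p))
    P[:n, :n] = lam * L + ridge * np.eye(n)   # penalize alpha only
    theta = np.linalg.solve(Z.T @ Z + P, Z.T @ y)
    return theta[:n], theta[n:]

# Toy usage: two fully connected groups of 20 nodes with different effects.
rng = np.random.default_rng(0)
n = 40
A = np.zeros((n, n))
A[:20, :20] = 1.0
A[20:, 20:] = 1.0
np.fill_diagonal(A, 0.0)
alpha_true = np.r_[np.full(20, 2.0), np.full(20, -2.0)]
beta_true = np.array([1.0, -1.0])
X = rng.normal(size=(n, 2))
y = alpha_true + X @ beta_true + 0.1 * rng.normal(size=n)

alpha_hat, beta_hat = fit_cohesion_regression(X, y, A, lam=10.0)
print("estimated beta:", beta_hat)
print("mean node effect per group:", alpha_hat[:20].mean(), alpha_hat[20:].mean())
```

The small `ridge` term is included only to keep the linear system well conditioned in this sketch. In practice `lam` would be chosen by some tuning procedure, which is not shown here.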

Subjects

Network

Statistical analysis

Prediction

Random network modeling

Types

Thesis

Handle

https://hdl.handle.net/2027.42/145894


Collections

Dissertations and Theses (Ph.D. and Master's)
