Department of Biostatistics
Quantitative Issues in Cancer Research Working Seminar
2021 - 2022
ABSTRACT: Cross-study replicability is a powerful model evaluation criterion that emphasizes generalizability of predictions. Recent work in multi-study learning investigated two approaches for training replicable prediction models: (1) merging all the datasets and training a single model and (2) cross-study ensembling, which involves training a separate model on each dataset and ensembling the resulting predictions. We study boosting in a multi-study setting and compare merging with cross-study ensembling in the presence of potential heterogeneity in predictor-outcome relationships across datasets. We provide theoretical guidelines for determining whether it is more beneficial to merge or to ensemble when boosting with linear base-learners. We analytically characterize and confirm via simulations a transition point beyond which ensembling outperforms merging.
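The merge-versus-ensemble comparison can be sketched numerically. The simulation below is a hypothetical illustration only: ordinary least squares stands in for boosting with linear base-learners (which it approximates at convergence), and the two studies, their coefficients, and sample sizes are all invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical studies whose true coefficients differ
# (between-study heterogeneity in predictor-outcome relationships).
def simulate_study(beta, n=200, sigma=1.0):
    X = rng.normal(size=(n, len(beta)))
    y = X @ beta + sigma * rng.normal(size=n)
    return X, y

beta1 = np.array([1.0, -0.5])   # study 1 coefficients (assumed)
beta2 = np.array([2.0, 0.5])    # study 2 differs: heterogeneity

(X1, y1), (X2, y2) = simulate_study(beta1), simulate_study(beta2)

def ols(X, y):
    # least squares as a stand-in for boosting with linear base-learners
    return np.linalg.lstsq(X, y, rcond=None)[0]

# (1) Merging: pool both datasets and fit a single model
beta_merge = ols(np.vstack([X1, X2]), np.concatenate([y1, y2]))

# (2) Ensembling: fit one model per study, average predictions
#     (equal weights; weighted/stacked ensembles are also possible)
beta_ens = 0.5 * (ols(X1, y1) + ols(X2, y2))

def mse(beta_hat):
    # average test MSE over fresh draws from each study
    tot = 0.0
    for beta in (beta1, beta2):
        Xt, yt = simulate_study(beta, n=1000)
        tot += np.mean((yt - Xt @ beta_hat) ** 2)
    return tot / 2

m_merge, m_ens = mse(beta_merge), mse(beta_ens)
```

Varying the gap between `beta1` and `beta2` (the degree of heterogeneity) while tracking the two MSEs is one way to observe a transition point empirically.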
ABSTRACT: The semi-competing risks framework is characterized by a non-terminal event subject to a terminal event within a single subject, where the terminal event acts as a competing risk. While semi-competing risks have been examined for scenarios such as nested case-control studies and full cohort studies, there is a lack of methodology in the published literature for case-cohort studies. The case-cohort design was introduced to reduce costs and increase efficiency of analysis, making it an attractive sampling strategy for studies. We will consider the unique aspects of case-cohort studies and discuss methods for parameter and variance estimation for an illness-death model under the case-cohort study design.
ABSTRACT: As a consequence of exposure to various mutagens, cells accumulate somatic alterations over their lifetime. These alterations become more distinct in tumor cells once they start growing at a much faster pace. Here we evaluate whole-genome sequencing data from patients diagnosed with precursor conditions of symptomatic newly diagnosed myeloma, newly diagnosed myeloma, and their paired relapse samples after the first line of treatment. We show how somatic mutational patterns in tumors at a given time predict the risk of disease progression, how they are affected by treatment, and how clonal and subclonal cells show different patterns.
ABSTRACT: In this talk, I will first describe our characterization of nullomers, short (11-18 nt) DNA sequences that are absent from a genome. We identify all possible nullomers and nullpeptides in the genomes and proteomes of thirty eukaryotes and demonstrate that a significant proportion of these sequences are under negative selection. We next characterize all possible single base pair mutations that can lead to the appearance of a nullomer in the human genome, observing a significantly higher number of mutations than expected by chance for specific nullomer sequences in transposable elements, likely due to their suppression. We also annotate nullomers that appear due to naturally occurring variants and show that a subset of them can be used to distinguish between different human populations. Moreover, we demonstrate that nullomers can also be created due to somatic mutations in cancer. We refer to the subset of nullomers that are found recurrently in one cancer type as neomers. We show that we can distinguish twenty-one different tumor types with higher accuracy than state-of-the-art methods using a neomer-based classifier. Refinement of this classifier via supervised learning identified additional cancer features with even greater precision. We also demonstrate that neomers can precisely diagnose cancer from cfDNA in liquid biopsy samples. Finally, we show that neomers can be used to detect cancer-associated non-coding mutations affecting gene regulatory activity.
ABSTRACT: The limited representation of minorities and disadvantaged populations in large-scale clinical and genomics research has become a barrier to translating precision medicine research into practice. Due to heterogeneity across populations, risk prediction models often underperform in these underrepresented populations and may therefore further exacerbate known health disparities.
In this paper, we propose a two-way data integration strategy that integrates heterogeneous data from diverse populations and from multiple healthcare institutions via a federated transfer learning approach. The proposed method can handle the challenging setting where sample sizes from different populations are highly unbalanced. With only a small number of communications across participating sites, the proposed method can achieve performance comparable to the pooled analysis where individual-level data are directly pooled together. We show that the proposed method improves the estimation and prediction accuracy in underrepresented populations, and reduces the gap in model performance across populations. Our theoretical analysis reveals how estimation accuracy is influenced by communication budgets, privacy restrictions, and heterogeneity across populations. We demonstrate the feasibility and validity of our methods through numerical experiments and a real application to a multi-center study, in which we construct genetic risk prediction models for Type II diabetes in an African-ancestry population.
ABSTRACT: In the nursing home setting, costs and healthcare utilization are two common outcomes of interest. However, cost data typically follow a semi-continuous distribution, with a large concentration of zero values and a right-skewed distribution of positive values. First, I will discuss the logistic-lognormal two-part model commonly used to analyze these data. Furthermore, I will talk about the Bayesian semiparametric framework for the random effects we are proposing to extend this model's flexibility. Second, metrics often used to compare semi-continuous data do not consider that the data arise from two distinct stochastic processes: one that governs the occurrence of zeros and another that determines the observed value conditional on a non-zero response. I will discuss two-dimensional metrics we are developing that jointly assess performance in terms of whether more people than expected accrue non-zero costs and whether those who do accrue non-zero costs accrue higher costs than expected.
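The two-part structure can be illustrated with a minimal sketch, assuming invented parameter values: part 1 models the probability of any cost (here an intercept-only logistic part, whose MLE is the observed proportion), and part 2 models log(cost) among the positives.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical semi-continuous cost data: a point mass at zero plus a
# lognormal distribution for positive costs (all parameters assumed).
n = 5000
p_nonzero = 0.4            # true probability of accruing any cost
mu, sigma = 6.0, 1.2       # lognormal parameters for positive costs

nonzero = rng.random(n) < p_nonzero
costs = np.where(nonzero, rng.lognormal(mu, sigma, n), 0.0)

# Part 1 (intercept-only logistic): MLE of P(cost > 0) is the proportion
p_hat = np.mean(costs > 0)

# Part 2 (lognormal on the positives): moments of log(cost)
logpos = np.log(costs[costs > 0])
mu_hat, sigma_hat = logpos.mean(), logpos.std(ddof=1)

# The marginal mean combines both parts: E[cost] = p * exp(mu + sigma^2/2)
mean_hat = p_hat * np.exp(mu_hat + sigma_hat**2 / 2)
```

In the model discussed in the talk, both parts would carry covariates and (Bayesian semiparametric) random effects; the sketch only shows why the two processes are estimated separately and then recombined.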
ABSTRACT: Recent developments in collecting individual-level phenotypic data in free-living settings through wearable devices and smartphones have afforded researchers the ability to sample high-fidelity data that concern human behavior and health. Such advancements are paving the way for researchers interested in studying social, behavioral, and cognitive phenotypes that have proven to have a temporal and contextual dependence. While digital phenotyping provides a robust and efficient mechanism for collecting temporally dense data on populations of interest, challenges remain in developing tools that utilize these data for identifying dynamic behavioral trends among heterogeneous subjects. In this talk, I will explore approaches towards advancing classical hidden Markov models to address these challenges (i.e., high-dimensionality, state representation, heterogeneity, temporal resolution) in an effort to reveal clinically meaningful modes of differentiation in subject behavior.
ABSTRACT: We outline the clinical trial landscape of Glioblastoma Multiforme, with suggestions on how to best use the clinical information provided by this population. We review the advantages, disadvantages, and barriers of various clinical trial approaches in the context of this disease. We further propose shared-control-arm approaches to experimentation and provide initial simulation data to demonstrate their benefits and trade-offs.
ABSTRACT: Panel germline testing allows for the efficient detection of multiple pathogenic variants in an individual. However, because the associations and clinical guidelines for harmful mutations and heritable diseases are not always well-established, it may not be beneficial to make panels arbitrarily large. We propose a multi-gene, multi-disease aggregate utility formula that allows the user to consider the addition or removal of each gene based on its own merits, using both quantitative measures and individualized utility costs. Our approach includes credible intervals to reflect the quality of the parameter estimates used to calculate the utility. We calculate the utilities to evaluate ATM, BRCA1, BRCA2, CHEK2, and PALB2 for possible inclusion in an opportunistic breast cancer panel. We further explore the behavior of our approach under different scenarios for a range of parameter values. Our findings suggest that rare, highly penetrant pathogenic variants tend to contribute positive net utilities, for a wide variety of user-specified utility costs and even when accounting for uncertainty in parameter estimation.
ABSTRACT: Cytometry by time of flight, or CyTOF, is a powerful alternative to flow cytometry for quantifying targets on the surface and interior of cells. CyTOF data require considerable cleaning because many observations are debris, doublets, or calibration beads. As with any technology, the data analysis is only as good as the data itself, so careful data cleaning is essential. One of the biggest data cleaning challenges is dealing with doublets, because it is difficult to distinguish between large cells and doublets. I will introduce an R package, cleanCytof, that uses a modeling and labeling approach to data cleaning that allows for more careful and customized cleaning of CyTOF data.
ABSTRACT: A challenge of public health surveillance is tracking indicators in real-time when there are reporting delays. State-level mortality data in the United States is subject to reporting delays of up to 18 weeks, causing gaps between reported and true mortality in the short-term. Existing methods for correcting gaps from reporting delays do not appropriately account for seasonality or time trends in prior lags. I am using state-level CDC and DPH data from January 2017-December 2021 to develop a model that accurately predicts the true death count on a weekly basis, thereby reconciling the gap between reported and true deaths. Specifically, I will introduce non-parametric models that flexibly account for seasonality and trends to obtain unbiased estimates of gaps with appropriate measures of uncertainty.
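The core of any delay correction can be sketched as follows. This is a deliberately simplified, hypothetical illustration (not the talk's non-parametric model): estimate, from fully observed historical weeks, the fraction of deaths reported within each lag, then divide recent partial counts by that estimated completeness. Seasonality and trend adjustments, which the talk addresses, are omitted.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical reporting process: weekly death counts, with a fixed
# fraction reported by each lag (monotonicity ignored for simplicity).
max_lag = 4
true_counts = rng.poisson(1000, size=52)             # 52 weeks of deaths
report_frac = np.array([0.5, 0.8, 0.95, 0.99, 1.0])  # cumulative by lag

# Reported counts at each lag (rows) for each week (columns)
reported = np.array([rng.binomial(true_counts, f) for f in report_frac])

# Completeness at each lag, estimated from the first 40 (settled) weeks:
# reported-so-far divided by the eventual totals
completeness = reported[:, :40].sum(axis=1) / reported[-1, :40].sum()

# Nowcast the most recent week, currently observed only at lag 0
nowcast = reported[0, -1] / completeness[0]
```

Dividing by completeness is an inverse-probability-of-reporting correction; replacing the constant `completeness` with a smooth function of season and calendar time is one route to the flexible models described above.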
ABSTRACT: Methods that leverage cross-phenotype associations, or pleiotropy, in risk prediction have been shown to achieve improved performance compared to single-phenotype analyses. Before jointly analyzing multiple phenotypes in risk prediction models, we hope to obtain a list of candidate variants that are associated with at least one of the phenotypes of interest. With data from multiple biobanks, we propose a novel federated algorithm for testing SNP-level associations across multiple mixed-type phenotypes, termed mixWAS.
ABSTRACT: Many single-cell RNA-seq experiments aim to identify cell types that are transcriptionally different between two or more biological conditions. Existing computational approaches to this problem are sensitive to bias induced by pseudoreplication, non-independence of cells belonging to the same sample or patient. We introduce pcDiffPop, a statistical method that uses linear mixed-effects models to find significantly perturbed cell types while controlling for the sample-level variability present in single-cell RNA-seq data. pcDiffPop operates by estimating the distance between the group means in a low-dimensional embedding space. Using both real and simulated single-cell datasets, we show that pcDiffPop is accurate and, unlike competing methods, robust in the presence of pseudoreplication bias. pcDiffPop is also computationally efficient (scalable to datasets with millions of cells) and capable of controlling for other possible confounders such as age or batch. We demonstrate pcDiffPop by using it to compare cell types between responders and non-responders to immunotherapy. On melanoma samples, we identify a macrophage signature associated with poor response to checkpoint inhibitors. We also demonstrate how pcDiffPop can be used to formally test whether two cell clusters are distinct. Our examples highlight the utility of pcDiffPop as a tool for the exploratory analysis of single-cell RNA-seq data.
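The key ideas (a distance between group means in a low-dimensional embedding, tested against sample-level rather than cell-level permutations) can be sketched on toy data. This is not the pcDiffPop implementation; the data-generating parameters and the choice of a plain PCA embedding are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy single-cell data: 8 samples (4 per condition), 50 cells each,
# with a per-sample random shift inducing within-sample correlation
# (the source of pseudoreplication bias) plus a condition effect.
n_samples, cells_per, genes = 8, 50, 100
sample_cond = np.repeat([0, 1], n_samples // 2)
sample_shift = rng.normal(0, 0.5, size=(n_samples, genes))

X, samp = [], []
for s in range(n_samples):
    mean = sample_shift[s] + sample_cond[s] * 1.0    # condition effect
    X.append(mean + rng.normal(size=(cells_per, genes)))
    samp += [s] * cells_per
X, samp = np.vstack(X), np.array(samp)

# Low-dimensional embedding: top 10 principal components
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
emb = Xc @ Vt[:10].T

def group_dist(labels_by_sample):
    # distance between condition means in the embedding
    cell_lab = labels_by_sample[samp]
    return np.linalg.norm(emb[cell_lab == 1].mean(0)
                          - emb[cell_lab == 0].mean(0))

# Permute SAMPLE labels, not cell labels, so that cells from one
# sample always move together -- this respects the non-independence.
obs = group_dist(sample_cond)
perm = [group_dist(rng.permutation(sample_cond)) for _ in range(200)]
pval = (1 + sum(d >= obs for d in perm)) / 201
```

Permuting cell labels instead would treat each cell as independent and badly inflate significance; the sample-level permutation above is the minimal guard against that bias.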
ABSTRACT: Using principles of Mendelian genetics, probability theory, and mutation-specific knowledge, Mendelian models identify those at high risk for carrying a heritable cancer-susceptibility mutation and assess future risk of cancer, based on family history. These quantitative risk measures can be used for research and to tailor personalized prevention programs. Our previously proposed PanelPRO model is a generalizable, computationally efficient Mendelian risk prediction framework that incorporates an arbitrary number of gene-cancer associations. However, there are pragmatic challenges in the implementation of such a comprehensive model. There may be uncertainty in estimating the necessary population-level model parameters for rare genes and cancers. Obtaining detailed patient family history information for a large number of cancers may also be impractical. Moreover, family history information is often incomplete or poorly gathered, or gathered from only a minority of patients. Motivated by the clinical context of pre-screening for a test of any cancer, we investigate a Mendelian model that aggregates information across genes and cancers, reducing the amount of patient information that needs to be collected and avoiding the need for more robust parameter estimation for rare genes and syndromes. This aggregate approach is evaluated through simulations and applied to two clinical cohorts.
ABSTRACT: One of the central tenets of biology is that our genetics—our genotype—influences the physical characteristics we manifest—our phenotype. But with more than 25,000 human genes and more than 6,000,000 common genetic variants mapped in our genome, finding associations between our genotype and phenotype is an ongoing challenge. Indeed, genome-wide association studies have found thousands of small effect size genetic variants that are associated with phenotypic traits and disease. The simplest explanation is that genes and genetic variants work together in complex regulatory networks that help define phenotypes and mediate phenotypic transitions. We have found that the networks, and their structure, provide unique insight into how genetic elements interact with each other and the structure of the network has predictive power for identifying critical processes in health and disease and for identifying potential therapeutic targets. I will touch on multiple examples illustrating the importance of network models, drawing on my work in cancer, in chronic obstructive pulmonary disease, and in the analysis of data from thirty-eight tissues provided by the Genotype-Tissue Expression (GTEx) project. We will use these to explore the development and progression of disease and new ways to identify therapeutics.
ABSTRACT: Endometrial cancer (EC) is the most commonly diagnosed gynecologic cancer, affecting 1 in 37 women each year. Over the past few decades, the incidence and mortality of EC have been increasing for all racial-ethnic groups; however, there are apparent disparities by racial-ethnic group and socioeconomic status (SES). African American women have on average a 55% higher 5-year mortality risk than white women and, like other minority groups, are vulnerable to receiving suboptimal care due to differences in the cultural and socioeconomic environments in which they reside. Previous research has used factors such as educational attainment, household income, or occupation as proxies for SES; however, SES as a social determinant of health embodies multiple factors that in combination better explain inequities in health. We aim to take a multifactorial approach to examining racial/ethnic and socioeconomic factors leading to bias and disparities in the receipt of optimal care for EC patients. Using census tract aggregate-level data from the Massachusetts Cancer Registry, we will apply a Multivariate Beta Mixture Model to cluster several social determinants of health to better understand the social dimension of EC care and treatment in Massachusetts.
ABSTRACT: Flexible estimation of heterogeneous treatment effects is central to precision medicine. While efforts in systematic data sharing and data curation initiatives have increased access to multiple datasets, existing methods for estimating heterogeneous treatment effects are largely rooted in theory based on a single study. We propose a general class of two-step algorithms for treatment effect estimation in multiple studies. The approach is easy to use and allows for flexible modeling with machine learning techniques in both steps. It is an extension of the R-learner and provides a unifying framework for multi-study heterogeneous treatment effect estimation.
ABSTRACT: The case-cohort study design is well-known as a cost-effective outcome-dependent sampling scheme for large observational studies. However, when interest lies in semi-competing risks, a setting where a non-terminal event and a terminal event (usually death) are investigated simultaneously, there are currently no statistical methods for the analysis of data arising from a case-cohort design. We propose a novel statistical method for analyzing such data and an innovative simulation-based framework for designing such studies in resource-limited settings.
ABSTRACT: Prostate cancer has one of the highest estimates of heritability of any malignancy, and genome-wide association studies have identified >260 single nucleotide polymorphisms associated with prostate cancer risk that validate in multiethnic populations. This talk will provide a background on the evidence from family history, twin studies, and genetic epidemiology studies of prostate cancer to date. It will discuss the translation of this multiethnic polygenic risk score for both prevention and early detection, and discuss future directions including the integration of germline variation in DNA repair pathways.
ABSTRACT: Racial inequities in clinical performance diminish overall system performance; however, quality assessments have rarely incorporated reliable measures of racial inequities. We studied care for over 1 million Medicare fee-for-service beneficiaries with cancer to assess the feasibility of calculating reliable practice-level measures of racial inequities in chemotherapy-associated emergency department (ED) visits and hospitalizations. Specifically, we used hierarchical models to estimate adjusted practice-level Black-White differences in these events and described differences across practices. As a second goal, we assessed how practice-level measures of these Black-White differences changed after adjustment for socioeconomic variables of patients treated in the practices.
ABSTRACT: In the nursing home setting, costs and healthcare utilization are two common outcomes of interest. However, cost data typically follow a semi-continuous distribution, with a large concentration of zero values and a right-skewed distribution of positive values. I will talk about existing methods that are used to model semi-continuous data before exploring ideas to expand to the competing risks setting. I will focus on a recent paper by Nevo et al. (2020) that models semi-competing risks data as a longitudinal bivariate process as a starting point. Additionally, I will talk about the Bayesian semiparametric framework for the random effects we are proposing to extend this model's flexibility.
ABSTRACT: Many fields of research, in particular psychology and psychiatry, study latent-state constructs that form the internal models governing human behavior over time. For most phenomena of interest, there is a natural time scale whereby a person transitions from one internal state to another. Generally, not only are these states unobserved, but the exact times at which between-state transitions occur are unknown. By using a combination of surveys (active data) and sensors (passive data), we can uncover the latent-state structure and estimate transition times of the latent state process. However, both active and passive data consume resources: frequent surveys can burden participants, and excessive data collection on their personal digital devices may interfere with their use of those devices in their daily lives. An important part of study design is therefore to determine how much and how frequently to collect each type of data and how to best summarize such data. In this work, we capture these latent state processes using state space models and study the optimal allocation of data collection resources (i.e., the sampling rate for both active and passive data) necessary to accurately uncover the underlying latent state process given the natural time scale of the behavioral phenomenon.
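How the sampling rate trades off against tracking accuracy can be seen in a minimal sketch, assuming a scalar linear-Gaussian state space model (an AR(1) latent state) in place of the richer models above; all parameter values are invented. The Kalman filter's error variance recursion makes the cost of sparser sampling explicit without needing any data.

```python
import numpy as np

# Latent state: x_t = phi * x_{t-1} + w_t, w_t ~ N(0, q)
# Observation:  y_t = x_t + v_t,       v_t ~ N(0, r), taken every k steps.
# The filter's error variance P depends only on (phi, q, r, k), so we can
# study the effect of the sampling rate k directly.
def filter_var(k, phi=0.95, q=0.1, r=0.5, T=2000):
    P, history = 1.0, []
    for t in range(T):
        P = phi**2 * P + q          # predict: variance grows between obs
        if t % k == 0:              # observe only every k-th step
            K = P / (P + r)         # Kalman gain
            P = (1 - K) * P         # update: observation shrinks variance
        history.append(P)
    return np.mean(history[T // 2:])  # average steady-state variance

dense, sparse = filter_var(k=1), filter_var(k=10)
```

Sparser sampling (larger `k`) lets the predicted variance grow longer between corrections, so `sparse > dense`; weighing that accuracy loss against participant burden is the allocation problem the talk formalizes.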
ABSTRACT: Modern clinical medicine has many scenarios where there are multiple treatments for the same indication. When these treatments can be delivered together, the large number of potential treatment combinations means that it can be difficult to learn about them all. Often the primary quantity of interest is the best treatment combination for a given patient population. To address this evaluation gap, we introduce a Bayes-adaptive trial design for the factorial setting. While traditional factorial designs balance the assignment probabilities to each arm, our design uses a decision theoretic framework to adjust the probabilities with which patients are randomized to treatment combination arms. Treatment assignment is carried out with the aim of maximizing expected utility at the end of the trial, according to some pre-defined utility function. We model the data with a Bayesian model and define a map from the model to the randomization probability. We discuss potential choices for utility functions and their resulting trials, as well as the computational approximations required to carry out the trial. We further discuss the asymptotic allocation of such a design, and apply our design to both artificial scenarios and data taken from real-world trials.
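One concrete map from a Bayesian model to randomization probabilities can be sketched for a 2x2 factorial. This is a hypothetical simplification: Beta-Binomial posteriors per combination arm with Thompson-sampling-style assignment (allocating in proportion to the posterior probability that each combination is best), whereas the design above optimizes a general pre-defined utility; the response rates are invented.

```python
import numpy as np

rng = np.random.default_rng(5)

# 2x2 factorial: each arm is a combination of two binary treatments.
# True response probabilities are assumed for illustration only.
true_p = {(0, 0): 0.30, (0, 1): 0.40, (1, 0): 0.45, (1, 1): 0.60}
arms = list(true_p)
succ = {a: 1 for a in arms}   # Beta(1, 1) priors on each arm's
fail = {a: 1 for a in arms}   # response probability

for _ in range(400):          # 400 sequentially randomized patients
    # Draw one sample from each arm's posterior; assign the patient to
    # the arm whose draw is largest (Thompson sampling). This allocates
    # each arm with its posterior probability of being best.
    draws = {a: rng.beta(succ[a], fail[a]) for a in arms}
    arm = max(draws, key=draws.get)
    if rng.random() < true_p[arm]:
        succ[arm] += 1
    else:
        fail[arm] += 1

post_mean = {a: succ[a] / (succ[a] + fail[a]) for a in arms}
n_alloc = {a: succ[a] + fail[a] - 2 for a in arms}
```

Replacing "probability of being best" with the expected utility of the end-of-trial decision recovers the decision-theoretic flavor of the proposed design; the asymptotic allocation then depends on the chosen utility function.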
Last Update: April 27, 2022