Department of Biostatistics Quantitative Issues in Cancer Research Working Seminar 2022 - 2023
ABSTRACT: We consider the problem of borrowing information from historical controls to reduce the control group size and improve treatment effect estimation in subsequent randomized clinical trials. The key statistical challenge is to appropriately control the degree of information borrowing so that the historical data are relied upon when relevant but discounted when irrelevant. Popular methods attempt to compromise between these goals by using priors that allow the amount of historical borrowing to adapt to how similar the historical and new trial data appear. We propose the SPx method, standing for "synthetic prior with covariates", which extends existing approaches by accounting for different sources of heterogeneity between the historical data and the current trial data. The key statistical tool in SPx is model averaging, which allows diverse and dynamic borrowing. SPx is formulated to borrow trial-level summary statistics that are easily found in the literature, which is useful in practice when patient-level data are not available. We show that, when combined with a simple two-stage adaptive design, historical borrowing via SPx can substantially reduce the needed control group size compared to alternative methods while maintaining or improving frequentist power and Type I error control.
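For readers unfamiliar with dynamic borrowing via model averaging, the sketch below illustrates the general idea on a binary control endpoint using only trial-level summary statistics. It is a simplified conjugate Beta-Binomial illustration, not the SPx implementation, and it ignores the covariate adjustment that gives SPx its name; the function and inputs are hypothetical.

```python
# Simplified illustration of model averaging for dynamic borrowing
# (conjugate Beta-Binomial setup; NOT the SPx implementation).
from scipy.stats import betabinom

def averaged_control_rate(y_new, n_new, y_hist, n_hist, a=1.0, b=1.0):
    """Average a 'borrow' model (historical controls pooled via a conjugate
    update) and a 'no-borrow' model, weighting by the marginal likelihood of
    the new-trial control data (y_new responders out of n_new)."""
    m_no_borrow = betabinom.pmf(y_new, n_new, a, b)
    m_borrow = betabinom.pmf(y_new, n_new, a + y_hist, b + n_hist - y_hist)
    w = m_borrow / (m_borrow + m_no_borrow)          # posterior weight on borrowing
    mean_no_borrow = (a + y_new) / (a + b + n_new)
    mean_borrow = (a + y_hist + y_new) / (a + b + n_hist + n_new)
    return w * mean_borrow + (1 - w) * mean_no_borrow, w

# The weight w adapts: it shrinks toward 0 when the historical and new control
# rates look incompatible, discounting the historical data automatically.
rate, w = averaged_control_rate(y_new=12, n_new=40, y_hist=150, n_hist=500)
```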
ABSTRACT: Mendelian randomization (MR) harnesses genetic variants as instrumental variables (IVs) to study the causal effect of an exposure on an outcome. Two-sample MR capitalizes on summary statistics from genome-wide association studies, and it has achieved many successes in identifying genetically regulated risk exposures. In this talk, I will present our recent work on two types of exposure traits: molecular traits and complex traits. When considering gene expression as the exposure in transcriptome-wide MR (TWMR) analyses, the eQTLs (expression quantitative trait loci) may have pleiotropic effects or be correlated with variants that affect disease other than through expression. The presence of such invalid IVs leads to biased inference. Moreover, the number of eQTLs available as IVs for a gene is generally limited, making the detection of invalid IVs challenging. We propose methods for accurate TWMR inference in the presence of invalid IVs that leverage multi-tissue and/or multi-omics data and make the IV-specific pleiotropic effects identifiable. In studying a complex trait as the exposure, a challenge arises when IVs are associated with unmeasured confounders, i.e., when correlated horizontal pleiotropy (CHP) is present. Such confounders could be a shared gene or inter-connected pathways underlying the exposure and the outcome. We propose a method for estimating the causal effect while identifying IVs with CHP and accounting for estimation uncertainty. For those IVs, we map their cis-associated genes and enriched pathways to inform the shared genetic etiology underlying the exposure and the outcome.
ABSTRACT: In the nursing home setting, costs and healthcare utilization are two common outcomes of interest. However, cost data typically follow a semi-continuous distribution, with a large concentration of zero values and a right-skewed distribution of positive values. First, I will discuss the logistic-lognormal two-part model commonly used to analyze such data. I will then describe the Bayesian semiparametric framework for the random effects that we are proposing to extend this model's flexibility. Second, metrics often used to compare semi-continuous data do not account for the fact that the data arise from two distinct stochastic processes: one governing the occurrence of zeros and the other determining the observed value conditional on a non-zero response. I will discuss two-dimensional metrics we are developing that jointly assess whether more people than expected accrue non-zero costs and whether those who do accrue non-zero costs accrue more than expected.
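For orientation, a standard logistic-lognormal two-part mixed model for the cost $Y_{ij}$ of resident $i$ at occasion $j$ can be written as follows (notation illustrative):

$$ \Pr(Y_{ij} > 0 \mid b_{1i}) = \operatorname{logit}^{-1}\!\left(x_{ij}^{\top}\alpha + b_{1i}\right), \qquad \log Y_{ij} \mid \{Y_{ij} > 0,\, b_{2i}\} \sim \mathcal{N}\!\left(x_{ij}^{\top}\beta + b_{2i},\, \sigma^{2}\right), $$

with correlated subject-level random effects $(b_{1i}, b_{2i})$ linking the zero and positive parts; the proposed extension replaces the usual parametric assumption on these random effects with a more flexible Bayesian semiparametric prior.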
ABSTRACT: Analyzing multiple studies allows leveraging data from a range of sources and populations, but until recently there have been few methodologies for the joint unsupervised analysis of multiple high-dimensional studies. Recent methods can identify signals shared across datasets as well as signals specific to particular groups. However, especially as the number of datasets grows, we expect signals with more complex sharing patterns. We propose two flexible Bayesian multi-study latent feature models to address this problem. The first is a combinatorial multi-study factor analysis method, which identifies latent factors that can be shared by any combination of studies. We model the subsets of studies that share latent factors with an Indian Buffet Process, and demonstrate our method's utility not only in dimension reduction but also in covariance estimation. The second is an extension of this approach to multi-study non-negative matrix factorization, specialized to the characterization of mutational signatures from tumor genomes. We develop both fully unsupervised and semi-supervised approaches, which allow novel signatures to be discovered and known signatures to be recovered. Finally, we incorporate tumor-level covariates into the model to estimate associations with signatures, using a non-local spike-and-slab prior to enforce biologically plausible sparsity. We demonstrate the two approaches by integrating multiple datasets from breast and colorectal cancer, respectively.
ABSTRACT: Methods that leverage cross-phenotype associations, or pleiotropy, in risk prediction have been shown to achieve improved performance compared to single-phenotype analyses. We introduce mixWAS, a new cross-phenotype association test for mixed-data-type phenotypes tailored to the federated setting, in which multiple sites cannot share individual-level data due to privacy restrictions. Given the wide range of forms pleiotropy may take, mixWAS is designed to be powerful against both dense alternatives, where many phenotypes are associated with the SNP in question, and sparse alternatives, where most phenotypes are not associated with the SNP. In this talk, we present background and motivation for the method and power simulations comparing mixWAS to existing PheWAS methods, and we discuss preliminary results of applying mixWAS to real EHR data from eMERGE to identify pleiotropic SNPs.
ABSTRACT: Endometrial cancer (EC) is the most common gynecologic cancer in the United States, affecting about 1 in 37 women in their lifetime. Over the past few decades, the incidence and mortality of EC have been increasing for all racial-ethnic groups, with the highest rate of increase observed among racial-ethnic minority groups. African American women have on average a 55% higher 5-year mortality risk than white women, and like other minority groups are vulnerable to receiving suboptimal care due to differences in the cultural and socioeconomic environments in which they reside. Previous research has used factors such as educational attainment, household income, or occupation as proxies for socioeconomic status (SES); however, SES as a social determinant of health embodies multiple factors that in combination better explain inequities in health. We aim to take a multifactorial approach to examining the racial/ethnic and socioeconomic factors that lead to bias and disparities in the receipt of optimal care for EC patients. Using census-tract-level aggregate data and patient-level information from the Massachusetts Cancer Registry, we will apply a Multivariate Beta Mixture Model to cluster several social determinants of health to better understand the social dimension of EC care and treatment in Massachusetts.
ABSTRACT: Dimensionality reduction is a critical step in the analysis of single-cell RNA-seq (scRNA-seq) data. The standard approach is to apply a transformation to the count matrix followed by principal component analysis (PCA). However, this approach can spuriously indicate heterogeneity where it does not exist and mask true heterogeneity where it does exist. An alternative approach is to directly model the counts, but existing methods tend to be computationally intractable on large datasets and do not quantify uncertainty in the low-dimensional representation. To address these problems, we develop scGBM, a novel method for model-based dimensionality reduction of scRNA-seq data. scGBM employs a scalable algorithm to fit a Poisson bilinear model to datasets with millions of cells and quantifies the uncertainty in each cell's latent position. Furthermore, scGBM leverages these uncertainties to assess the confidence associated with a given cell clustering. On real and simulated single-cell data, we find that scGBM produces low-dimensional embeddings that better capture relevant biological information while removing unwanted variation. scGBM is publicly available as an R package.
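For context, a Poisson bilinear model of the general form described here (indices and parameterization illustrative, not necessarily scGBM's exact specification) is

$$ Y_{ij} \sim \mathrm{Poisson}(\mu_{ij}), \qquad \log \mu_{ij} = \alpha_i + \beta_j + \sum_{k=1}^{K} u_{ik}\, v_{jk}, $$

where $Y_{ij}$ is the count for gene $i$ in cell $j$, $\alpha_i$ and $\beta_j$ are gene- and cell-level intercepts, and the low-rank term $\sum_k u_{ik} v_{jk}$ plays the role that principal components play after an ad hoc transformation; the cell scores $v_{j\cdot}$ give the latent position of cell $j$.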
ABSTRACT: On July 15, 2021, with 58% of the population fully vaccinated, the start of a COVID-19 surge was observed in Puerto Rico. On July 22, 2021, the government of Puerto Rico began imposing a series of strict vaccine mandates. Two months later, over 70% of the population was vaccinated, more than in any US state, and laboratory-confirmed SARS-CoV-2 infections had dropped substantially. The decision to impose mandates, as well as current Department of Health recommendations related to boosters, were guided by the data and the effectiveness estimates presented here. Between December 15, 2020, when the vaccination process began in Puerto Rico, and October 15, 2021, 2,276,966 individuals were fully vaccinated against COVID-19. During this period, 112,726 laboratory-confirmed SARS-CoV-2 infections were reported. These data permitted us to quantify the outcomes of the immunization campaign and to compare the effectiveness of the mRNA-1273 (Moderna), BNT162b2 (Pfizer), and Ad26.COV2.S (J&J) vaccines. We obtained vaccination status, SARS-CoV-2 test results, and COVID-19 hospitalizations and deaths from the Department of Health. We fit statistical models that adjusted for time-varying incidence rates and age group to estimate vaccine effectiveness, as a function of time since vaccination, against laboratory-confirmed SARS-CoV-2 infection, COVID-19 hospitalization, and death. Two weeks after the final dose, the mRNA-1273, BNT162b2, and Ad26.COV2.S vaccines had an effectiveness of 90% (95% CI: 88–91), 87% (85–88), and 64% (58–69), respectively. After five months, effectiveness waned to about 70%, 50%, and 40%, respectively. We found no evidence that effectiveness differed after the Delta variant became dominant. For those infected, the vaccines provided further protection against COVID-19 hospitalization and death across all age groups, and this conditional effect did not wane over time. The mRNA-1273 and BNT162b2 vaccines were highly effective across all age groups, and they remained effective after five months although protection against SARS-CoV-2 infection waned. The Ad26.COV2.S vaccine was effective but to a lesser degree than the mRNA vaccines. Although, conditional on infection, protection against adverse outcomes did not wane, the waning in effectiveness resulted in decreased protection against serious COVID-19 outcomes over time.
ABSTRACT: The case-cohort study design is well-known as a cost-effective outcome-dependent sampling scheme for large observational studies. However, when interest lies in semi-competing risks, a setting where a non-terminal event and a terminal event (usually death) are investigated simultaneously, there are currently no statistical methods for the analysis of data arising from a case-cohort design. I will discuss the method we are developing for analyzing such data and the framework for designing such studies in resource-limited settings.
ABSTRACT: We outline the clinical trial landscape of glioblastoma multiforme, with suggestions on how to best use the clinical information provided by this population. We review the advantages, disadvantages, and barriers of various clinical trial approaches in the context of this disease. We further propose shared-control-arm approaches to experimentation and provide initial simulation results to demonstrate their benefits and trade-offs.
ABSTRACT: Extracorporeal membrane oxygenation (ECMO) is a life support technology used for the management of cardiopulmonary failure. ECMO circuits incorporate either roller or centrifugal blood pumps to achieve circuit flow and support cardiac output. Since 2010, the use of centrifugal pumps in pediatric medicine has increased with technological advances and ease of use; however, recent clinical and registry-based studies have found higher rates of complications in small children supported with centrifugal pumps relative to roller pumps. Using the Extracorporeal Life Support Organization (ELSO) registry, we evaluated the association of blood pump type with in-hospital mortality among smaller (<10 kg) and larger (≥10 kg) children. We implemented a combined imputation, inverse propensity weighting, and bootstrap approach to account for institutional variability in treatment patterns and obtain valid estimates of the treatment effect.
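A minimal sketch of how such a combined analysis can be organized is below. The data frame and the impute() and estimate_ipw() steps are hypothetical placeholders, and details such as the weighting model and the number of imputations differ from the registry analysis.

```python
# Minimal sketch of a combined imputation / inverse-propensity-weighting /
# bootstrap analysis (placeholders only; not the ELSO registry code).
import numpy as np

def bootstrap_weighted_effect(df, impute, estimate_ipw, n_boot=500, seed=1):
    """df: pandas DataFrame of children; impute() fills missing covariates,
    estimate_ipw() returns a weighted mortality contrast (centrifugal vs.
    roller). Re-imputing and re-weighting within each resample lets the
    percentile interval reflect both sources of uncertainty."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_boot):
        idx = rng.choice(len(df), size=len(df), replace=True)  # resample children
        estimates.append(estimate_ipw(impute(df.iloc[idx])))   # impute, weight, contrast
    return float(np.mean(estimates)), np.percentile(estimates, [2.5, 97.5])
```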
ABSTRACT: This talk will discuss the following paper: Emmons, K. M. and Colditz, G. A. (2017) Realizing the Potential of Cancer Prevention — The Role of Implementation Science. N Engl J Med. Massachusetts Medical Society. DOI: 10.1056/nejmsb1609101.
ABSTRACT: None Given.
ABSTRACT: State-level mortality data in the United States are subject to reporting delays of up to 18 weeks, causing gaps between reported and true mortality in the short term. Existing methods for correcting gaps from reporting delays do not appropriately account for seasonality or time trends in prior lags. We use state-level CDC and DPH data from January 2015 to December 2021 to develop a model that accurately predicts the true death count on a weekly basis, thereby reconciling the gap between reported and true deaths. Specifically, we built both a non-parametric model and an estimator based on empirical lag patterns that flexibly account for seasonality and trends to obtain unbiased estimates of the gaps with appropriate measures of uncertainty.
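As a rough illustration of the empirical-lag idea (not the proposed estimator, which also models seasonality and trends), one can estimate from historical reporting triangles the expected fraction of deaths reported within each lag and inflate recent counts accordingly. The arrays below are made up.

```python
# Rough illustration of nowcasting via empirical lag patterns
# (made-up numbers; not the proposed estimator).
import numpy as np

def completeness_by_lag(triangle):
    """triangle[t, d] = deaths for week t reported by d weeks after week t
    (rows are fully observed historical weeks; last column = eventual total)."""
    final = triangle[:, -1]
    return (triangle / final[:, None]).mean(axis=0)   # mean reported fraction per lag

def nowcast(reported_recent, completeness):
    """Scale up partially reported recent weeks; the i-th of k recent weeks
    (chronological order) has lag k-1-i, so the newest week has lag 0."""
    lags = np.arange(len(reported_recent))[::-1]
    return reported_recent / completeness[lags]

hist = np.array([[50, 80, 95, 100],
                 [40, 70, 90,  95],
                 [60, 90, 110, 115]], dtype=float)
pi = completeness_by_lag(hist)                      # fraction reported by lag 0,1,2,3
print(nowcast(np.array([58., 45., 20.]), pi))       # estimated true counts
```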
ABSTRACT: Large observational databases and cohort studies provide a rich source of data. However, specific risk factors of interest to researchers may not have been collected in resource-limited settings or may be difficult to ascertain due to cost constraints. The case-cohort study design is well-known as a cost-effective outcome-dependent sampling scheme for studies embedded within large cohorts. However, when interest lies in semi-competing risks, a setting where a non-terminal event and a terminal event (usually death) are investigated simultaneously, there are currently no statistical methods for the analysis of data arising from a case-cohort design. We propose a model for estimation and inference under this study design and present initial simulation results for this framework.
ABSTRACTS: Project 1: Isolation guidelines for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are largely derived from data collected prior to the emergence of the delta variant. We followed a cohort of ambulatory patients with postvaccination breakthrough SARS-CoV-2 infections with longitudinal collection of nasal swabs for SARS-CoV-2 viral load quantification, whole-genome sequencing, and viral culture. All delta variant infections in our cohort were symptomatic, compared with 64% of non-delta variant infections. Symptomatic delta variant breakthrough infections were characterized by higher initial viral load, longer duration of virologic shedding by PCR, greater likelihood of replication-competent virus at early stages of infection, and longer duration of culturable virus compared with non-delta variants. The duration of time since vaccination was also correlated with both duration of PCR positivity and duration of detection of replication-competent virus. Nonetheless, no individuals with symptomatic delta variant infections had replication-competent virus by day 10 after symptom onset or 24 hours after resolution of symptoms. These data support US CDC isolation guidelines as of November 2021, which recommend isolation for 10 days or until symptom resolution and reinforce the importance of prompt testing and isolation among symptomatic individuals with delta breakthrough infections.
Project 2: The Sustainable Development Goals set out an ambitious goal to end all forms of malnutrition by 2030. Although there has been a reduction in stunting (low height for age) and wasting (low weight for height), the prevalence of malnutrition in Ethiopia is still high. To improve nutritional outcomes, granular data are needed to determine key time points for growth and weight faltering. This study used longitudinal data collected by the Birhan maternal and child health study in the North Shewa Zone of Amhara, Ethiopia. We investigated growth and weight faltering at birth, four weeks, and six, 12, and 24 months. Our findings indicate that median population-level length and weight among children in this population are consistently below global standards from birth to age two. Growth velocity and weight gain were slowest, relative to global standards, during the neonatal period and after children reached six months of age. The prevalence of stunting was highest at age two (56.7%), whereas the prevalence of wasting was lower and peaked at birth (18.4%). The incidence of stunting increased over time, whereas that of wasting decreased. We also found substantial within-individual heterogeneity in anthropometric measurements. Overall, the evidence from this study highlights a chronically malnourished population relative to global standards, with much of the burden driven by growth and weight faltering during the pre- and neonatal periods as well as after six months of age. To end all forms of malnutrition, growth and weight faltering in populations such as young children in Amhara, Ethiopia, must be addressed.
ABSTRACT: This paper is a culmination of my work at Columbia SPH, Department of Epidemiology. Briefly, socioeconomic, racial, and ethnic gaps in the understanding of dense breast notification (DBN) legislation (passed in 2019) have been documented. However, the legislation's effect on women's cognitive and emotional appraisal of the notification information, which in turn affects screening behavior, remained unknown. In this project, we examined short- and long-term psychological responses to DBN and awareness of breast density by education, health literacy, nativity, and dominant language. We used data from a predominantly Latina and foreign-born New York City screening cohort (63% Spanish-speaking) of women ages 40-60. We found that associations of breast density awareness with breast cancer-related psychological outcomes differed by education and language. Women with lower educational attainment or language barriers could particularly benefit from outreach to clarify the implications of breast density and reduce uncertainty around risk and screening choices.
ABSTRACT: Causal inference methods based on electronic health record (EHR) databases must simultaneously handle confounding and missing data. A vast literature addresses these two problems separately, but surprisingly few papers attempt to address them simultaneously. In practice, when faced with simultaneous missing data and confounding, analysts may proceed by first imputing the missing data and subsequently using outcome regression or inverse-probability weighting (IPW) to address confounding. However, little is known about the performance of such ad hoc methods. In a recent paper, Levis et al. (2022) outline a robust framework for tackling these problems together and introduce a pair of semiparametric efficient estimators of the average treatment effect (ATE) that differ in the missing-data conditions they assume. In this work we present a series of simulations, motivated by a published EHR-based study of the long-term effects of bariatric surgery on weight outcomes, to investigate the new estimators of Levis et al. and compare them to existing ad hoc methods. While the latter perform well in certain scenarios, no single estimator is uniformly best. As such, the estimators of Levis et al. may serve as a reasonable default for causal inference when handling confounding and missing data together.
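For concreteness, the ad hoc comparator described above might be implemented roughly as follows. This is illustrative only: it imputes missing confounders and then applies IPW, and it is not one of the Levis et al. estimators.

```python
# Illustrative sketch of an ad hoc impute-then-IPW analysis of the ATE
# (missing confounders only; NOT the Levis et al. estimators).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

def impute_then_ipw(X, A, Y):
    """X: confounder matrix with NaNs; A: binary treatment (0/1); Y: outcome.
    Returns a Hajek-style IPW estimate of the average treatment effect."""
    X_imp = IterativeImputer(random_state=0).fit_transform(X)
    ps = LogisticRegression(max_iter=1000).fit(X_imp, A).predict_proba(X_imp)[:, 1]
    w = A / ps + (1 - A) / (1 - ps)                 # inverse-probability weights
    return (np.average(Y[A == 1], weights=w[A == 1])
            - np.average(Y[A == 0], weights=w[A == 0]))
```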
ABSTRACT: I will talk about some efforts in my lab to understand how the differentiation and proliferation dynamics of mutated blood stem cells deviate from those of healthy stem cells in certain types of blood cancers. We have been able to infer the history of expansion of the cancer in individual patients by reconstructing the lineage tree of the cancer cells from the pattern of naturally occurring somatic mutations in each cell's genome. I will also talk about how we can use synthetic biology to record each cell's lineage history in its own DNA, circumventing the need for naturally occurring somatic mutations.
ABSTRACT: The early detection of hepatocellular carcinoma (HCC) is critical to improving outcomes, since advanced HCC has limited treatment options. Blood-based biomarkers are a promising direction because they are more easily standardized and less resource-intensive than standard-of-care imaging. Combining multiple biomarkers is more likely to achieve the sensitivity required for a clinically useful screening algorithm, and the longitudinal trajectory of biomarkers contains valuable information that should be utilized. We have proposed two longitudinal biomarker algorithms. The first is a multivariate fully Bayesian algorithm (mFB) that models the joint biomarker trajectory and uses the estimated posterior risk of HCC to make screening decisions. The second is a multivariate parametric empirical Bayes (mPEB) screening approach that defines personalized thresholds for each patient at each screening visit to identify significant deviations that trigger additional testing with more sensitive imaging. The Hepatitis C Antiviral Long-term Treatment against Cirrhosis (HALT-C) trial provides a valuable source of data to study HCC screening algorithms. We study the performance of the mFB and mPEB algorithms applied to serum alpha-fetoprotein, a widely used HCC surveillance biomarker, and des-gamma-carboxy prothrombin, an HCC risk biomarker that is FDA approved but not used in practice in the United States.
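To convey the flavor of a parametric empirical Bayes screening rule, the univariate sketch below flags a visit whose log biomarker value exceeds a patient-specific predictive quantile under a simple random-intercept model. The model, constants, and values are illustrative; the actual mPEB algorithm is multivariate and differs in its details.

```python
# Univariate parametric empirical Bayes screening sketch (illustrative only):
# X_it = mu + b_i + e_it, with b_i ~ N(0, tau2) and e_it ~ N(0, sigma2).
import numpy as np
from scipy.stats import norm

def peb_threshold(history, mu, tau2, sigma2, alpha=0.01):
    """Patient-specific threshold: predictive mean given the patient's past
    values plus the (1 - alpha) predictive quantile."""
    n = len(history)
    shrink = tau2 / (tau2 + sigma2 / n)              # shrinkage toward population mean
    post_mean = mu + shrink * (np.mean(history) - mu)
    pred_var = sigma2 + tau2 * (1 - shrink)          # predictive variance of a new value
    return post_mean + norm.ppf(1 - alpha) * np.sqrt(pred_var)

history = np.log([3.1, 2.8, 3.4])   # hypothetical prior log biomarker values
flag = np.log(9.0) > peb_threshold(history, mu=1.2, tau2=0.4, sigma2=0.2)
```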
ABSTRACT: The use of patient-level information from previous studies, registries, or other external datasets can support the analysis of single-arm and randomized clinical trials to evaluate and test experimental treatments. However, the use of external datasets to analyze clinical trials can also compromise the scientific validity of the results due to selection bias, study-to-study differences, unmeasured confounding, and other distortion mechanisms. Therefore, the integration of external data into the analysis of a clinical trial requires appropriate methods that can detect or mitigate the risks of bias and potential distortion mechanisms. Several methods to leverage external datasets have been proposed, such as matching procedures and random-effects modelling, and different methods present distinct trade-offs between risk and efficiency. We conduct a comparative analysis of statistical methods to leverage external data for the analysis of randomized clinical trials. Multiple operating characteristics are evaluated across the candidate statistical methods, including power, control of false-positive results, and bias of the treatment effect estimates. We compare the statistical methods through a comprehensive set of simulation scenarios. We also compare the methods using a collection of datasets with patient-level information from several glioblastoma studies, including 1,388 patients in total, in order to provide specific recommendations for future glioblastoma trials.
ABSTRACT: Dimensionality reduction is a critical step in the analysis of single-cell RNA-seq (scRNA-seq) data. The standard approach is to apply a transformation to the count matrix followed by principal component analysis (PCA). However, this approach can induce spurious heterogeneity and mask true biological variability. An alternative approach is to directly model the counts, but existing methods tend to be computationally intractable on large datasets and do not quantify uncertainty in the low-dimensional representation. To address these shortcomings, we develop scGBM, a novel method for model-based dimensionality reduction of scRNA-seq data. scGBM employs a scalable algorithm based on weighted low rank approximations to fit a Poisson bilinear model to datasets with millions of cells. Furthermore, scGBM quantifies the uncertainty in each cell’s latent position and leverages these uncertainties to assess the confidence associated with a given cell clustering. On real and simulated single-cell data, we find that scGBM produces low-dimensional embeddings that better capture relevant biological information while removing unwanted variation.
ABSTRACT: Cancer is a complex disease that requires prompt and accurate diagnosis to increase survival. Traditional diagnostic methods (TDMs) consist mostly of blood and imaging tests requiring human interpretation for proper cancer staging. As a result, TDMs suffer from limited accuracy and subjectivity and can be time-consuming. AI methods are being introduced as a tool to improve the accuracy and efficiency of the cancer diagnosis process, which, in turn, can improve treatment plans and survival rates. The AI methods currently in use stem from traditional machine learning algorithms and have become increasingly complex in pursuit of further accuracy. Currently, AI cancer methods rely on deep-learning algorithms such as convolutional neural networks and natural language processing. In particular, many of these algorithms have proven accurate in imaging analysis and in histopathology of the disease, key aspects of an accurate cancer diagnosis. These AI techniques can analyze increasingly large volumes of medical data and detect patterns and characteristics of tumors, aiding personalized diagnosis and treatment plans. AI methods are popular in cancer research, and a few of them have made it to the clinical realm. However, there are still limitations and challenges associated with the use of AI in clinical settings. Some of these challenges include ensuring that an algorithm performs accurately and consistently in different environments, standardizing AI algorithms and processes, and addressing ethical and privacy concerns such as bias in AI. Despite these challenges, AI has the potential to revolutionize cancer management, and the field is rapidly advancing with new innovations and discoveries.
ABSTRACT: Precision medicine and rare disease research require the integration of complex and diverse data from various sources to enable better diagnosis and treatment. Knowledge graphs are a powerful tool for integrating and analyzing such data by representing information across biological scales in a structured way. In this presentation, we introduce the concept of knowledge graphs and discuss their potential applications to precision medicine and rare disease research. We present PrimeKG, a recent knowledge graph implementation, which can be used to identify potential drug targets, biomarkers, and patient subgroups. More generally, we discuss how knowledge graphs can enable data-driven hypothesis generation, enhance our understanding of complex biological systems and diseases, and accelerate the development of personalized medicine and novel therapies for rare diseases.
ABSTRACT: Subgroup analyses of randomized controlled trials (RCTs) constitute an important component of the drug development process in precision medicine. However, these subgroup analyses are typically complicated by small sample sizes for the subgroups of interest. This can lead to substantial uncertainty on the subgroup-specific treatment effects. In this work we explore the use of external control (EC) data to augment an RCT's subgroup analysis. We propose "harmonized" estimators of subpopulation-specific treatment effects that leverage EC data. Our approach modifies an initial estimate of the subgroup-specific treatment effects obtained through a user-supplied method (e.g., linear regression) applied to the RCT and EC data. The key idea is to alter these initial subgroup-specific effect estimates to make them coherent with a robust estimate, using only RCT data, of the average treatment effect in the overall enrolled population. In particular, we make the weighted average of the resulting harmonized subgroup-specific estimates match the average effect in the overall population. Through theory and simulation results we show that in important realistic settings the harmonized estimator can both eliminate the bias from using external data and greatly reduce the uncertainty of the subgroup analysis.
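One simple additive version of this matching constraint, shown purely for intuition (the actual harmonized estimators can distribute the adjustment unequally across subgroups), shifts each initial estimate by a common amount:

$$ \tilde{\tau}_g = \hat{\tau}_g + \Big(\hat{\tau}_{\mathrm{RCT}} - \sum_{g'} w_{g'}\, \hat{\tau}_{g'}\Big), \qquad \text{so that} \quad \sum_{g} w_g\, \tilde{\tau}_g = \hat{\tau}_{\mathrm{RCT}}, $$

where $\hat{\tau}_g$ are the initial subgroup-specific estimates obtained from the RCT and EC data, $w_g$ are the subgroup proportions in the enrolled RCT population (so $\sum_g w_g = 1$), and $\hat{\tau}_{\mathrm{RCT}}$ is the RCT-only estimate of the overall treatment effect.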
ABSTRACT: State-level mortality data in the United States are subject to reporting delays of up to 18 weeks, causing gaps between reported and true mortality in the short term. Existing methods for correcting gaps from reporting delays do not appropriately account for seasonality or time trends in prior lags. We use state-level CDC and DPH data from January 2015 to December 2021 to develop a model that accurately predicts the true death count on a weekly basis, thereby reconciling the gap between reported and true deaths. Specifically, we built both a non-parametric model and an estimator based on empirical lag patterns that flexibly account for seasonality and trends to obtain unbiased estimates of the gaps with appropriate measures of uncertainty.