Department of Biostatistics
Quantitative Issues in Cancer Research Working Seminar

2022 - 2023

Organizer: Dr. Daniel Schwartz

Schedule: Mondays, 1:00-1:50 p.m.
Zoom (unless otherwise notified)

Seminar Description
There are more than one million new cancer cases every year in the United States. An additional 5-8 million people are living with cancer. Research on cancer has greatly influenced the development of statistical methods in the past two decades and is likely to continue to do so in the future. This working seminar will be a forum for the discussion of current methodologic developments as well as cancer research having a strong quantitative basis. The working seminars will include expository reviews of special topics as well as the presentation of new research. All students and faculty are invited to attend and participate.

This working group will be meeting remotely until further notice. Contact Daniel Schwartz (email linked above) for Zoom ID and password.

September 12

Daniel Schwartz, Ph.D.
Research Fellow, Department of Biostatistics, Harvard T.H. Chan School of Public Health

"Historical Borrowing in Phase 2 Efficacy Trials with the SPx Model"
ABSTRACT: We consider the problem of borrowing information from historical controls to reduce the control group size and improve treatment effect estimation in subsequent randomized clinical trials. The key statistical challenge is to appropriately control the degree of information borrowing so the historical data are relied upon when relevant but discounted when irrelevant. Popular methods attempt to compromise between these goals by using priors that allow the amount of historical borrowing to adapt to how similar the historical and new trial data appear. We propose the SPx method, standing for "synthetic prior with covariates", which extends existing approaches by accounting for different sources of heterogeneity between historical data and current trial data. The key statistical tool in SPx is model averaging that allows diverse and dynamic borrowing. It is formulated to borrow trial-level summary statistics that are easily found in the literature. This may be useful for practical situations when patient-level data are not available. We show that when combined with a simple two-stage adaptive design, historical borrowing via SPx can substantially reduce the needed control group size compared to alternative methods while maintaining or improving the Frequentist power and Type I error rate.

September 19

Lin Chen, Ph.D.
Associate Professor of Biostatistics, Department of Public Health Sciences, University of Chicago

"Robust two-sample Mendelian randomization methods"
ABSTRACT: Mendelian randomization (MR) harnesses genetic variants as instrumental variables (IVs) to study the causal effect of exposure on outcome. Two-sample MR capitalizes on summary statistics from genome-wide association studies, and it has achieved many successes in identifying genetically regulated risk exposures. In this talk, I will present our recent work in studying two types of exposure traits, molecular traits and complex traits. When considering gene expression as exposure in transcriptome-wide MR (TWMR) analyses, the eQTLs (expression-quantitative-trait-loci) may have pleiotropic effects or be correlated with variants that have effects on disease not via expression. The presence of those invalid IVs would lead to biased inference. Moreover, the number of eQTLs as IVs for a gene is generally limited, making the detection of invalid IVs challenging. We propose methods for accurate TWMR inference in the presence of invalid IVs, by leveraging multi-tissue and/or multi-omics data and making identifiable the IV-specific pleiotropic effects. In studying complex traits as exposure, a challenge is when IVs are associated with unmeasured confounders, i.e., when correlated horizontal pleiotropy (CHP) arises. Such confounders could be a shared gene or inter-connected pathways underlying exposure and outcome. We propose a method for estimating causal effect while identifying IVs with CHP and accounting for estimation uncertainty. For those IVs, we map their cis-associated genes and enriched pathways to inform shared genetic etiology underlying exposure and outcome.

September 26

Jonathan Luu
Doctoral Student, Department of Biostatistics, Harvard University

"Analysis of semi-continuous clustered data with competing risks"
ABSTRACT: In the nursing home setting, costs and healthcare utilization are two common outcomes of interest. However, cost data typically follow a semi-continuous distribution, with a large concentration of zero values and a right-skewed distribution of positive values. First, I will discuss the logistic-lognormal two-part model commonly used to analyze these data. Furthermore, I will talk about the Bayesian semiparametric framework for the random effects we are proposing to extend this model’s flexibility. Second, metrics often used to compare semi-continuous data do not consider that the data arise from two distinct stochastic processes: one that governs the occurrence of zeros and the other determining the observed value conditional on it being a non-zero response. I will discuss two-dimensional metrics we are developing that jointly look at performance in terms of whether more people than expected are accruing non-zero costs and whether those who do accrue non-zero costs are accruing more such costs than expected.

October 3

Isabella Grabski
Doctoral Student, Department of Biostatistics, Harvard University

"Bayesian approaches to multi-study matrix decompositions of heterogeneous genomics datasets"
ABSTRACT: Analyzing multiple studies allows leveraging data from a range of sources and populations, but until recently, there have been limited methodologies to approach the joint unsupervised analysis of multiple high-dimensional studies. Recent methods can identify shared signals across datasets, as well as signals specific to particular groups. However, especially as the number of datasets grows, we expect the presence of signals with more complex sharing patterns. We propose two flexible Bayesian multi-study latent feature models to address this problem. The first is a combinatorial multi-study factor analysis method, which identifies latent factors that can be shared by any combination of studies. We model the subsets of studies that share latent factors with an Indian Buffet Process, and demonstrate our method's utility not only in dimension reduction but also in covariance estimation. The second is an extension of this approach to multi-study non-negative matrix factorization, specialized to application in the characterization of mutational signatures from tumor genomes. We develop both fully unsupervised and semi-supervised approaches, which allow novel signatures to be discovered and known signatures to be recovered. Finally, we incorporate tumor-level covariates into the model to estimate associations with signatures, using a non-local spike-and-slab prior to enforce biologically plausible sparsity. We demonstrate both approaches in integrating multiple datasets from breast and colorectal cancer, respectively.

October 17

Luke Benz
Doctoral Student, Department of Biostatistics, Harvard University

"mixWAS: A Federated Algorithm for Testing Variant Level Associations Across Mixed Type Phenotypes"
ABSTRACT: Methods that leverage cross-phenotype associations or pleiotropy in risk prediction have been shown to achieve improved performance compared to single-phenotype analyses. We introduce mixWAS, a new cross-phenotype association test for mixed data type phenotypes tailored to work with data in a federated setting, when multiple sites cannot share individual-level data due to privacy restrictions. Given the wide range of possible forms pleiotropy may take, mixWAS is designed to be powerful against both dense alternatives, where many phenotypes are associated with the SNP in question, and sparse alternatives, where the majority of phenotypes are not associated with the SNP. In this talk, we present background and motivation for the method and power simulations comparing mixWAS to existing PheWAS methods, and discuss preliminary results of applying mixWAS to real EHR data from eMERGE to identify pleiotropic SNPs.

October 24

Carmen B. Rodriguez
Doctoral Student, Department of Biostatistics, Harvard University

"A multivariate beta mixture model approach to examining racial/ethnic and socioeconomic disparities of care in endometrial cancer patients of Massachusetts"
ABSTRACT: Endometrial cancer (EC) is the most common gynecologic cancer in the United States, affecting 1 in 37 women each year. Over the past few decades, the incidence and mortality of EC have been increasing for all racial-ethnic groups, with the highest rate of increase observed among racial-ethnic minority groups. African American women have on average 55% higher 5-year mortality risk than white women, and like other minority groups are vulnerable to receiving suboptimal care due to differences in the cultural and socioeconomic environments in which they reside. Previous research has used factors such as educational attainment, household income, or occupation as proxies for socioeconomic status (SES). However, SES as a social determinant of health embodies multiple factors that in combination better explain inequities in health. We aim to take a multifactorial approach in how we examine racial/ethnic and socioeconomic factors leading to bias and disparities in the receipt of optimal care for EC patients. Using census tract aggregate level data and patient-level information from the Massachusetts Cancer Registry, we will apply a Multivariate Beta Mixture Model to cluster several social determinants of health to better understand the social dimension of EC care and treatment in Massachusetts.

October 31

Phillip Nicol
Doctoral Student, Department of Biostatistics, Harvard University

"Model based dimensionality reduction for single cell RNA-seq with generalized bilinear models"
ABSTRACT: Dimensionality reduction is a critical step in the analysis of single cell RNA-seq (scRNA-seq) data. The standard approach is to apply a transformation to the count matrix followed by principal component analysis (PCA). However, this approach can spuriously indicate heterogeneity where it does not exist and mask true heterogeneity where it does exist. An alternative approach is to directly model the counts, but existing methods tend to be computationally intractable on large data sets and do not quantify uncertainty in the low dimensional representation. To address these problems, we develop scGBM, a novel method for model based dimensionality reduction of scRNA-seq data. scGBM employs a scalable algorithm to fit a Poisson bilinear model to datasets with millions of cells and quantifies the uncertainty in each cell's latent position. Furthermore, scGBM leverages these uncertainties to assess the confidence associated with a given cell clustering. On real and simulated single cell data, we find that scGBM produces low dimensional embeddings that better capture relevant biological information while removing unwanted variation. scGBM is publicly available as an R package.

November 7

Mónica M. Robles Fontán
Doctoral Student, Department of Biostatistics, Harvard University

"Effectiveness Estimates of Three Covid-19 Vaccines Based on Observational Data from Puerto Rico"
ABSTRACT: On July 15, 2021, with 58% of the population fully vaccinated, the start of a COVID-19 surge was observed in Puerto Rico. On July 22, 2021, the government of Puerto Rico started imposing a series of strict vaccine mandates. Two months later, over 70% of the population was vaccinated, more than in any US state, and laboratory-confirmed SARS-CoV-2 infections had dropped substantially. The decision to impose mandates, as well as current Department of Health recommendations related to boosters, were guided by the data and the effectiveness estimates presented here. Between December 15, 2020, when the vaccination process began in Puerto Rico, and October 15, 2021, 2,276,966 individuals were fully vaccinated against COVID-19. During this period 112,726 laboratory-confirmed SARS-CoV-2 infections were reported. These data permitted us to quantify the outcomes of the immunization campaign and to compare effectiveness of the mRNA-1273 (Moderna), BNT162b2 (Pfizer), and Ad26.COV2.S (J&J) vaccines. We obtained vaccination status, SARS-CoV-2 test results, and COVID-19 hospitalizations and deaths from the Department of Health. We fit statistical models that adjusted for time-varying incidence rates and age group to estimate vaccine effectiveness, as a function of time since vaccination, against lab-confirmed SARS-CoV-2 infection and COVID-19 hospitalization and death. Two weeks after final dose, the mRNA-1273, BNT162b2, and Ad26.COV2.S vaccines had an effectiveness of 90% (95% CI: 88–91), 87% (85–88), and 64% (58–69), respectively. After five months, effectiveness waned to about 70%, 50%, and 40%, respectively. We found no evidence that effectiveness was different after the Delta variant became dominant. For those infected, the vaccines provided further protection against COVID-19 hospitalization and deaths across all age groups, and this conditional effect did not wane in time. The mRNA-1273 and BNT162b2 vaccines were highly effective across all age groups. They were still effective after five months although the protection against SARS-CoV-2 infection waned. The Ad26.COV2.S vaccine was effective but to a lesser degree compared to the mRNA vaccines. Although, conditional on infection, protection against adverse outcomes did not wane, the waning in effectiveness resulted in a decreased protection against serious COVID-19 outcomes across time.

November 14 (Canceled)

Amy Zhou
Doctoral Student, Department of Biostatistics, Harvard University

"Analysis of Semi-Competing Risks for Case-Cohort Study"
ABSTRACT: The case-cohort study design is well-known as a cost-effective outcome-dependent sampling scheme for large observational studies. However, when interest lies in semi-competing risks, a setting where a non-terminal event and a terminal event (usually death) are investigated simultaneously, there are currently no statistical methods for the analysis of data arising from a case-cohort design. I will discuss the method we are developing for analyzing such data and the framework for designing such studies in resource-limited settings.

November 21

Gopal Kotecha
Doctoral Student, Department of Biostatistics, Harvard University

"Shared Control Data Trial Networks"
ABSTRACT: We outline the clinical trial landscape of glioblastoma multiforme, with suggestions on how to best use the clinical information provided by this population. We review the advantages, disadvantages, and barriers of various clinical trial approaches in the context of this disease. We further propose shared-control arm approaches to experimentation, and provide initial simulation data to demonstrate their benefits and trade-offs.

November 28

Kimberly Greco
Doctoral Student, Department of Biostatistics, Harvard University

"Roller versus centrifugal blood pumps for pediatric extracorporeal membrane oxygenation (ECMO)"
ABSTRACT: Extracorporeal membrane oxygenation (ECMO) is a life support technology used for the management of cardiopulmonary failure. ECMO circuits incorporate either roller or centrifugal blood pumps to achieve circuit flow and support cardiac output. Since 2010, the use of centrifugal pumps in pediatric medicine has increased with technological advances and ease of use; however, recent clinical and registry-based studies have found higher rates of complications in small children supported with centrifugal pumps relative to roller pumps. Using the Extracorporeal Life Support Organization (ELSO) registry, we evaluated the association of blood pump with in-hospital mortality among smaller (<10 kg) and larger (≥10 kg) children. We implemented a combined imputation, inverse propensity weight, and bootstrap approach to account for institutional variability in treatment patterns and obtain valid estimates of treatment effect.

December 5

Jodeci Wheaden
Doctoral Student, Department of Biostatistics, Harvard University

"Realizing the Potential of Cancer Prevention - The Role of Implementation Science"
ABSTRACT: This talk will discuss the following paper: Emmons, K. M. and Colditz, G. A. (2017) Realizing the Potential of Cancer Prevention — The Role of Implementation Science. N Engl J Med. Massachusetts Medical Society. DOI: 10.1056/nejmsb1609101.

December 12

Cheng-Zhong Zhang, PhD
Assistant Professor of Biomedical Informatics, Harvard Medical School

"Revisiting the dynamic genome: genetic and epigenetic variation from the breakage-fusion-bridge cycles"
ABSTRACT: None Given.

January 30 (Canceled)

Anuraag Gopaluni
Doctoral Student, Department of Biostatistics, Harvard University

"Methods for accurate real-time estimates of death in the context of reporting delays"
ABSTRACT: State-level mortality data in the United States is subject to reporting delays of up to 18 weeks, causing gaps between reported and true mortality in the short-term. Existing methods for correcting gaps from reporting delays do not appropriately account for seasonality or time trends in prior lags. We use state-level CDC and DPH data from January 2015-December 2021 to develop a model that accurately predicts the true death count on a weekly basis, thereby reconciling the gap between reported and true deaths. Specifically, we built both a non-parametric model and an estimator based on empirical lag patterns that flexibly account for seasonality and trends to obtain unbiased estimates of gaps with appropriate measures of uncertainty.

February 6

Amy Zhou
Doctoral Student, Department of Biostatistics, Harvard University

"Semi-Competing Risks for the Case-Cohort Study Design"
ABSTRACT: The advent of large observational databases and cohort studies introduces a rich source of data. Often, specific risk factors of interest to researchers may either not have been collected in resource-limited settings or are difficult to ascertain due to cost constraints. The case-cohort study design is well-known as a cost-effective outcome-dependent sampling scheme for studies embedded within large cohort studies. However, when interest lies in semi-competing risks, a setting where a non-terminal event and a terminal event (usually death) are investigated simultaneously, there are currently no statistical methods for the analysis of data arising from a case-cohort design. We propose a model for estimation and inference for this study design and provide initial simulation data for this framework.

February 13

Jonathan Luu
Doctoral Student, Department of Biostatistics, Harvard University

"Duration of viral shedding with postvaccination SARS-CoV-2 infections & Linear growth and weight gain from birth to age two: A longitudinal cohort study in Amhara, Ethiopia"
ABSTRACTS: Project 1: Isolation guidelines for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are largely derived from data collected prior to the emergence of the delta variant. We followed a cohort of ambulatory patients with postvaccination breakthrough SARS-CoV-2 infections with longitudinal collection of nasal swabs for SARS-CoV-2 viral load quantification, whole-genome sequencing, and viral culture. All delta variant infections in our cohort were symptomatic, compared with 64% of non-delta variant infections. Symptomatic delta variant breakthrough infections were characterized by higher initial viral load, longer duration of virologic shedding by PCR, greater likelihood of replication-competent virus at early stages of infection, and longer duration of culturable virus compared with non-delta variants. The duration of time since vaccination was also correlated with both duration of PCR positivity and duration of detection of replication-competent virus. Nonetheless, no individuals with symptomatic delta variant infections had replication-competent virus by day 10 after symptom onset or 24 hours after resolution of symptoms. These data support US CDC isolation guidelines as of November 2021, which recommend isolation for 10 days or until symptom resolution and reinforce the importance of prompt testing and isolation among symptomatic individuals with delta breakthrough infections.

Project 2: The Sustainable Development Goals set out an ambitious goal to end all forms of malnutrition by 2030. Although there has been a reduction in stunting (low height for age) and wasting (low weight for height), the prevalence of malnutrition in Ethiopia is still high. To improve nutritional outcomes, granular data are needed to determine key time points for growth and weight faltering. The Birhan maternal and child health study in North Shewa Zone in Amhara, Ethiopia, collected the longitudinal data used in this study. We investigated growth and weight faltering at birth, at four weeks, and at six, 12, and 24 months. Our findings indicate that median population-level length and weight among children in this population are consistently below global standards from birth to age two. Growth velocity and weight gain were slowest compared to global standards during the neonatal period and after children reached six months of age. The prevalence of stunting was highest at age two (56.7%), whereas the prevalence of wasting was lower and peaked at birth (18.4%). Incidence of stunting increased over time whereas it decreased for wasting. We also found substantial within-individual heterogeneity in anthropometric measurements. Overall, the evidence from this study highlights a chronically malnourished population compared to global standards, with much of the burden driven by growth and weight faltering during the pre- and neonatal periods as well as after 6 months of age. To end all forms of malnutrition, growth and weight faltering in populations such as young children in Amhara, Ethiopia, needs to be addressed.

Last Update: February 2, 2023