Department of Biostatistics
Quantitative Issues in Cancer Research Working Seminar

2019 - 2020

Organizer: Dr. Jill Lundell


Schedule: Mondays, 1:00-1:50 p.m.
Kresge 201 (unless otherwise notified)

Seminar Description
There are more than one million new cancer cases every year in the United States. An additional 5-8 million people are living with cancer. Research on cancer has greatly influenced the development of statistical methods in the past two decades and is likely to continue to do so in the future. This working seminar will be a forum for the discussion of current methodologic developments as well as cancer research having a strong quantitative basis. The working seminars will include expository reviews of special topics as well as the presentation of new research. All students and faculty are invited to attend and participate.


September 16

Matt Ploenzke
Doctoral Student, Department of Biostatistics, Harvard University

"CNN Design Principles for Genomic Sequence Applications"
ABSTRACT: Convolutional neural networks are a powerful tool for learning sequence-function relationships in the human genome. Often these computational approaches are implemented with architectures seemingly chosen at random, ultimately based on the model that attains, for example, the highest test set accuracy. However, there is little to no principled approach guiding CNN architectural design, including such parameters as model depth, model width, and convolutional activation function. We show the impact architecture has on both model accuracy/precision and the learned representations (e.g. motifs), and find novel modifications to existing architectures that significantly improve downstream model interpretation. We elucidate the utility of divergent activation functions to provide a preliminary set of design principles in the context of DNA sequence data and showcase the benefits of such an approach on ChIP-seq binding data downloaded from the ENCODE Project.

September 23

Margaux Hujoel
Doctoral Student, Department of Biostatistics, Harvard University

"Identifiability of Cancer-Resistance Genotypes"
ABSTRACT: There is an open question as to whether cancer-resistant genotypes exist (Klein 2009 PNAS). Although individuals who don't develop cancer may lack mutations that make them susceptible, there is an alternative hypothesis that some of these individuals may in fact have genotypes that make them resistant to cancer. Rather than viewing cancer resistance as a "mirror image" of susceptibility, we aim to study cancer resistance as a distinct entity. Through simulation studies, we try to address two main questions. If a small group of individuals were resistant to cancer, would we be able to identify that such a group existed? And if so, could we estimate the prevalence of this resistance genotype?

September 30

Daniel Li
Doctoral Student, Department of Biostatistics, Harvard University

"Hospital Volume and Hospital Rankings"
ABSTRACT: The Centers for Medicare and Medicaid Services penalizes hospitals with the worst quality measures, but small hospital volumes may affect the accuracy of such hospital profiling. Therefore, we studied the relationship between small hospital volume and one such quality measure, the standardized infection ratio, in correctly identifying poorly performing hospitals. We derived a formula and created an algorithm for calculating power and false positive rates of the standardized infection ratio ranking a hospital in the lowest quartile, and we conducted simulation analyses based on data from HCA Healthcare (2014-2016). These data covered surgical site infections in colon surgery patients among a nationally diverse group of hospitals. We found that as hospital volume increased, power generally increased and false positive rates generally decreased. Outcomes with overall proportions closer to 0.50 and outcomes with larger variability in hospital event rates required smaller hospital volumes to achieve a minimum power or control for a maximum false positive rate. Minimum hospital volumes and predicted events criteria are required to make evaluating hospitals reliable, and these criteria should vary by overall event rate and between-hospital variability. This modification to current practice can help prevent unmerited financial penalties for hospitals.

October 7

Maya Ramchandran
Doctoral Student, Department of Biostatistics, Harvard University

"Tree-Weighting for Multi-Study Ensemble Learners"
ABSTRACT: Multi-study learning uses multiple training studies, separately trains classifiers on each, and forms an ensemble with weights rewarding members with better cross-study prediction ability. This article considers novel weighting approaches for constructing tree-based ensemble learners in this setting. Using Random Forests as a single-study learner, we compare weighting each forest to form the ensemble with extracting the individual trees trained by each Random Forest and weighting them directly.

We find that incorporating multiple layers of ensembling in the training process by weighting trees increases the robustness of the resulting predictor. Furthermore, we explore how ensembling weights correspond to tree structure, to shed light on the features that determine whether weighting trees directly is advantageous. Finally, we apply our approach to genomic datasets and show that weighting trees improves upon the basic multi-study learning paradigm.

October 21

Jill Lundell
Research Fellow, Department of Biostatistics, Harvard T.H. Chan School of Public Health

"Exploring the Effects of Heritability in Genome-wide Association Studies"
ABSTRACT: Heritability is the proportion of phenotypic variation due to genotype. GWAS seeks to find SNPs that are related to a particular phenotype. Despite a likely connection between heritability and the ability to identify potentially important SNPs, heritability is often overlooked in GWAS. I will share some observations I have made using simulated data to examine the ability of different GWAS methods to find functional SNPs. GWAS were done using data with weak, moderate, and strong heritability. Some of the methods tested are conditional logistic regression, LASSO, and random forests.

October 28 (FXB G10)

Amy Zhou
Doctoral Student, Department of Biostatistics, Harvard University

"BioBERT: Adaptation of BERT for Biomedical Text Mining (Journal Club)"
ABSTRACT: With the proliferation of biomedical documents, natural language processing (NLP) is a useful, increasingly relevant methodology for extracting information from biomedical texts. I will be presenting the paper on BioBERT, a domain-specific language model pre-trained and tailored for biomedical texts. We will discuss the differences between BioBERT and BERT (Bidirectional Encoder Representations from Transformers), including details about the pre-training and fine-tuning process, and overall performance on biomedical text mining tasks.

November 4

Cathy Wang
Doctoral Student, Department of Biostatistics, Harvard University

"A Comprehensive Validation of Models for Prediction of Mismatch Repair Gene Mutations"
ABSTRACT: Lynch syndrome is the most common colorectal cancer syndrome caused by germline mutations in the mismatch repair (MMR) genes. Prediction models including Leiden, MMRpredict, PREMM5, and MMRpro are used to predict the probability of an individual carrying a mutation in the MMR genes. Recently, MMRpro was updated with new penetrance estimates of the MMR genes on colorectal cancer, and, to date, these have not been validated. The purpose of this study is to evaluate the predictive performance of the four models in individuals with a family history of colorectal and endometrial cancer. We performed a validation study of the four models on 565 individuals from clinic-based families in the United States. Risk prediction based on the four models was compared to germline testing results and evaluated for discrimination, calibration, and predictive accuracy. These models can serve as useful tools for identifying individuals who are at high risk of carrying a Lynch syndrome mutation. We recommend that clinicians and genetic counselors use these models in an informed manner to better implement effective management and targeted surveillance strategies for individuals with Lynch syndrome.

November 18

Eric Dunipace
Doctoral Student, Department of Biostatistics, Harvard University

"Interpretable Posterior Summaries Using the Wasserstein Distance"
ABSTRACT: In the current computing age, models can have hundreds or even thousands of parameters; however, such large models decrease the ability to interpret and communicate individual parameters. Reducing the dimensionality of the parameter space in the estimation phase is a commonly used technique, but less work has focused on selecting subsets of the parameters to focus on for interpretation, especially in Bayesian settings. To address this gap, we introduce a new method that uses the 2-Wasserstein distance to select a subset of the parameter space for interpretation. After estimating a posterior distribution, users can budget how many parameters they wish to interpret, and our method selects a reduced posterior of the desired dimension that minimizes the distance to the full posterior. We provide simulation results demonstrating the effectiveness of the proposed method and apply the method to cancer data.

November 25 (Canceled)

Eric Cohn
Doctoral Student, Department of Biostatistics, Harvard University

"Micro-Randomized Controlled Trials and Cancer Research"
ABSTRACT: The micro-randomized controlled trial (micro-RCT) is an experimental design used to develop and evaluate just-in-time, adaptive interventions. These designs have diverse applications—from rapid-cycle evaluations of mobile health interventions to optimizing cellphone-administered surveys to maximize response rates—many of which are relevant to cancer research. In this talk, I will present an overview of the micro-RCT design, motivate its use in public health and cancer research more specifically, and summarize some methodological and statistical considerations relevant to these designs.

December 2

Jonathan Luu
Doctoral Student, Department of Biostatistics, Harvard University

"CONOS: Clustering on Network of Samples (Journal Club)"
ABSTRACT: Single-cell RNA sequencing (scRNA-seq) is a powerful approach to learning about cancer at the resolution of individual cells. However, scRNA-seq studies often include multiple individuals, conditions, and tissues, which present technical and conceptual analysis challenges. Furthermore, recent consortium efforts have been launched to generate atlases of these single-cell datasets with thousands of systematically different samples. Recent alignment methods, although flexible, were designed for relatively small samples. CONOS presents a graphical approach to digest these large heterogeneous samples, identify recurrent cell populations, and analyze the relationships between cells in different samples.

December 9

Alejandro Reyes
Research Fellow, Department of Biostatistics, Dana-Farber Cancer Institute and Harvard T.H. Chan School of Public Health

"A Survey of Genome Topology in Colorectal Cancer Reveals Large-scale Compartmental Changes that Restrain Malignant Progression"
ABSTRACT: Widespread changes to DNA methylation and histone modifications are well documented in cancer, but the fate of higher-order chromosomal structure remains obscure. Here we integrated topological maps for colon tumors and normal colons with epigenetic and transcriptional data to characterize alterations to chromatin loops, topologically-associated domains and large-scale compartments. Tumors exhibit profound compartmental reorganization, losing the normal spatial partitioning between the canonical open and closed genome compartments. This reorganization is accompanied by compartment-specific DNA hypomethylation and chromatin state changes. We also identified a novel compartment at the interface between canonical compartments distinguished by unique chromatin state and tumor-associated changes. Remarkably, the compartmental shifts are actually shared features of cells that have accumulated excess divisions. They likely restrain malignant progression by repressing genes linked to stem cell proliferation and invasion, while inducing anti-tumor immunity genes. Our findings call into question the conventional view that tumor-associated epigenomic changes are primarily oncogenic.

December 16

Gopal Kotecha
Doctoral Student, Department of Biostatistics, Harvard University

"Uncertainty Directed Factorial Clinical Trials"
ABSTRACT: Modern clinical medicine has seen an increase in the delivery of combinations of treatments, but many of these combinations and potential alternatives have not been evaluated in a trial setting. To address this evaluation gap, we introduce a Bayes-adaptive trial design for the factorial setting. This class of trial designs uses a decision theoretic framework to randomize patients to treatments. Treatment assignment is carried out with the aim of maximizing expected utility in order to determine the best treatment combination from a large space of potential treatment combinations at the end of the trial. We use Bayesian models to model the data and explicit information metrics tailored to the problem to provide accuracy measures of the final selection of optimal treatment combinations. The proposed approach and the simulated scenarios used in the evaluation of these factorial Bayesian uncertainty directed (BUD) designs are motivated by a pragmatic trial at our institution that uses electronic health record based decision support to reduce inappropriate prescribing for older adults.

February 3

Eric Cohn
Doctoral Student, Department of Biostatistics, Harvard University

"Micro-Randomized Controlled Trials and Cancer Research"
ABSTRACT: The micro-randomized controlled trial (micro-RCT) is an experimental design used to develop and evaluate just-in-time, adaptive interventions. These designs have diverse applications—from rapid-cycle evaluations of mobile health interventions to optimizing cellphone-administered surveys to maximize response rates—many of which are relevant to cancer research. In this talk, I will present an overview of the micro-RCT design, motivate its use in public health and cancer research more specifically, and summarize some methodological and statistical considerations relevant to these designs.

February 10

Margaux Hujoel
Doctoral Student, Department of Biostatistics, Harvard University

"Identifiability of Cancer-Resistance Genotypes"
ABSTRACT: There is an open question as to whether cancer-resistant genotypes exist (Klein 2009 PNAS). Although individuals who don't develop cancer may lack mutations that make them susceptible, there is an alternative hypothesis that some of these individuals may in fact have genotypes that make them resistant to cancer. Rather than viewing cancer resistance as a "mirror image" of susceptibility, we aim to study cancer resistance as a distinct entity. Through simulation studies, we try to address two main questions. If a small group of individuals were resistant to cancer, would we be able to identify that such a group existed? And if so, could we estimate the prevalence of this resistance genotype?

February 24

Daniel Li
Doctoral Student, Department of Biostatistics, Harvard University

"Regularized Best Subset Selection: Concepts for Polygenic Risk Scores and Prediction"
ABSTRACT: We will talk about recent developments with regularized best subset selection. We will go over empirical and theoretical results, and discuss important ideas and concepts for applications with a focus on polygenic risk scores.

March 2

Cathy Wang
Doctoral Student, Department of Biostatistics, Harvard University

"Multi-Study Semi-Supervised Learning: A First Look"
ABSTRACT: In many machine learning applications, the gold standard label is difficult or expensive to obtain. Semi-supervised learning (SSL) methods leverage unlabeled data to improve a model’s performance when only limited labeled data is available. We investigate the replicability of the performance of SSL classifiers in a multi-study setting.

March 9

To Be Announced


"Talk Title TBD"
ABSTRACT: None Given

March 23

To Be Announced


"Talk Title TBD"
ABSTRACT: None Given

March 30

To Be Announced


"Talk Title TBD"
ABSTRACT: None Given

April 6

To Be Announced


"Talk Title TBD"
ABSTRACT: None Given

April 13

To Be Announced


"Talk Title TBD"
ABSTRACT: None Given

April 20

To Be Announced


"Talk Title TBD"
ABSTRACT: None Given

April 27

To Be Announced


"Talk Title TBD"
ABSTRACT: None Given

May 4

To Be Announced


"Talk Title TBD"
ABSTRACT: None Given

May 11

To Be Announced


"Talk Title TBD"
ABSTRACT: None Given



Last Update: February 25, 2020