Department of Biostatistics
Quantitative Issues in Cancer Research Working Seminar

2019 - 2020

Organizer: Dr. Jill Lundell


Schedule: Mondays, 1:00-1:50 p.m.
Kresge 201 (unless otherwise notified)

Seminar Description
There are more than one million new cancer cases every year in the United States. An additional 5-8 million people are living with cancer. Research on cancer has greatly influenced the development of statistical methods in the past two decades and is likely to continue to do so in the future. This working seminar will be a forum for the discussion of current methodologic developments as well as cancer research having a strong quantitative basis. The working seminars will include expository reviews of special topics as well as the presentation of new research. All students and faculty are invited to attend and participate.


September 16

Matt Ploenzke
Doctoral Student, Department of Biostatistics, Harvard University

"CNN Design Principles for Genomic Sequence Applications"
ABSTRACT: Convolutional neural networks are a powerful tool for learning sequence-function relationships in the human genome. Often these computational approaches are implemented with architectures seemingly chosen at random, with the final choice based on the model that attains, for example, the highest test set accuracy. However, there is little-to-no principled guidance for CNN architectural design, including such parameters as model depth, model width, and convolutional activation function. We show the impact architecture has on both model accuracy/precision and the learned representations (e.g. motifs), and find novel modifications to existing architectures that significantly improve downstream model interpretation. We elucidate the utility of divergent activation functions to provide a preliminary set of design principles in the context of DNA sequence data and showcase the benefits of such an approach on ChIP-seq binding data downloaded from the ENCODE Project.

September 23

Margaux Hujoel
Doctoral Student, Department of Biostatistics, Harvard University

"Identifiability of Cancer-Resistance Genotypes"
ABSTRACT: There is an open question as to whether cancer-resistant genotypes exist (Klein 2009 PNAS). Although individuals who don't develop cancer may lack mutations that make them susceptible, there is an alternative hypothesis that some of these individuals may in fact have genotypes that make them resistant to cancer. Rather than viewing cancer resistance as a "mirror image" of susceptibility, we aim to study cancer resistance as a distinct entity. Through simulation studies, we try to address two main questions. If a small group of individuals were resistant to cancer, would we be able to identify that such a group existed? And if so, could we estimate the prevalence of this resistance genotype?

September 30

Daniel Li
Doctoral Student, Department of Biostatistics, Harvard University

"Hospital Volume and Hospital Rankings"
ABSTRACT: The Centers for Medicare and Medicaid Services penalizes hospitals with the worst quality measures, but small hospital volumes may affect the accuracy of such hospital profiling. Therefore, we studied the relationship between small hospital volume and one such quality measure, the standardized infection ratio, in correctly identifying poorly performing hospitals. We derived a formula and created an algorithm for calculating power and false positive rates of the standardized infection ratio ranking a hospital in the lowest quartile, and we conducted simulation analyses based on data from HCA Healthcare (2014-2016). These data covered surgical site infections in colon surgery patients among a nationally diverse group of hospitals. We found that as hospital volume increased, power generally increased and false positive rates generally decreased. Outcomes with overall proportions closer to 0.50 and outcomes with larger variability in hospital event rates required smaller hospital volumes to achieve a minimum power or to stay below a maximum false positive rate. Minimum hospital volumes and predicted events criteria are required to make evaluating hospitals reliable, and these criteria should vary by overall event rate and between-hospital variability. This modification to current practice can help prevent unmerited financial penalties for hospitals.
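
The link between hospital volume and power can be illustrated with a small Monte Carlo sketch. This is not the formula or algorithm from the talk: the function name `flag_power`, the true event rates, and the equal-volume setup are all hypothetical, and the observed event count stands in for the standardized infection ratio under the assumption that every hospital has the same expected count.

```python
import random


def flag_power(true_rates, volume, n_flag, reps=500, seed=0):
    """Estimate how often the truly worst hospitals end up in the flagged group.

    true_rates: hypothetical true event rate for each hospital.
    volume:     procedures per hospital (assumed equal for all hospitals).
    n_flag:     number of hospitals flagged, e.g. the worst quartile.
    """
    rng = random.Random(seed)
    hospitals = range(len(true_rates))
    # Hospitals that truly belong to the worst group.
    by_true_rate = sorted(hospitals, key=lambda h: true_rates[h], reverse=True)
    truly_worst = set(by_true_rate[:n_flag])

    hits = 0
    for _rep in range(reps):
        # Simulate a binomial event count for every hospital.
        events = [
            sum(rng.random() < p for _case in range(volume))
            for p in true_rates
        ]
        # Rank by observed events; ties are broken at random.
        order = sorted(
            hospitals, key=lambda h: (events[h], rng.random()), reverse=True
        )
        hits += len(set(order[:n_flag]) & truly_worst)
    return hits / (reps * n_flag)


if __name__ == "__main__":
    rates = [0.05, 0.05, 0.05, 0.30]  # hypothetical: one clearly worse hospital
    for volume in (5, 50, 500):
        print(volume, flag_power(rates, volume, n_flag=1))
```

With these made-up rates, the estimated power rises with volume, which mirrors the qualitative finding stated in the abstract.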

October 7

Maya Ramchandran
Doctoral Student, Department of Biostatistics, Harvard University

"Tree-Weighting for Multi-Study Ensemble Learners"
ABSTRACT: Multi-study learning uses multiple training studies, separately trains classifiers on each, and forms an ensemble with weights rewarding members with better cross-study prediction ability. This article considers novel weighting approaches for constructing tree-based ensemble learners in this setting. Using Random Forests as a single-study learner, we compare weighting each forest to form the ensemble with extracting the individual trees trained by each Random Forest and weighting them directly.

We find that incorporating multiple layers of ensembling in the training process by weighting trees increases the robustness of the resulting predictor. Furthermore, we explore how ensembling weights correspond to tree structure, to shed light on the features that determine whether weighting trees directly is advantageous. Finally, we apply our approach to genomic datasets and show that weighting trees improves upon the basic multi-study learning paradigm.

October 21

Jill Lundell
Research Fellow, Department of Biostatistics, Harvard T.H. Chan School of Public Health

"Exploring the Effects of Heritability in Genome-wide Association Studies"
ABSTRACT: Heritability is the amount of phenotypic variation due to genotype. GWAS seeks to find SNPs that are related to a particular phenotype. Despite a likely connection between heritability and the ability to identify potentially important SNPs, heritability is often overlooked in GWAS. I will share some observations I have made using simulated data to examine the ability of different GWAS methods to find functional SNPs. GWAS were done using data with weak, moderate, and strong heritability. Some of the methods tested are conditional logistic regression, LASSO, and random forests.

October 28 (FXB G10)

Amy Zhou
Doctoral Student, Department of Biostatistics, Harvard University

"BioBERT: Adaptation of BERT for Biomedical Text Mining (Journal Club)"
ABSTRACT: With the proliferation of biomedical documents, natural language processing (NLP) is a useful, increasingly relevant methodology for extracting information from biomedical texts. I will be presenting the paper on BioBERT, a domain-specific language model pre-trained and tailored for biomedical texts. We will discuss the differences between BioBERT and BERT (Bidirectional Encoder Representations from Transformers), including details about the pre-training and fine-tuning process, and overall performance on biomedical text mining tasks.

November 4

Cathy Wang
Doctoral Student, Department of Biostatistics, Harvard University

"A Comprehensive Validation of Models for Prediction of Mismatch Repair Gene Mutations"
ABSTRACT: Lynch syndrome, the most common hereditary colorectal cancer syndrome, is caused by germline mutations in the mismatch repair (MMR) genes. Prediction models including Leiden, MMRpredict, PREMM5, and MMRpro are used to predict the probability of an individual carrying a mutation in the MMR genes. Recently, MMRpro was updated with new penetrance estimates of the MMR genes on colorectal cancer, and, to date, these have not been validated. The purpose of this study is to evaluate the predictive performance of the four models in individuals with a family history of colorectal and endometrial cancer. We performed a validation study of the four models on 565 individuals from clinic-based families in the United States. Risk prediction based on the four models was compared to germline testing results and evaluated for discrimination, calibration, and predictive accuracy. These models can serve as useful tools for identifying individuals who are at high risk of carrying a Lynch syndrome mutation. We recommend that clinicians and genetic counselors use these models in an informed manner to better implement effective management and targeted surveillance strategies for individuals with Lynch syndrome.

November 18

Eric Dunipace
Doctoral Student, Department of Biostatistics, Harvard University

"Interpretable Posterior Summaries Using the Wasserstein Distance"
ABSTRACT: In the current computing age, models can have hundreds or even thousands of parameters; however, such large models make it harder to interpret and communicate individual parameters. Reducing the dimensionality of the parameter space in the estimation phase is a commonly used technique, but less work has focused on selecting subsets of the parameters to focus on for interpretation, especially in Bayesian settings. To address this gap, we introduce a new method that uses the 2-Wasserstein distance to select a subset of the parameter space for interpretation. After estimating a posterior distribution, users can budget how many parameters they wish to interpret, and our method selects a reduced posterior of the desired dimension that minimizes the distance to the full posterior. We provide simulation results demonstrating the effectiveness of the proposed method and apply the method to cancer data.
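
For readers unfamiliar with the 2-Wasserstein distance, the following minimal sketch computes it in the simplest case: two one-dimensional empirical samples of equal size, where the optimal coupling pairs the sorted values. It illustrates the distance only, not the selection method from the talk; the function name `wasserstein2_1d` is hypothetical.

```python
import math


def wasserstein2_1d(xs, ys):
    """2-Wasserstein distance between two equal-size 1-D empirical samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    # In one dimension the optimal transport plan matches order statistics.
    mean_sq = sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
    return math.sqrt(mean_sq)


if __name__ == "__main__":
    draws_a = [0.0, 1.0, 2.0]
    draws_b = [3.0, 1.0, 2.0]  # the same values shifted by 1, in another order
    print(wasserstein2_1d(draws_a, draws_b))
```

In higher dimensions the sorted-pairing shortcut no longer applies and the distance generally requires solving an optimal transport problem.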

November 25 (Canceled)

Eric Cohn
Doctoral Student, Department of Biostatistics, Harvard University

"Micro-Randomized Controlled Trials and Cancer Research"
ABSTRACT: The micro-randomized controlled trial (micro-RCT) is an experimental design used to develop and evaluate just-in-time, adaptive interventions. These designs have diverse applications—from rapid-cycle evaluations of mobile health interventions to optimizing cellphone-administered surveys to maximize response rates—many of which are relevant to cancer research. In this talk, I will present an overview of the micro-RCT design, motivate its use in public health and cancer research more specifically, and summarize some methodological and statistical considerations relevant to these designs.

December 2

Jonathan Luu
Doctoral Student, Department of Biostatistics, Harvard University

"CONOS: Clustering on Network of Samples (Journal Club)"
ABSTRACT: Single-cell RNA sequencing (scRNA-seq) is a powerful approach to learning about cancer at the resolution of individual cells. However, many scRNA-seq studies include multiple individuals, conditions, and tissues, which present technical and conceptual analysis challenges. Furthermore, recent consortium efforts have been introduced to generate atlases of these single-cell datasets with thousands of systematically different samples. Recent alignment methods, although flexible, were designed for relatively small samples. CONOS presents a graphical approach to digest these large heterogeneous samples, identify recurrent cell populations, and analyze the relationships between cells in different samples.

December 9

Alejandro Reyes
Research Fellow, Department of Biostatistics, Dana-Farber Cancer Institute and Harvard T.H. Chan School of Public Health

"A Survey of Genome Topology in Colorectal Cancer Reveals Large-scale Compartmental Changes that Restrain Malignant Progression"
ABSTRACT: Widespread changes to DNA methylation and histone modifications are well documented in cancer, but the fate of higher-order chromosomal structure remains obscure. Here we integrated topological maps for colon tumors and normal colons with epigenetic and transcriptional data to characterize alterations to chromatin loops, topologically-associated domains and large-scale compartments. Tumors exhibit profound compartmental reorganization, losing the normal spatial partitioning between the canonical open and closed genome compartments. This reorganization is accompanied by compartment-specific DNA hypomethylation and chromatin state changes. We also identified a novel compartment at the interface between canonical compartments distinguished by unique chromatin state and tumor-associated changes. Remarkably, the compartmental shifts are actually shared features of cells that have accumulated excess divisions. They likely restrain malignant progression by repressing genes linked to stem cell proliferation and invasion, while inducing anti-tumor immunity genes. Our findings call into question the conventional view that tumor-associated epigenomic changes are primarily oncogenic.

December 16

Gopal Kotecha
Doctoral Student, Department of Biostatistics, Harvard University

"Uncertainty Directed Factorial Clinical Trials"
ABSTRACT: Modern clinical medicine has seen an increase in the delivery of combinations of treatments, but many of these combinations and potential alternatives have not been evaluated in a trial setting. To address this evaluation gap, we introduce a Bayes-adaptive trial design for the factorial setting. This class of trial designs uses a decision theoretic framework to randomize patients to treatments. Treatment assignment is carried out with the aim of maximizing expected utility in order to determine the best treatment combination from a large space of potential treatment combinations at the end of the trial. We use Bayesian models to model the data and explicit information metrics tailored to the problem to provide accuracy measures of the final selection of optimal treatment combinations. The proposed approach and the simulated scenarios used in the evaluation of these factorial Bayesian uncertainty directed (BUD) designs are motivated by a pragmatic trial at our institution that uses electronic health record-based decision support to reduce inappropriate prescribing for older adults.

February 3

Eric Cohn
Doctoral Student, Department of Biostatistics, Harvard University

"Micro-Randomized Controlled Trials and Cancer Research"
ABSTRACT: The micro-randomized controlled trial (micro-RCT) is an experimental design used to develop and evaluate just-in-time, adaptive interventions. These designs have diverse applications—from rapid-cycle evaluations of mobile health interventions to optimizing cellphone-administered surveys to maximize response rates—many of which are relevant to cancer research. In this talk, I will present an overview of the micro-RCT design, motivate its use in public health and cancer research more specifically, and summarize some methodological and statistical considerations relevant to these designs.

February 10

Margaux Hujoel
Doctoral Student, Department of Biostatistics, Harvard University

"Identifiability of Cancer-Resistance Genotypes"
ABSTRACT: There is an open question as to whether cancer-resistant genotypes exist (Klein 2009 PNAS). Although individuals who don't develop cancer may lack mutations that make them susceptible, there is an alternative hypothesis that some of these individuals may in fact have genotypes that make them resistant to cancer. Rather than viewing cancer resistance as a "mirror image" of susceptibility, we aim to study cancer resistance as a distinct entity. Through simulation studies, we try to address two main questions. If a small group of individuals were resistant to cancer, would we be able to identify that such a group existed? And if so, could we estimate the prevalence of this resistance genotype?

February 24

Daniel Li
Doctoral Student, Department of Biostatistics, Harvard University

"Regularized Best Subset Selection: Concepts for Polygenic Risk Scores and Prediction"
ABSTRACT: We will talk about recent developments with regularized best subset selection. We will go over empirical and theoretical results, and discuss important ideas and concepts for applications with a focus on polygenic risk scores.

March 2

Cathy Wang
Doctoral Student, Department of Biostatistics, Harvard University

"Multi-Study Semi-Supervised Learning: A First Look"
ABSTRACT: In many machine learning applications, the gold standard label is difficult or expensive to obtain. Semi-supervised learning (SSL) methods leverage unlabeled data to improve a model’s performance when only limited labeled data is available. We investigate the replicability of the performance of SSL classifiers in a multi-study setting.

Going forward, starting with the March 23 meeting, please download and import the following iCalendar (.ics) files to your calendar system.

Weekly: https://harvard.zoom.us/meeting/uJMoc--qqjwvdL4F-POulE_jlE5PLZqL6w/ics?icsToken=98tyKuyvqz8sGNCStVz9f6kqW8H8b_H2lHVi_oUQrDDwDwVsaA_TY9JuCKNTRs-B
If you wish to join the meeting, you may request a passcode from the working group organizer, Dr. Jill Lundell (contact information above).

Join Zoom meeting
https://harvard.zoom.us/j/575462475

Join by telephone (use either number to dial in)
+1 929 436 2866
+1 669 900 6833

International numbers available: https://harvard.zoom.us/u/abmuvT28s

One tap mobile: +19294362866,,575462475# US (New York)

Join by SIP conference room system
Meeting ID: 575 462 475
575462475@zoomcrc.com

March 23

Amy Zhou
Doctoral Student, Department of Biostatistics, Harvard University

"Dirichlet Process Mixture Modeling"
ABSTRACT: Dirichlet process mixture modeling (DPMM) is a Bayesian nonparametric method. In this talk, I will present an overview of the DPMM and discuss its application to clustering problems in cancer research.
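
As background on how a Dirichlet process induces clusters, the sketch below draws cluster labels from the Chinese restaurant process, the standard sequential description of the partition implied by a Dirichlet process prior. It covers the prior on assignments only, with no likelihood or posterior inference; the function name `crp_assignments` and the concentration value are illustrative assumptions, not taken from the talk.

```python
import random


def crp_assignments(n, alpha, seed=0):
    """Draw cluster labels for n items from a Chinese restaurant process.

    alpha is the concentration parameter (must be positive); larger values
    tend to produce more clusters.
    """
    rng = random.Random(seed)
    counts = []  # counts[k] = number of items currently in cluster k
    labels = []
    for _ in range(n):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    labels, counts = crp_assignments(n=20, alpha=1.0)
    print(labels)
    print(counts)
```

The number of clusters is not fixed in advance; it grows with the data, which is the property that makes the model nonparametric.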

March 30

Maya Ramchandran
Doctoral Student, Department of Biostatistics, Harvard University

"A Clustered Cross-validation Weighted Generalization of Random Forest"
ABSTRACT: This project considers extending the general multi-study ensembling framework proposed by Parmigiani and Patel (2017) to single datasets, with Random Forest as the learner. Specifically, we look at using clustering algorithms to first split a dataset into regions that maximize feature-effect heterogeneity across clusters and minimize it within. We then train Random Forest learners on each cluster, and then ensemble them using replicability weights. We then explore in which settings this method outperforms training a single Forest on the full dataset and how it compares to learners trained using the oracle clusters.

April 6

Jill Lundell, Ph.D.
Research Fellow in Biostatistics, Harvard T.H. Chan School of Public Health

"Using Wavelets to Find Trends in Genetic Data"
ABSTRACT: Wavelets provide a way to look for different types of signals in all types of data. I have been investigating the ability of wavelets to automatically detect changes in methylation and GWAS data. Wavelets require that data are evenly spaced, which is not the case for genetic data. I am looking at established and new methods for modifying wavelets to handle the unequally spaced genetic data and identify both broad and short patterns.

April 13

Eric Dunipace
Doctoral Student, Harvard University

"Hamiltonian Paths for Multivariate Multisample Testing"
ABSTRACT: The need to test whether two or more distributions are different from each other arises in many statistical fields. In the univariate setting, several non-parametric tests exist such as the Kolmogorov-Smirnov test, Wilcoxon rank sum test, and the Wald-Wolfowitz runs test, among others; however, fewer options exist in the multivariate setting. We explore the use of Hamiltonian paths to construct one-dimensional curves that allow us to apply these non-parametric tests, and we compare this approach with some other methods. Early results indicate that tests on Hamiltonian paths detect scale changes in high dimensions well but perform more poorly with location changes.
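
The general idea of ordering multivariate points along a path and then applying a one-dimensional statistic can be sketched as follows. The abstract does not say how the path is built, so this sketch assumes a greedy nearest-neighbor heuristic (which visits every point once but is not a shortest path) and uses the number of runs of sample labels along the path as the statistic; `greedy_path` and `count_runs` are hypothetical names.

```python
import math


def greedy_path(points):
    """Visit every point exactly once, always moving to the nearest unvisited one."""
    if not points:
        return []
    unvisited = set(range(1, len(points)))
    path = [0]  # start arbitrarily at the first point
    while unvisited:
        last = points[path[-1]]
        # Break distance ties by index so the result is deterministic.
        nxt = min(unvisited, key=lambda j: (math.dist(last, points[j]), j))
        unvisited.remove(nxt)
        path.append(nxt)
    return path


def count_runs(labels_in_order):
    """Count maximal blocks of identical consecutive labels."""
    runs = 0
    previous = object()  # sentinel that equals no label
    for label in labels_in_order:
        if label != previous:
            runs += 1
            previous = label
    return runs


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    sample = [0, 0, 1, 1]  # which sample each point came from
    order = greedy_path(points)
    print(order, count_runs([sample[i] for i in order]))
```

As in the univariate runs test, an unusually small number of runs along the path suggests that the samples come from different distributions; a reference distribution for the count could be obtained by permuting the labels.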

April 20

Gopal Kotecha
Doctoral Student, Harvard University

"A Bayesian Uncertainty Directed Factorial Design"
ABSTRACT: Modern clinical medicine has many scenarios where there are multiple treatments for the same indication. When these treatments can be delivered together, the large number of potential treatment combinations means that it can be difficult to learn about them all. Often the primary quantity of interest is the best treatment combination for a given patient population. To address this evaluation gap, we introduce a Bayes-adaptive trial design for the factorial setting. While traditional factorial designs balance the assignment probabilities to each arm, our design uses a decision theoretic framework to adjust the probabilities with which patients are randomized to treatment combination arms. Treatment assignment is carried out with the aim of maximizing expected utility at the end of the trial, according to some predefined utility function. We model the data with a Bayesian model and define a map from the model to the randomization probability. We discuss potential choices for utility functions and their resulting trials, as well as the computational approximations required to carry out the trial.

April 27

Jonathan Luu
Doctoral Student, Harvard University

"A Framework for the Continual Reassessment Method (Journal Club)"
ABSTRACT: The continual reassessment method (CRM) is a model-based design for phase I clinical trials which aims to find the maximum tolerated dose (MTD) of a new therapy. The CRM has been shown to be more accurate in targeting the MTD than traditional rule-based approaches, such as the 3+3 design which is used in a majority of phase I trials. Furthermore, the CRM has been shown to assign more trial participants closer to the MTD. However, the CRM’s uptake in clinical research has been slow, putting drug development and patients at risk. The authors of this paper have developed a framework and collected resources, aimed at clinicians and statisticians new to the CRM design, to improve the uptake of the CRM in phase I dose-finding trials.

May 4

Peter Park, Ph.D.
Professor of Biomedical Informatics, Harvard Medical School

"Mutational Signature Analysis and its Application to the Clinic"
ABSTRACT: Different mutational processes operative in cancer and other diseases leave distinct 'signatures' in the DNA. Mutational signature analysis is an attempt to deconvolve the mutational patterns from cancer sequencing data to better identify the factors that gave rise to cancer. Whereas previous work required a large amount of signal as found in exome and genome sequencing data, our new method SigMA enables accurate detection of mutational signatures even with >100-fold reduction in data size. This allows us to extend signature analysis to gene panels, the common platform used to profile tens of thousands of cancer patients each year. I will describe the methodology behind SigMA and how it can be used to identify patients with deficiency in the homologous recombination DNA repair pathway who should be considered for treatment with PARP inhibitors. I will also describe other projects in my laboratory.

May 11

To Be Announced


"Talk Title TBD"
ABSTRACT: None Given



Maintained by the Biostatistics Webmaster
Last Update: April 29, 2020