Computational and Data Science Research
Computational and Data Science is one of the key service cores in CHOC’s Research Institute.
We are computational scientists, data scientists and biostatisticians driven by the fulfillment of improving the health of children in our communities and across the United States.
We retrieve, analyze and interpret data to help doctors, nurses and administrators improve decision-making and clinical outcomes.
We build computational and data science models to predict undesirable clinical events before they occur and with sufficient time for intervention.
We conduct world-class research and publish findings in top-tier journals in medicine, statistics and data science.
We assist other CHOC research groups and individuals and support graduate students from local universities with data and computational tools for analyses of all modalities of data at CHOC.
We hold degrees across all academic levels and collaborate with academic and medical institutions across the US and abroad.
Contact us at [email protected].
The Computational and Data Science Research Team
Terence Sanger, PhD, MD – VP, Chief Scientific Officer
Dr. Terence Sanger, MD, PhD, holds an SM in Applied Mathematics (Harvard), a PhD in Electrical Engineering and Computer Science (MIT), and an MD (Harvard), with medical specialization in Child Neurology and Movement Disorders. He is currently Professor of Electrical Engineering at the University of California, Irvine (UCI); Vice Chair of Research, Pediatrics (UCI); Director of the Pediatric Movement Disorders Clinic and Deep Brain Stimulation Program at Children’s Hospital of Orange County (CHOC); and Vice President, Chief Scientific Officer at CHOC. (Dr. Sanger is a member of CHOC’s medical staff and is tenured faculty in the department of pediatrics at UC Irvine.)
Prior to CHOC, Dr. Sanger served as Provost Professor in the biomedical engineering, neurology and biokinesiology departments at the University of Southern California. He was an attending neurologist at Children’s Hospital Los Angeles, where he served as Director of the Pediatric Movement Disorders Program, the David Lee and Simon Ramo Chair in Health Sciences and Technology and the Founding Director of the Health, Technology and Engineering Program at The University of Southern California. Previously, he was a tenured Professor of child neurology at Stanford University and on medical staff at Lucile Packard Children’s Hospital.
Dr. Sanger’s research focuses on understanding the origins of pediatric movement disorders from both a biological and a computational perspective. The primary goal of his research is to discover new methods for treating children with disorders of developmental motor control, including dystonia, chorea, ataxia, spasticity, and dyspraxia. His research includes computational neuroscience and large-scale neural circuit modeling of basal ganglia and cerebellum, nonlinear signal processing, machine learning, and control theory applied to robot models of motor disorders, and processing of electrophysiological data from children with implanted electrodes. Ongoing research also includes the development of electromyography-controlled soft exoskeleton orthotics for assistance with upper limb movement in children with cerebral palsy.
Phuong Dao, JD – Executive Director, Research Operations
Phuong Dao is the Executive Director of Research Operations for the CHOC Research Institute. Phuong has worked in healthcare for over 20 years, specializing in research administration, research compliance, and corporate compliance. Before her current position, Phuong served as the Corporate Integrity Officer for the Seattle Cancer Care Alliance (SCCA) in Seattle, WA, and as the Research Development and Operations Manager in the Department of Medicine at UC Irvine. Phuong earned a J.D. from Seattle University and a B.A. in Biochemistry from Smith College.
Louis Ehwerhemuepha, PhD – Director, Computational Research
Louis Ehwerhemuepha, PhD, is the Director of the Research Computational Science (Computational Research) team at CHOC under the executive leadership of Phuong Dao, JD, and Terence Sanger, MD, PhD. Louis leads the research data science program at CHOC, focusing on a wide breadth of computational and data science research, from applied statistical learning on structured electronic medical records to deep learning for computer vision and natural language processing in pediatric medicine. The team he leads consists of members with different specializations in Data Science and Statistics and provides research data services to other researchers at CHOC. His applied research is varied and encompasses hospital readmission, sepsis, COVID-19 and multisystem inflammatory syndrome in children (MIS-C), artificial intelligence for rare diseases, population health management, and management of pediatric complex chronic conditions such as neurological and cardiovascular chronic conditions. He has led deployment of various statistical and machine learning models in the electronic medical record (EMR), resulting in improved quality of care. He collaborates closely with the UCI faculty in Statistics and Data Science to advance Data Science research at CHOC and UCI.
Peyman Kassani, PhD – Senior Research Computational Scientist
Peyman H. Kassani holds a B.Sc. in applied mathematics and an M.Sc. in computer science. Peyman obtained his doctoral degree in computer vision and pattern recognition at Yonsei University in Seoul, South Korea. He received two years of postdoctoral training at Tulane University in computational neuroimaging with causal discovery and two years of postdoctoral training at Stanford University in computational genomics with explainable deep neural networks. He is currently with Children’s Hospital of Orange County and also collaborates closely with UCI faculty in statistics and data science. The main focus of his research at CHOC will be computational neuroscience (imaging and genomics) across different modalities in children. His research interests include causal discovery, explainable deep neural networks, and sparse learning.
Chloe Martin-King, PhD – Research Computational Scientist II (Imaging)
Chloe Martin-King has a PhD in Computational and Data Sciences from Chapman University, Orange, CA. Her scholarly work centers on digital image restoration with both traditional image processing techniques and computer vision using deep learning. She spent a year as a postdoctoral scholar in the Radiomics Lab at USC before joining the Research Computational Science team at CHOC in 2021. She enjoys applying her image processing and deep learning background to clinically relevant research. She has developed models for various medical tasks, including automatic detection, segmentation, and classification of disease from several modalities such as whole slide imaging, x-ray, and CT.
Tatiana Moreno, BS – Research Computational Scientist I (NLP)
Tatiana Moreno is a graduate of the University of California, Santa Barbara in Statistics and Data Science. She worked as a Student Computing (IT) Assistant for the Kavli Institute for Theoretical Physics while in Santa Barbara. Since graduating in 2019, she has worked at CHOC as a Data Science Intern, Data Support Assistant and now Research Computational Scientist I (RCS I). Her publications include one in Nature on a model for identifying pediatric trauma patients at high risk of prolonged hospital length of stay, and one on a super learner ensemble of 14 statistical learning models for predicting COVID-19 severity among patients with cardiovascular conditions. Her main research pursuits in her RCS I position at CHOC include assessing unplanned hospital readmission (UHR) in CHOC’s neonatal population, examining risk factors of adverse childhood experience (ACE) scores and their association with healthcare utilization, and investigating changes in inpatient mental health presentation pre- versus post-COVID-19 using NLP on clinical notes.
Ricardo Aguilar, MS – Research Computational Scientist I (Biostatistics)
Ricardo Aguilar is a graduate of the University of California, Los Angeles with an MS degree in Biostatistics. His academic research focused on identifying interaction terms in correlated, high-dimensional data using a feature selection machine learning algorithm. He has 5 years of experience as a statistical consultant, during which he worked in various areas of research such as life sciences, community health sciences, education, and agriculture. He is now a Research Computational Scientist I at CHOC, where he supports research through inferential statistics, statistical and machine learning model building, and study design. Additionally, he catalogs and investigates databases to expand the scope of research conducted at CHOC.
Quinn Gates, MS – Research Computational Scientist, Grant-Funded
Quinn Gates is a PhD candidate at Chapman University, where he is pursuing a degree in Computational and Data Sciences. He has a bachelor’s degree in Mathematics and a master’s degree in Computational and Data Sciences, also from Chapman University. He has taught college-level calculus and has led multiple research teams exploring topics such as the effects of monetary rewards for exercise and the use of specialized neural networks to identify and classify cell types in colorectal histological whole slide images. Currently, he is an associate data scientist working with CHOC Hospital to study the effects of sociodemographic factors on asthmatic adolescents.
Current Graduate/Medical Student Researchers
- Allyson McDaniel, BSN, RN, CDCES
- Rachel Sanchez, BSN, RN, CDCES
- Reyna Gamboa-Perez, BSN, RN, CDCES
- Lisa Jenson, BSN, RN, CDCES
- Samantha Thompson, RD, CDCES
- Rebeca Quintana, BSN, RN, CDCES
Past Graduate/Medical Student Researchers
- Leah Blalock, MS, RD, CDCES
- Claire Vercammen, BSN, RN, CDCES
- Brenda Amador-Rivera, BSN, RN, CDCES
- Megan Huang, MSN, RN
- Katia Torosian, BSN, RN, CDCES
Key Collaborators
- Amrit Bhangoo
- Anthony Chang
- Bill Feaster
- Carol Davis-Dao
- Charles Golden
- Christina Reh
- Christine Chou
- Dan Cooper
- Daniel Shrey
- Stanley Galant
- Donald Philips
- Heather Huszti
- Hollie Lai
- Jessica Brown
- Joan Devin
- Julian Thomas
- Kathy Huen
- Kenneth Grant
Pediatric and Lifespan Data Science Conference
April 10, 2025 – April 11, 2025
Our History
Terence Sanger, MD, PhD, and Phuong Dao, JD, founded the Research Computational Science Program and developed it alongside Louis Ehwerhemuepha, PhD, in August 2021.
Dr. Ehwerhemuepha led development of a CHOC Data Science Program between June 2015 and July 2021. He supported Drs. Bill Feaster and Anthony Chang during that period and continues doing so with expanded resources from the Research Institute.
Retrospective statistical analyses of medical data
What happened? Why did it happen? How can we prevent it from happening again if undesirable?
COVID-19
We are supporting multiple providers within CHOC as well as collaborations with researchers across other institutions in the US including researchers from the CDC. Themes of research questions we address include:
- Predisposition to severe disease
- Prediction of post-acute sequelae of SARS-CoV-2 (PASC), also known as long COVID
- Exacerbation of preexisting conditions post COVID-19
- Prediction of organ dysfunction/failure
- Multisystem inflammatory syndrome in children (MIS-C)
- Special focus on selected at-risk populations (Asthma, Cardiology, Cystic Fibrosis and Sickle Cell Disease)
Adverse Childhood Experiences (ACEs)
Supporting the CHOC-UCI research initiative in Public Health led by multiple specialties at CHOC and faculty from UCI School of Public Health. Goal is to examine adverse childhood experiences (ACEs) in Southern California and risk factors of exacerbation as measured by healthcare utilization. Cohort includes patients assessed for ACE in the Emergency Department and Primary Care Clinics at CHOC.
Epilepsy (predictors of intractable epilepsy)
To predict patients who will develop intractable epilepsy three months before its development. Knowing which patients are likely to develop intractable epilepsy will allow physicians and patients to develop an appropriate care plan. The model will be fitted using CHOC patient data (demographic, medical history and clinical notes).
Rare diseases
Juvenile Dermatomyositis (JDM) is a rare multifaceted autoimmune disease that usually presents with a characteristic rash and symmetrical proximal muscle weakness and may impact nailfold capillary end row loops (ERL). In this series of studies, we are identifying serological markers of disease activity as well as developing models for predicting disease course after initial clinic consultation.
Clinical nutrition
Dietary diversity: Supporting the Clinical Nutrition and Lactation team to assess dietary intake, oral supplement dependence, and dietary diversity among CHOC patients with pediatric feeding disorders. The findings from this study will be used to improve dietary recommendations for patients with feeding disorders to reduce their frequency of nutrient inadequacies. CHOC patient dietary and demographic data were modeled using quasi-Poisson regression to investigate how these characteristics affect nutrient inadequacies.
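As a rough illustration of what quasi-Poisson regression does, the sketch below fits a Poisson log-linear model with one predictor and then estimates a dispersion parameter from the Pearson chi-square statistic, which inflates the standard errors when counts are more variable than a Poisson model expects. The function name, the single-predictor setup and the numbers are all assumptions made for illustration; they are not the team's model or CHOC data.

```python
import math

def fit_quasi_poisson(x, y, iterations=25):
    """Fit log(mean) = b0 + b1 * x by Newton's method, then estimate dispersion."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit

    def information(mu):
        # Entries and determinant of the 2x2 Fisher information (log link).
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        return i00, i01, i11, i00 * i11 - i01 * i01

    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score (gradient of the Poisson log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        i00, i01, i11, det = information(mu)
        # Newton step: solve the 2x2 system information * delta = score.
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det

    mu = [math.exp(b0 + b1 * xi) for xi in x]
    i00, _, _, det = information(mu)
    # Dispersion = Pearson chi-square / residual degrees of freedom.
    pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    dispersion = pearson / (len(y) - 2)
    se_poisson = math.sqrt(i00 / det)  # slope standard error under plain Poisson
    return {
        "b0": b0,
        "b1": b1,
        "fitted": mu,
        "dispersion": dispersion,
        "se_poisson": se_poisson,
        "se_quasi": se_poisson * math.sqrt(dispersion),
    }

# Made-up example: a dietary diversity score (x) and a count of nutrient
# inadequacies (y) for ten hypothetical patients.
diversity = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
inadequacies = [9, 4, 7, 2, 5, 1, 3, 0, 2, 0]
fit = fit_quasi_poisson(diversity, inadequacies)
print(fit["b1"], fit["dispersion"], fit["se_quasi"])
```

The coefficient estimates are identical to ordinary Poisson regression; only the standard errors change, scaled by the square root of the estimated dispersion.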
Healthcare utilization
- Mental health rehospitalization: development of models using structured data to predict risk of rehospitalizations that will drive outpatient mental health interventions.
- Rising risk model: Predicting which patients are likely to become a rising risk for increased resource utilization in the following year. The results will assist clinicians in identifying which patients will need additional resources within the next year. The data will consist of demographics, historical diagnoses and mental health characteristics of CHOC patients.
- ED return visits: Assessing potential novel improvements to the difficult task of predicting ED return visits, to optimize utilization and reduce waste.
Thyroid diseases and evaluation scales (individualized normal thresholds for TSH, T3 and T4)
Determining whether intra-subject variation of thyroid laboratory test levels is significantly smaller than the inter-subject variation. Findings from this study may be used to justify patient-specific ranges of appropriate laboratory test levels instead of population-wide ranges. This will be investigated through an analysis of variance components in a linear mixed model using parametric bootstrap.
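The comparison of intra-subject and inter-subject variation can be sketched in a few lines. The example below is a simplification under stated assumptions: instead of fitting a linear mixed model, it uses the classical ANOVA (method-of-moments) estimator for a balanced one-way random-effects design, and it uses a parametric bootstrap to put a percentile interval on the share of variance that lies between subjects. The function names and the laboratory values are hypothetical and are not CHOC data.

```python
import random
from statistics import mean

def variance_components(groups):
    """ANOVA estimates for a balanced one-way random-effects model."""
    n, k = len(groups), len(groups[0])  # subjects, measurements per subject
    subject_means = [mean(g) for g in groups]
    grand_mean = mean(subject_means)
    msw = sum((v - m) ** 2
              for g, m in zip(groups, subject_means) for v in g) / (n * (k - 1))
    msb = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    within = msw                         # intra-subject variance
    between = max(0.0, (msb - msw) / k)  # inter-subject variance
    return grand_mean, between, within

def between_share(between, within):
    """Fraction of total variance attributable to differences between subjects."""
    return between / (between + within)

def bootstrap_interval(groups, draws=500, seed=0):
    """Parametric bootstrap: simulate from the fitted model and re-estimate."""
    rng = random.Random(seed)
    n, k = len(groups), len(groups[0])
    mu, between, within = variance_components(groups)
    shares = []
    for _ in range(draws):
        simulated = []
        for _ in range(n):
            subject_effect = rng.gauss(0.0, between ** 0.5)
            simulated.append([mu + subject_effect + rng.gauss(0.0, within ** 0.5)
                              for _ in range(k)])
        _, b, w = variance_components(simulated)
        shares.append(between_share(b, w))
    shares.sort()
    return shares[int(0.025 * draws)], shares[int(0.975 * draws) - 1]

# Made-up repeated test values for five hypothetical subjects.
labs = [
    [1.1, 1.3, 1.2, 1.0],
    [2.4, 2.6, 2.5, 2.7],
    [1.8, 1.7, 1.9, 2.0],
    [3.1, 2.9, 3.0, 3.3],
    [0.9, 1.0, 0.8, 1.1],
]
grand_mean, between, within = variance_components(labs)
low, high = bootstrap_interval(labs)
print(between, within, (low, high))
```

When the between-subject share is close to 1, each subject's values cluster tightly around a personal level, which is the pattern that would support patient-specific reference ranges.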
Neuromuscular
Investigating whether Hispanic children with Duchenne muscular dystrophy treated with steroids have a higher body mass index than their non-Hispanic counterparts. This study aims to determine whether Hispanic patients should only be treated using medications with a lower risk of weight gain.
Neonatal readmission
What is the relationship between gestational age, day of life at NICU admission and risk of readmission? In this study, we assess how gestational age and Day of Life (DOL) on neonatal intensive care unit (NICU) admission modify the risk of 30-day unplanned hospital readmission.
Asthma
To examine whether medical history (diagnoses, prescriptions, encounters, etc.) and emergency department encounter data (at triage) can be used to predict whether a patient will be admitted to the general floor, admitted to the ICU or discharged home during an ED visit for asthma exacerbation. Additionally, this study examines whether medical history and encounter data (from triage to discharge) can be used to predict whether patients will have a return visit within 14 days of being discharged home after an ED visit for asthma. This study aims to identify historical and at-encounter characteristics that may impact emergency department discharge disposition to improve patient outcomes.
Intensive care medicine
ICU late transfers and bounce backs: Assessing models to reduce late transfers to the ICU and premature discharge from the NICU. Goal is a real-time ICU model that informs providers on rapidly deteriorating patient conditions and likelihood for relapse after ICU treatment.
Machine learning prediction models
How can we learn using structured tabular electronic health records from the past to predict the future in real time and improve clinical outcomes?
Hospital readmission
Rebuilding and redeploying the hospital’s readmission model to address impact of COVID-19, distributional shifts and important sub-specialty patients including neonatal and mental health utilizations.
Rising risk model
Predicting which patients are likely to become a rising risk for increased resource utilization in the following year. Healthcare utilization is used as proxy for deterioration in health that may be preventable. Study involves analyses of confounding due to challenges with access to care.
Earliest warning system for sepsis
Updating and redeploying an ED triage model for predicting patients who are at risk for sepsis and may require early interventions.
Autism triage
Developing a model to determine whether a confirmatory or comprehensive autism evaluation is appropriate for a patient. This model will aid with triaging patients at the Thompson Autism and Neurodevelopmental Center.
Juvenile Dermatomyositis (JDM)
Patients undergoing treatment for JDM respond to treatment differently. In this study, we are developing models to predict early or late response to treatment to inform interventions that will increase the proportion of patients with early and sustained response to treatment.
CPAP Failure
Noninvasive oxygen therapy is preferred over more invasive procedures if it will suffice and for as long as it suffices. In this study, we are developing models to predict early needs to change oxygen therapy from CPAP to more invasive options. This includes determination that invasive therapy is appropriate from admission or predicting the optimal time to change from CPAP to ECMO or mechanical ventilators. Previous research has shown that late change in therapy can increase morbidity. However, unnecessary use of invasive oxygen therapy also increases morbidity.
Deep learning models and related artificial intelligence models
How can we learn from unstructured data (images, clinical notes, movies) to predict the future in real time and improve clinical outcomes?
Mental health predictions
- Disparity of care due to SES: Determining whether there are disparities in mental health outcomes by socioeconomic status using unstructured clinical/provider notes. Socioeconomic status will be inferred from the notes as well as health insurance payer type.
- Disparity of care among LGBTQ+ patients: Examining health care disparities and non-healthcare factors among LGBTQ+ patients that may exacerbate the impact of COVID-19 on mental health. Deidentified copies of clinical notes will be analyzed to extract clinical entities relating to LGBTQ+ status, putative risk factors and other statistically identified factors that may impact outcomes or are confounders therein.
Steroid free remission (inflammatory bowel disease)
Primary objective is to predict steroid-free remission at 12 and 52 weeks following pathology-confirmed diagnosis of ulcerative colitis or Crohn’s disease, using the corresponding tissue pathology. In addition, multimodal networks will be trained on images of tissue pathology, structured data and clinical notes from clinic visits and pathology. Findings will help inform treatment decisions and clinical practice guidelines for these patients.
Focal cortical dysplasia (FCD)
This study encompasses the detection and segmentation of abnormal lesions associated with FCD using MR imaging. FCD is the most common cause of intractable focal epilepsy in children, in which neurocognitive dysfunction and behavioral problems may also be present. The objective of this study is to improve pre-surgical detection and diagnosis of FCD from MRI using deep learning techniques. Furthermore, segmentation of regions of interest may guide surgical interventions as well as timing of surgery.
Rare diseases (Juvenile Dermatomyositis Nailfold Capillaroscopy Analyses)
- Predicting disease activity
- Detecting density of end-row capillary loop
Peripherally Inserted Central Catheter (PICC) line complications
Can we predict PICC line complications (such as infections) using cellphone images of the insertion site? Children with chronic/critical illnesses may require insertion of central venous catheters (CVCs) for delivery of medications and other intravenous fluids. These CVCs increase the risk of complications, including central line-associated bloodstream infections (CLABSI), and may expose patients to increased morbidity and mortality. Research identifying risk factors of CVC complications, and corresponding machine learning applications, has depended on structured electronic health records of hospitalized patients. Our multidisciplinary team of clinicians and data scientists aims to extend the type of data used for prediction of CVC complications to include images of PICC line insertion sites, which requires deep convolutional neural networks. Expected outcomes include significant improvements in predicting in-hospital CLABSI and novel models/applications for at-home monitoring.
Context-aware clinical notes summarization
Providers can learn a lot about a new patient’s history through conversations with the patient and family. In this study, we will be applying existing algorithms as well as developing new ones for context-aware clinical notes summarization, starting with sub-specialties and expanding to more general models. These summarizations will fill gaps inherent in verbal recollection of a patient’s history and ensure providers get information in the context most helpful for treatment.
Empirical analyses and computational intelligence
How can we solve computational problems requiring new numerical solutions to difficult optimization models?
Small samples with high dimensions
Development of new statistical learning algorithms for learning from conditions (such as rare diseases) wherein sample sizes are small but data is high dimensional including genomic data.
Deployment of models
How can we build the computational and data science systems required to deploy real-time clinical models, integrate with EHR workflow, and develop corresponding intervention protocols?
- Hospital readmission (proven to reduce readmission rates and reduce healthcare expenditure)
- Rising risk (we predict who will need more care within 12 months)
- Sepsis (we developed the earliest warning system for sepsis at ED triage and are expanding to real-time predictions over the ED and any corresponding hospital stay)
- Autism (we are developing models to help with triaging patients requesting evaluation for autism)
The RCS team is charged with supporting all members of the organization interested in research. As a result, each team member is often tasked with managing multiple projects as primary responsibility. Estimated timelines will vary by workload.
- Conception and Ideation (2 – 8 weeks)
  - An iterative process to reconcile the clinical and statistical significance of the study hypotheses or questions.
  - If the investigator is uncertain of what to study, we engage in discussions and perform a literature review to determine what is important to the investigator and what is pertinent to improving the field.
  - Once the investigator has chosen a question of interest, we work to refine ideas specific to the available data, and in consideration of statistical, machine learning or artificial intelligence (AI) model options.
  - Typical timeline: Highly variable, 2 to 8 weeks.
  - Completion of milestone: 1 to 2 sentences that clearly state a hypothesis or study question.
- Protocol development (4 – 8 weeks)
  - Study design: The team will work with investigators to design the study, ensuring there is a balance of both clinical and statistical significance.
  - Protocol development: The RCS team will draft the Methods and Statistical Consideration sections of the IRB protocol for submission.
  - Average timeline (without considering IRB review time): 4 to 8 weeks.
  - Completion of milestone: IRB approval (or waiver).
- Data retrieval (Timeline is variable and depends on scope of study)
  - Possibly the most time-intensive part.
  - The team will develop a data retrieval plan in collaboration with the investigator.
  - Data retrieval and preprocessing.
  - Internal validation of data.
  - Summary statistics and clinical review of data with the clinical team.
  - Average timeline: Variable and difficult to predict. The timeline will depend on data sources and scope of variables as well as potential for unexpected events during data retrieval. The RCS team will provide a timeline on a case-by-case basis.
  - Completion of milestone: Summary statistics as appropriate for study.
- Data analyses (Timeline is variable and depends on scope of study)
  - This involves appropriate statistical, machine learning and AI modeling constrained by appropriate clinical consideration, approved IRB protocol and study design.
  - Average timeline: Variable and dependent on the number of hours spent per week on the project. This may involve some iterations. The RCS team will provide a timeline on a case-by-case basis.
  - Completion of milestone: Adoption of statistical, machine learning or AI model.
- Manuscript assistance (2 – 4 weeks)
  - The team will draft the Methods and Results sections while the investigator is responsible for drafting the Introduction, Discussion and other sections of the paper.
  - Average timeline: 2 to 4 weeks.
  - Completion of milestone: Draft of Methods and Results sections.
- Peer review revisions
  - The team will assist in peer review revisions that involve changes in models developed or study design as appropriate.
  - Average timeline: Unpredictable.
- Closing of project or study
  - This occurs after publication and/or presentation of results.
CHOC EMR Data
EMR data may be classified by their format. Structured data are tabular data that may fit into an SQL (Structured Query Language) table. Unstructured data include clinical notes, other free-text data, images and movies.
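To make "fits into an SQL table" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `lab_results` table, its columns and its rows are hypothetical and chosen only for illustration; they do not reflect the schema of any CHOC database.

```python
import sqlite3

# An in-memory database stands in for a real research data source.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE lab_results (
        patient_id   INTEGER,
        test_name    TEXT,
        value        REAL,
        collected_on TEXT
    )
    """
)

# Each row is one discrete, typed observation: this is what makes the data structured.
rows = [
    (1, "TSH", 2.0, "2021-01-05"),
    (1, "TSH", 3.0, "2021-06-10"),
    (2, "TSH", 1.0, "2021-02-14"),
    (2, "T4", 8.0, "2021-02-14"),
]
conn.executemany("INSERT INTO lab_results VALUES (?, ?, ?, ?)", rows)

# Because the columns are fixed, summaries are a single declarative query.
summary = conn.execute(
    """
    SELECT test_name, COUNT(*), AVG(value)
    FROM lab_results
    GROUP BY test_name
    ORDER BY test_name
    """
).fetchall()
print(summary)
conn.close()
```

A clinical note or an image has no such fixed columns, which is why unstructured data generally need natural language processing or computer vision before they can be summarized this way.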
Structured (tabular) data
The RCS team provides access to this type of data using instances of CHOC’s HealtheIntent database. This includes diagnosis codes, medications, laboratory test values, vital signs and discrete elements captured from certain forms. There are other sources of data that the RCS team may use, as needed, to support your research.
Unstructured data
- PACS images: please reach out to [email protected] if you have research questions that require medical imaging.
- Clinical notes: please reach out to [email protected] if you have research questions that require automated natural language processing of clinical notes. We will let you know which type of research or information retrieval we can automate using natural language processing.
- Feel free to inquire about any other needs for CHOC clinical data for research.
Deidentified Multicenter (Structured) EMR Data – Cerner Real World Data (CRWD)
This database consists of tabular data from more than 100 health systems in the United States. It does not include clinical notes, medical images or other forms of unstructured data. However, it contains data across all care settings on patient encounters, medications, diagnoses, laboratory tests, orders, procedures, allergies, vital signs and other clinical events, as well as information on the deidentified health systems contributing to the database. The database includes more than:
- 100 million patients (of which 20 million are children less than 18 years)
- 1.5 billion encounters
- Tens of billions of clinical data points on these patients
Division specific databases, data sources and registries
The best approach for these data types is to reach out to us at [email protected] and we will connect with Research Coordinators of each division to request access to corresponding data required for research (if we do not already have access). There are many disease- or division-specific registries and external data sources. Here are some examples for Trauma and Psychology.
Trauma
- National Inpatient Sample 1988-2018 (PI must sign national HCUP DUA, state databases available but separate applications to state)
- National Emergency Department Sample 2006-2018 (PI must sign national HCUP DUA, state databases available but separate applications to state)
- National Trauma Data Bank 2007-2018 (National submission of protocol application/approval required by ACS COT)
- National Emergency Medical Services Information System 2017-2020 (National submission of protocol application/approval required by NHTSA/EMS)
- National Electronic Injury Surveillance System 2000-2020 (Publicly available by CPSC)
- National Violent Death Reporting System 2002-2018 (National submission of protocol application/approval required by CDC)
- Kids’ Inpatient Database 2012, 2016 (PI must sign national HCUP DUA, state databases available but separate applications to state)
Psychology
- National Survey on Drug Use and Health 2004-2011 (Publicly available by SAMHSA)
- Mental Health Client-Level Data 2013-2019 (Publicly available by SAMHSA)
- Treatment Episode Data Set: Admission 2001-2019 (Publicly available by SAMHSA)
- Treatment Episode Data Set: Discharges 2006-2019 (Publicly available by SAMHSA)