Junwei Lu, Harvard University

Statistics Software

CombInference: Package for combinatorial inference

The CombInference package implements the methods proposed in the paper for selecting graph features from high-dimensional graphical models while controlling the false discovery rate (FDR). It applies methods such as the K-dimensional Persistent Homology Adaptive Selection (KHAN) algorithm to select significant topological features, with applications in fields such as biology, chemistry, neuroscience, and sociology.

Graph Feature Selection: The package enables users to detect significant substructures in graphs (such as cliques, loops, and hubs) and assess uncertainty using FDR control. It is particularly useful for applications like protein substructure analysis, brain network connectivity, and social network analysis.
Topological Feature Detection: Utilizing persistent homology, the package identifies topological features such as cycles and cliques that persist across multiple scales, providing insights into network connectivity and behavior under various conditions.
Adaptive Screening and Selection: The KHAN algorithm within the package adaptively selects important topological features across continuous filtration levels, ensuring both computational efficiency and strong statistical performance.
Application to Biological Data: This package can be applied to real-world datasets, like SARS-CoV-2 protein structure analysis, helping researchers identify potential target residues critical for protein binding processes.
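
As a rough illustration of what FDR control over candidate graph features means, the sketch below applies the classical Benjamini-Hochberg step-up procedure to a list of p-values, one per candidate substructure. This is not the KHAN algorithm and not the CombInference API; the function name and the p-values are hypothetical and chosen only for illustration.

```python
def benjamini_hochberg(p_values, q=0.1):
    """Return the indices of hypotheses rejected by the BH procedure at level q."""
    m = len(p_values)
    # Sort hypothesis indices by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls below the BH threshold q * rank / m.
        if p_values[idx] <= q * rank / m:
            k = rank
    # Reject the k hypotheses with the smallest p-values.
    return sorted(order[:k])


# Hypothetical p-values for five candidate substructures (e.g. cliques or loops).
p = [0.001, 0.20, 0.012, 0.74, 0.03]
print(benjamini_hochberg(p, q=0.1))  # prints [0, 2, 4]
```

Here the first, third, and fifth candidates are selected at a nominal FDR level of 0.1; the remaining two are not declared significant.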

This package is an ideal tool for researchers and data scientists working with high-dimensional graphical models and networks, allowing them to explore and quantify significant structural features with controlled false discovery rates.

RankInference: Package for ranking inference

The RankInference package implements nonparametric inferential methods for ranking large language models (LLMs) based on pairwise comparisons. It is built on the extended Bradley-Terry-Luce (BTL) model and provides a framework for testing hypotheses about LLM rankings and constructing confidence intervals for them across different contexts. This package facilitates reliable model comparisons, particularly in trust-sensitive fields such as medicine, law, and finance.

Nonparametric Ranking Inference: The package applies a nonparametric ranking model that adjusts the ranking based on the context of the prompt, ensuring accuracy across domain-specific scenarios.
Pairwise and Top-K Hypothesis Testing: Users can test pairwise rankings between LLMs or determine whether a model is among the top-K models in a given domain, allowing for enhanced decision-making in selecting reliable models.
Confidence Diagrams: The package introduces a novel concept of confidence diagrams based on Hasse diagrams, which provide global insights into the hierarchical relationships between LLMs, visualizing the performance range and ranking variability.
Robust Bootstrap Confidence Intervals: The package uses Gaussian multiplier bootstrap methods to compute confidence intervals for model rankings, ensuring reproducible and trustworthy results across applications.
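
To make the underlying model concrete, the following minimal sketch fits the basic (context-free) BTL model to a matrix of pairwise win counts using the standard minorization-maximization updates. It does not include the context-dependent extension, hypothesis tests, or bootstrap intervals described above, and it is not the RankInference API; the function name and the win counts are hypothetical.

```python
def fit_btl(wins, iters=200):
    """Estimate BTL scores from wins[i][j] = number of times model i beat model j.

    Assumes every model has at least one win and at least one comparison.
    """
    n = len(wins)
    scores = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Comparisons involving i, weighted by the current score sums.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (scores[i] + scores[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        # Normalize so the scores sum to one (the model is scale-invariant).
        norm = sum(updated)
        scores = [s / norm for s in updated]
    return scores


# Hypothetical results for three models, ten comparisons per pair.
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
scores = fit_btl(wins)
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
print(ranking)
```

Under the BTL model, the estimated probability that model i beats model j is scores[i] / (scores[i] + scores[j]), so the ranking follows directly from the fitted scores.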

This package is ideal for researchers, data scientists, and practitioners working with LLMs who need reliable methods for model comparison and ranking in dynamic, real-world scenarios, especially where trust and accuracy are paramount.

DIANE: Package for distributed non-convex optimization

The DIANE R package is an implementation of the methods proposed in the paper for learning feature embeddings from multi-institutional Electronic Health Records (EHR) data. The package is designed to overcome the challenges of non-convexity and communication efficiency in distributed datasets. DIANE utilizes a low-rank Ising graphical model with a non-convex bi-factored surrogate loss to estimate knowledge graphs and embeddings from large-scale binary data.

Efficient Distributed Learning: The package enables computationally efficient and privacy-preserving learning from EHR data distributed across multiple institutions, using a one-shot communication process to transfer gradients between sites.
Low-Rank Embedding Estimation: The DIANE algorithm applies a low-rank Ising model to estimate semantic embeddings for high-dimensional binary data. This is particularly useful for applications like patient phenotyping and clinical feature dependency analysis.
Federated Algorithm: DIANE protects data security by computing gradients locally at each healthcare database and transmitting only aggregated gradient information to a central master site, preserving patient privacy while achieving robust feature embedding estimation.
Applications to Biomedical Data: The package can be used to construct knowledge graphs, detect relationships between clinical features, and perform predictive modeling using large-scale EHR data, such as for neurodegenerative disease phenotyping.

This package is ideal for researchers and data scientists working with large-scale, distributed EHR data, offering powerful methods for feature extraction, knowledge graph construction, and risk prediction while maintaining computational and communication efficiency.

FederatedRL: Package for federated reinforcement learning

The FederatedRL R package implements distributed reinforcement learning algorithms designed for multi-institutional datasets, focusing on optimizing dynamic treatment regimes (DTRs) in healthcare settings. The package is tailored for environments with privacy constraints and distributed data sources, offering efficient federated learning solutions.

Federated Reinforcement Learning: FederatedRL supports offline reinforcement learning (RL) by allowing co-training across multiple institutions while preserving privacy. It applies the Markov Decision Process (MDP) model to personalize treatment strategies across different data sources without directly sharing patient-level data.
Communication-Efficient Algorithm: The package implements communication-efficient algorithms that reduce the need for frequent communication between institutions, facilitating privacy-preserving learning by exchanging summary statistics rather than raw data.
Multi-Institutional Datasets: FederatedRL is optimized for healthcare datasets like electronic health records (EHRs) distributed across different medical institutions. It handles heterogeneity in treatment policies and patient populations, ensuring robust and accurate policy optimization.
Applications in Healthcare: The package is ideal for clinical decision-making, such as optimizing sepsis management strategies or personalized treatments, where data from multiple sites can improve the quality of learned policies while maintaining privacy.
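
The idea of exchanging summary statistics rather than raw data can be sketched with a deliberately simple stand-in: a one-variable least-squares fit in which each site sends only a handful of sums to a coordinator. This is not the FederatedRL algorithm or its API, and it involves no reinforcement learning; the function names and data are hypothetical and serve only to show that a pooled estimate can be recovered without sharing patient-level records.

```python
def site_summary(xs, ys):
    """Computed locally at one site; only these five numbers leave the site."""
    return {
        "n": len(xs),
        "sx": sum(xs),
        "sy": sum(ys),
        "sxx": sum(x * x for x in xs),
        "sxy": sum(x * y for x, y in zip(xs, ys)),
    }


def pooled_fit(summaries):
    """Computed at the coordinator: least-squares slope and intercept from pooled sums."""
    n = sum(s["n"] for s in summaries)
    sx = sum(s["sx"] for s in summaries)
    sy = sum(s["sy"] for s in summaries)
    sxx = sum(s["sxx"] for s in summaries)
    sxy = sum(s["sxy"] for s in summaries)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data held at two institutions (generated from y = 2x + 1).
site_a = site_summary([0, 1, 2], [1, 3, 5])
site_b = site_summary([3, 4], [7, 9])
print(pooled_fit([site_a, site_b]))
```

The coordinator's estimate matches what a fit on the combined raw data would give, because the sums are sufficient statistics for this model.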

FederatedRL is designed for researchers and practitioners in healthcare and other domains requiring distributed reinforcement learning. It offers powerful tools for creating dynamic treatment regimes in environments where privacy and efficiency are essential.

Bioinformatics Software

ARCH: Package for integrating codified and narrative data from electronic health records

The ARCH R package implements methods for generating large-scale knowledge graphs (KG) by analyzing codified and narrative data from electronic health records (EHR). This package is designed to address challenges in EHR analysis by extracting meaningful relationships between clinical concepts, represented as codified data and unstructured narrative notes, using advanced representation learning techniques. It combines natural language processing (NLP) and codified data to create comprehensive, low-dimensional embeddings of clinical features with statistical certainty.

Knowledge Graph Construction: The ARCH package builds sparse, large-scale knowledge graphs from EHR data, capturing relationships between codified and narrative clinical features while ensuring statistical accuracy through uncertainty quantification.
Codified and NLP Data Integration: ARCH effectively integrates both codified and unstructured narrative data, allowing users to analyze vast amounts of EHR data for complex clinical tasks such as drug side-effect detection, disease phenotyping, and clinical feature dependency analysis.
Feature Representation and Embeddings: Using low-dimensional semantic embeddings, ARCH captures relationships between over 60,000 EHR concepts, improving predictive modeling and disease classification tasks by enhancing weakly supervised phenotyping algorithms.
Applications in EHR Analysis: The package enables users to identify known relationships between clinical features, predict adverse drug reactions, and phenotype diseases such as Alzheimer's. By leveraging both codified and narrative data, ARCH improves the accuracy of clinical predictions and patient subgroup identification.

The ARCH R package is particularly valuable for researchers and healthcare analysts working with large-scale, multi-institutional EHR datasets. It enhances data representation, feature extraction, and knowledge discovery, making it an essential tool for clinical decision-making and biomedical research.

HyperbolicEHR: Package for learning hyperbolic embeddings for electronic health records

The HyperbolicEHR package implements a multi-source hierarchical clustering algorithm specifically designed to process large-scale electronic health records (EHR) data. This package addresses the complexity and lack of organization often encountered in EHR datasets by leveraging hyperbolic geometry to construct efficient hierarchical structures. By integrating multiple sources of codified and narrative medical data, the HyperbolicEHR package enhances the analysis and interpretability of medical codes such as diagnoses, medications, and laboratory results.

Multi-Source Data Integration: HyperbolicEHR utilizes neural network-parameterized optimal transport to unify diverse medical data sources, such as PheCodes, RxNorm, and LOINC, as well as local codes. This integration ensures that heterogeneous data sources are harmonized, providing a more complete and robust clinical dataset for analysis.
Hierarchical Clustering with Hyperbolic Geometry: The package applies hyperbolic geometry, which is well-suited for representing hierarchical relationships, to model complex hierarchical structures within EHR data. This method captures the inherent relationships between medical codes more efficiently than traditional Euclidean space models.
Automated Hierarchy Generation: HyperbolicEHR automatically generates hierarchical structures from the integrated EHR data, facilitating improved organization, clustering, classification, and predictive modeling tasks. These hierarchies can reveal important co-occurrence patterns and clinical relationships that are critical for decision-making in healthcare.
Enhanced EHR Analysis: By applying the HyperbolicEHR package, healthcare providers and researchers can navigate the complexities of EHR data more efficiently, identifying trends and relationships across medical concepts. This leads to improved clinical decision-making, patient care, and research outcomes.
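
Why hyperbolic geometry suits hierarchies can be seen from the distance function itself. The sketch below computes distances in the Poincaré ball, one common model of hyperbolic space; whether HyperbolicEHR uses this particular model is an assumption, and the function name and coordinates are hypothetical.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    # Standard Poincare-ball formula: arcosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2))).
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# A root-like code at the origin and a child-like code halfway to the boundary.
root, child = (0.0, 0.0), (0.5, 0.0)
print(poincare_distance(root, child))

# Two leaf-like codes near the boundary are far apart even though their
# Euclidean distance is modest.
leaf_a, leaf_b = (0.9, 0.0), (0.0, 0.9)
print(poincare_distance(leaf_a, leaf_b))
```

Distances grow rapidly near the boundary of the ball, which leaves room for the exponentially many leaves of a tree, while general codes can sit near the origin.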

The HyperbolicEHR R package is ideal for researchers and healthcare analysts seeking to leverage hierarchical clustering methods for large, complex EHR datasets. It provides a comprehensive approach to organizing and analyzing clinical data, enabling more accurate and actionable insights in healthcare applications.

MedArena: Comparing different medical large language models

MedArena provides a comprehensive framework for evaluating the performance of large language models (LLMs) specifically in the medical domain. This package is designed to test models' understanding and interpretation of medical concepts, terminology, and contextual relationships that are critical for accurate decision-making in healthcare.

Medical Knowledge Evaluation: MedArena tests models on their ability to accurately interpret medical language, including diagnoses, treatment options, drug interactions, and medical procedures. It provides evaluation metrics tailored to the healthcare industry, ensuring models perform well in trust-sensitive environments such as clinical decision support systems.
Contextual Testing for Medical Scenarios: The package evaluates models' performance in real-world medical contexts, including diagnosis-based scenarios, patient history assessments, and complex multi-disease cases. This allows researchers and practitioners to assess model reliability in clinical decision-making.
Comparison Across Multiple LLMs: MedArena facilitates the comparison of different LLMs, enabling users to benchmark various models against each other for medical tasks such as symptom analysis, drug recommendations, and treatment protocols. This ensures that the most effective models are identified for healthcare applications.
Domain-Specific Metrics: MedArena offers specialized evaluation metrics tailored to healthcare use cases, such as precision, recall, and F1 scores focused on clinical relevance and patient safety. The package also supports model evaluation in multilingual medical contexts, accommodating diverse healthcare environments.
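
For reference, precision, recall, and F1 can be computed from binary labels as in the minimal sketch below. This shows only the standard definitions, not MedArena's clinically weighted variants or its API; the function name and the labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical gold labels and model answers for six yes/no clinical questions.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
print(precision_recall_f1(y_true, y_pred))
```

In a safety-sensitive setting, recall on the positive class (for example, missed drug interactions) may matter more than precision, which is one reason to report the three numbers separately.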

The MedArena R package is a valuable tool for healthcare researchers, data scientists, and clinicians aiming to evaluate and compare the performance of large language models in medical applications. By providing a robust testing environment and domain-specific metrics, it helps ensure that models are accurate, reliable, and effective for real-world healthcare use cases.

BONME: Package for integrating multi-institutional electronic health records

The BONME R package implements the Block-wise Overlapping Noisy Matrix Embedding (BONME) method, designed for integrating and analyzing multi-source electronic health record (EHR) data. This package provides a robust framework for handling block-wise missingness in data, ensuring efficient completion of missing submatrices and enabling comprehensive EHR analysis across multiple sources.

Multi-Source Data Integration: BONME addresses the challenge of combining multiple, heterogeneous EHR datasets, where data may have overlapping but non-identical entries. It integrates diverse medical data sources (e.g., from different hospitals or countries) to create a unified dataset for further analysis.
Block-wise Matrix Completion: The package excels in completing block-wise missing data patterns common in multi-institutional EHR datasets. It uses low-rank matrix completion techniques to estimate missing blocks, leveraging overlapping data between different sources to fill in the gaps efficiently.
Noise-Resilient Algorithms: BONME incorporates noise-handling mechanisms to ensure accurate matrix completion even in the presence of noisy or corrupted data entries, which is typical in large-scale EHR data collected from different healthcare systems.
EHR Data Integration: By integrating EHR data from different institutions and formats, BONME helps healthcare researchers and analysts construct comprehensive knowledge graphs or disease networks. It allows for the identification of relationships between medical concepts, improving disease phenotyping and prediction tasks.

The BONME R package is an invaluable tool for healthcare researchers, data scientists, and institutions that require robust methods to integrate and analyze large, multi-source EHR datasets. It helps uncover hidden relationships between clinical features and enhances the overall quality of healthcare analytics.