Running LLMs on HPC for Clinical Symptom Extraction
This project demonstrates a scalable workflow for generating patient-reported symptom lists for medical conditions using open Large Language Models (LLMs) on the Vlaams Supercomputer Centrum (VSC).
Research Objective
For a curated set of diseases (ICD codes) sourced from the MIMIC-IV clinical database, we systematically prompt open LLMs to produce exactly 10 common symptoms per condition. The outputs are saved in structured formats (JSON/Parquet) for downstream clinical data science.
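As a minimal sketch of the prompting step described above, the following Python shows one way to build a prompt from an ICD code and disease name, check that the model returned exactly 10 symptoms, and serialise the result as JSON. The function names (`build_prompt`, `parse_symptoms`, `to_record`), the prompt wording, and the record layout are illustrative assumptions, not the project's actual code. The model call itself is left out; a placeholder string stands in for the model output.

```python
import json
import re

N_SYMPTOMS = 10  # the workflow asks for exactly 10 symptoms per condition


def build_prompt(icd_code: str, disease_name: str) -> str:
    """Build a prompt that contains only the ICD code and the disease name."""
    # Hypothetical prompt wording; the project's real prompt may differ.
    return (
        f"List exactly {N_SYMPTOMS} common patient-reported symptoms of "
        f"{disease_name} (ICD code {icd_code}). "
        "Return one symptom per line, with no extra text."
    )


def parse_symptoms(raw_output: str) -> list:
    """Split raw model output into symptoms and enforce the expected count."""
    symptoms = []
    for line in raw_output.splitlines():
        # Drop leading list markers such as "1.", "2)", "-" or "*".
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if cleaned:
            symptoms.append(cleaned)
    if len(symptoms) != N_SYMPTOMS:
        raise ValueError(f"expected {N_SYMPTOMS} symptoms, got {len(symptoms)}")
    return symptoms


def to_record(icd_code: str, disease_name: str, raw_output: str) -> dict:
    """Assemble one structured record (assumed layout) for downstream use."""
    return {
        "icd_code": icd_code,
        "disease": disease_name,
        "symptoms": parse_symptoms(raw_output),
    }


# Placeholder text standing in for a model response (not real model output).
placeholder_output = "\n".join(f"{i}. symptom {i}" for i in range(1, N_SYMPTOMS + 1))
record = to_record("I10", "Essential (primary) hypertension", placeholder_output)
record_json = json.dumps(record, indent=2)
```

Rejecting outputs with the wrong number of symptoms at parse time keeps malformed responses out of the saved files; a retry or logging policy for such cases is left to the caller.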
The workflow illustrates how to:
- Set up and work efficiently on HPC (High-Performance Computing).
- Understand credits and resource allocation on VSC.
- Configure and prepare the compute environment.
- Download and manage open-source LLMs for clinical tasks.
- Orchestrate compute jobs using SLURM (including ready-to-use script templates).
- Process LLM prompts and collect results for downstream analysis.
- Apply best practices for running large-scale jobs on HPC.
Data Source: MIMIC-IV
MIMIC-IV (Medical Information Mart for Intensive Care IV) is a publicly available, de-identified database containing detailed records from ICU patients, including:
- Clinical notes and documentation
- Diagnostic codes (ICD-9/ICD-10)
- Laboratory and vital sign measurements
- Medication and procedure records
Data Access Requirements
To access the data, the following requirements must be met:
- Complete CITI training (or equivalent human subjects research training)
- Obtain institutional approval through PhysioNet
- Sign and comply with the official MIMIC-IV Data Use Agreement
Note:
In this workflow, only disease names and ICD codes were submitted to LLMs. No patient-level or identifying data was used or transmitted.
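One way to enforce this restriction in code is an explicit allow-list applied before any prompt is built. The sketch below is an illustration under stated assumptions, not the project's implementation: the helper `to_prompt_fields` is hypothetical, and the column names are assumed to follow the MIMIC-IV ICD dictionary table (`icd_code`, `icd_version`, `long_title`).

```python
# Assumed column names; adjust to the actual disease/ICD table in use.
ALLOWED_FIELDS = ("icd_code", "icd_version", "long_title")


def to_prompt_fields(row: dict) -> dict:
    """Return only the allow-listed fields; any other column is dropped."""
    missing = [field for field in ALLOWED_FIELDS if field not in row]
    if missing:
        raise KeyError(f"missing required fields: {missing}")
    return {field: row[field] for field in ALLOWED_FIELDS}


# Hypothetical input row; the extra key illustrates a patient-level
# identifier that must never reach the model.
example_row = {
    "icd_code": "I10",
    "icd_version": 10,
    "long_title": "Essential (primary) hypertension",
    "subject_id": 0,  # placeholder value, dropped by the allow-list
}
safe_row = to_prompt_fields(example_row)
```

An allow-list is used instead of a block-list so that any column added to the input later is excluded by default.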