Running LLMs on HPC for Clinical Symptom Extraction

This project demonstrates a scalable workflow for generating patient-reported symptom lists for medical conditions using open Large Language Models (LLMs) on the Vlaams Supercomputer Centrum (VSC).

Research Objective

For a curated set of diseases (ICD codes) sourced from the MIMIC-IV clinical database, we systematically prompt open LLMs to produce exactly 10 common symptoms per condition. The outputs are saved in structured formats (JSON/Parquet) for downstream clinical data science.
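The per-condition step can be sketched as below. This is a minimal illustration, not the project's actual code: the prompt wording, the function names (`build_prompt`, `parse_symptoms`, `to_record`), and the assumption that the model replies with a JSON array are all hypothetical.

```python
import json

N_SYMPTOMS = 10  # the workflow asks for exactly 10 symptoms per condition


def build_prompt(icd_code: str, disease_name: str) -> str:
    """Build a prompt from the ICD code and disease name only (hypothetical wording)."""
    return (
        f"List exactly {N_SYMPTOMS} common patient-reported symptoms of "
        f"{disease_name} (ICD code {icd_code}). "
        "Answer with a JSON array of strings and nothing else."
    )


def parse_symptoms(raw_reply: str) -> list[str]:
    """Parse a model reply and enforce the exactly-10 constraint."""
    symptoms = json.loads(raw_reply)
    if not isinstance(symptoms, list) or not all(isinstance(s, str) for s in symptoms):
        raise ValueError("reply is not a JSON array of strings")
    # Drop empty entries and surrounding whitespace before counting.
    cleaned = [s.strip() for s in symptoms if s.strip()]
    if len(cleaned) != N_SYMPTOMS:
        raise ValueError(f"expected {N_SYMPTOMS} symptoms, got {len(cleaned)}")
    return cleaned


def to_record(icd_code: str, disease_name: str, raw_reply: str) -> dict:
    """Combine the inputs and the validated symptoms into one JSON-serializable record."""
    return {
        "icd_code": icd_code,
        "disease": disease_name,
        "symptoms": parse_symptoms(raw_reply),
    }
```

Rejecting replies that do not contain exactly 10 items keeps malformed generations out of the saved output. Each record returned by `to_record` can be passed to `json.dumps` directly; writing Parquet would need an additional library.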

The workflow illustrates how to:

Set up and work efficiently on HPC (High-Performance Computing).
Understand credits and resource allocation on VSC.
Configure and prepare the compute environment.
Download and manage open-source LLMs for clinical tasks.
Orchestrate compute jobs using SLURM (including ready-to-use script templates).
Process LLM prompts and collect results for downstream analysis.
Apply best practices for running large-scale jobs on HPC.
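As an illustration of the job-orchestration step, the sketch below renders a SLURM batch script from Python. It is not one of the project's templates. The `#SBATCH` options are standard SLURM directives, but the helper name `render_sbatch` and every value passed to it (account, partition, walltime, command) are placeholders that must be replaced with the ones valid for your VSC cluster and credit account.

```python
def render_sbatch(job_name: str, account: str, partition: str,
                  gpus: int, walltime: str, command: str) -> str:
    """Return the text of a single-node SLURM batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --account={account}",      # account the job is charged to
        f"#SBATCH --partition={partition}",
        "#SBATCH --nodes=1",
        "#SBATCH --ntasks=1",
        f"#SBATCH --gpus-per-node={gpus}",
        f"#SBATCH --time={walltime}",        # HH:MM:SS walltime limit
        "#SBATCH --output=%x-%j.out",        # %x = job name, %j = job ID
        "",
        "set -euo pipefail",                 # stop the job on the first error
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    print(render_sbatch(
        job_name="symptom-extraction",
        account="my_credit_account",
        partition="gpu",
        gpus=1,
        walltime="02:00:00",
        command="python run_prompts.py",
    ))
```

The printed text can be saved to a file and submitted with `sbatch`. Generating the script from one function keeps the resource requests in a single place when many jobs are launched.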

Data Source: MIMIC-IV

MIMIC-IV (Medical Information Mart for Intensive Care IV) is a publicly available, de-identified database containing detailed records from ICU patients, including:
- Clinical notes and documentation
- Diagnostic codes (ICD-9/ICD-10)
- Laboratory and vital sign measurements
- Medication and procedure records

Data Access Requirements

To access the data, the following requirements must be met:

Complete CITI training (or equivalent human subjects research training)
Obtain credentialed access through PhysioNet
Sign and comply with the official MIMIC-IV Data Use Agreement

Note:
In this workflow, only disease names and ICD codes were submitted to LLMs. No patient-level or identifying data was used or transmitted.