RuCCoD: Towards Automated ICD Coding in Russian

2502.21263

Aleksandr Nesterov et el.

🤗 2025-03-10

TL;DR
#

This study explores automating clinical coding in Russian, a language with limited biomedical resources. Existing research focuses on English datasets, leaving a gap in resources for other languages. Assigning ICD codes is crucial for medical documentation but poses challenges due to medical terminology, subjective interpretations, and changing classification standards. Errors in manual coding can lead to misdiagnosis and financial repercussions, highlighting the need for accurate and efficient coding methods.

To address these issues, this paper presents RuCCoD, a new dataset for ICD coding in Russian, labeled by medical professionals. It benchmarks models like BERT and LLaMA and examines transfer learning across domains. Experiments show that training with automatically predicted codes improves accuracy compared to manually annotated data. These findings highlight the potential for automating clinical coding in resource-limited languages and enhancing clinical efficiency and data accuracy.

Key Takeaways
#

Why does it matter?
#

This paper is important for researchers because it introduces RuCCoD, a novel dataset for ICD coding in Russian, addressing a gap in resources for low-resource languages. It also offers insights into the performance of various models, including LLMs, for automated clinical coding, and demonstrates the potential for improved accuracy in diagnosis prediction using AI-guided coding.

Visual Insights
#

🔼 Figure 1 showcases examples from the RuCCoD dataset, illustrating the annotation process for ICD codes. Each highlighted entity (in green) represents a medical concept extracted from a Russian electronic health record (EHR). Above each green entity is its corresponding ICD code, with an English translation provided in yellow. This visualization clarifies how annotators linked specific diagnostic information within EHR text to standard ICD codes.
read the caption
Figure 1: Examples of ICD code assignments by annotators: each entity in green is annotated with its ICD code above and its English translation (in yellow).

	Train	Test
# of records	3000	500
# of assigned entities	8769	1557
# of unique ICD codes	1455	548
Avg. # of codes per record	3	3

🔼 This table presents a summary of the statistics for the RuCCoD dataset used in the ICD coding task. It shows the number of records, the number of entities assigned ICD codes, the number of unique ICD codes present in the dataset, and the average number of codes per record, broken down for both the training and testing sets of the dataset. This provides a comprehensive overview of the size and complexity of the RuCCoD dataset.
read the caption
Table 1: Statistics for the RuCCoD training and testing sets on ICD coding of diagnosis.

In-depth insights
#

ICD Auto-Coding
#

ICD auto-coding aims to streamline medical record management by automatically assigning standardized codes to diagnoses. This is critical for billing, insurance, and research. The challenges include navigating complex medical terminology and maintaining accuracy amidst evolving classification standards. Automation can reduce human error and financial repercussions linked to manual coding errors, including financial losses. The process mirrors medical concept normalization, linking physician diagnoses to ICD codes via information extraction. Recent research focuses on neural networks to improve accuracy, but challenges remain, such as data scarcity in languages other than English, variability in clinical notes, and hierarchical ICD code structures. Advanced techniques like BERT, LLMs with PEFT, and RAG are being explored to tackle these hurdles. Further, automatic ICD coding requires integration of external medical knowledge sources like knowledge graphs which will facilitate LLMs generalization on rare codes.

RuCCoD Dataset
#

The RuCCoD dataset appears to be a novel resource for ICD coding in Russian, a low-resource language in the biomedical domain. It’s a significant contribution because most existing ICD coding datasets are in English, hindering research and development in other languages. The dataset’s creation aims to address the limitations of using UMLS for ICD coding in Russian, as UMLS may not fully capture the structured semantic requirements of ICD. RuCCoD’s manual annotation by medical professionals, focusing on ICD-10 CM concepts, ensures high-quality labels tailored to clinical practice. The dataset’s size, with over 3,000 records, and the annotation process emphasizing inter-annotator agreement, indicates a commitment to reliability. The use of RuCCoD as a benchmark for state-of-the-art models and its application in downstream tasks highlights its potential for advancing automated clinical coding in Russian and bridging the resource gap.

BERT vs LLAMA
#

BERT excels in nuanced language understanding through its bidirectional training, capturing contextual relationships effectively. It’s computationally intensive but yields high accuracy in various NLP tasks. LLaMA, a large language model, prioritizes efficient inference and generation. While it may not match BERT’s contextual depth, its speed and scalability make it suitable for real-time applications. The choice hinges on balancing accuracy needs with resource constraints.

EHR Improvement
#

EHR improvement through AI, as discussed in the paper, centers on enhancing data accuracy and utility. The study shows auto-labeling EHRs improves diagnostic model performance, surpassing doctor-assigned codes. This highlights AI’s potential to refine data entry, reduce errors, and standardize clinical info. Key is using AI to pre-train models on vast, auto-labeled datasets, aiding disease diagnosis. The research underscores AI’s role in advancing data quality for better healthcare outcomes and efficiency.

AI-Driven Assist.
#

AI-driven assistance in medical coding and diagnosis holds immense potential. By automating ICD coding, AI can reduce errors, enhance clinical efficiency, and improve data accuracy, especially in resource-limited languages. AI can provide independent opinions potentially beneficial in decision-making. Automating and providing better outcomes than manual data annotation by physicians proves the complexity of the ICD system for doctors and shows that AI can assist doctors in the diagnosis. AI helps address challenges like terminology navigation and classification standard updates. Furthermore, automating enables comprehensive analysis, thus aiding early disease identification.

More visual insights
#

More on tables

	Original Dataset	Linked Dataset	Manual Test Set
Number of records	865539	865539	494
Number of unique patients	164527	164527	450
Number of unique ICD codes	3546	3546	394
Avg. number of ICD codes per patient	$3\pm 2$	$5\pm 2$	$4\pm 2$
Avg. number of EHR records before current appointment	(15, 36, 73)	(15, 36, 73)	(17, 36, 77)
Avg. length of EHR records per one appointment	(77, 167, 316)	(77, 167, 316)	(86, 176, 320)
Patient’s age	(59, 67, 74)	(59, 67, 74)	(60, 67, 75)
Percentage of male patients	69	69	71

🔼 This table presents a statistical overview of the RuCCoD-DP dataset, which is used for diagnosis prediction. RuCCoD-DP is a collection of real-world electronic health records (EHRs) divided into training and testing sets. The table shows the number of records, unique patients, and unique ICD codes in each set. It also provides the average number of ICD codes per patient, the average number of EHR records per patient before the current appointment, the average length of EHR records per appointment, the average age of patients, and the percentage of male patients. The values in parentheses represent the 25th, 50th, and 75th percentiles, giving a clearer picture of the data distribution.
read the caption
Table 2: Statistics for the randomly split training and testing sets of RuCCoD-DP for diagnosis prediction. Values in brackets show the 25th, 50th, and 75th percentiles.

Supervised with various corpora for NER and EL
Model	Precision	Recall	F-score	Accuracy
BERT, NER: NEREL-BIO + RuCCoD, EL: RuCCoD	0.512	0.529	0.520	0.352
BERT, NER: RuCCoN + RuCCoD, EL: RuCCoD	0.471	0.543	0.504	0.337
BERT, NER: RuCCoD, EL: RuCCoD	0.510	0.542	0.525	0.356
LLM with RAG (zero-shot with dictionaries)
LLaMA3-8b-Instruct, NEREL-BIO	0.059	0.053	0.056	0.029
LLaMA3-8b-Instruct, RuCCoN	0.164	0.15	0.157	0.085
LLaMA3-8b-Instruct, ICD dict.	0.379	0.363	0.371	0.228
LLaMA3-8b-Instruct, ICD dict. + RuCCoD	0.465	0.451	0.458	0.297
LLM with tuning
Phi3_5_mini, ICD dict.	0.394	0.39	0.392	0.244
Phi3_5_mini, ICD dict. + RuCCoD	0.483	0.477	0.48	0.316
Phi3_5_mini, ICD dict. + BERGAMOT	0.454	0.448	0.451	0.291

🔼 This table presents the performance of different models on the entity-level code assignment task using the RuCCoD test set. The models evaluated include a BERT-based pipeline, LLMs with PEFT, and LLMs with RAG. The metrics reported are precision, recall, F1-score, and accuracy, allowing for a comprehensive comparison of model effectiveness. The ‘best’ performing model for each metric is highlighted in bold. Further experimental results using different LLMs, corpora, and terminologies are detailed in the Appendix (sections D, E, and F).
read the caption
Table 3: Entity-level code assignment metrics on RuCCoD’s test set. The best results are highlighted in bold. We also refer to Appx. D, E, F on more experiments with different LMs, corpora, and terminologies.

Task	Model or Approach	LR	# Epochs	BS	Scheduler	WD
NER	RuBioBERT	1e-5	20	32	Cosine Loshchilov and Hutter (2017)	0.01
EL	BERGAMOT+BioSyn	2e-5	20	32	Adam (Kingma and Ba, 2015)	0.01
LLM tuning	LoRA	5e-5	33	2	Linear with Warmup	0.01
ICD code prediction	Longformer	5e-5	2	4	Linear with Warmup	0.01

🔼 This table details the models and training hyperparameters used in the experiments. It lists the learning rate (LR), number of epochs, batch size (BS), scheduler type used for adjusting learning rate during training, and weight decay (WD) values for each model. The models are categorized by task (NER, EL, and ICD code prediction). Understanding these hyperparameters is crucial for replicating and interpreting the experimental results.
read the caption
Table 4: Models and training hyperparameters. LR stands for learning rate, BS for batch size, WD for weight decay

Model	Train Data	F1-score	Precision	Recall
RuBioBERT	RuCCoD train	0.756	0.75	0.77
RuBioBERT	BIO-NNE train	0.62	0.57	0.67
RuBioBERT	RuCCoD + BioNNE train	0.72	0.75	0.70
BINDER + RuBioBERT	RuCCoD train	0.71	0.72	0.71

🔼 This table presents the performance of different models on the Named Entity Recognition (NER) task using the RuCCoD dataset. It shows the F1-score, precision, and recall achieved by various models trained on different combinations of training data. The models evaluated include RuBioBERT (with and without BINDER), highlighting their effectiveness in extracting relevant entities from clinical texts in the Russian language.
read the caption
Table 5: Evaluation results for NER task on RuCCoD dataset.

Zero-shot evaluation, strict
Train set	SapBERT		CODER		BERGAMOT
Train set	@1	@5	@1	@5	@1	@5
ICD dict	0.3327	0.5712	0.2631	0.4687	0.3495	0.6170
ICD dict+UMLS synonyms	0.3546	0.5197	0.3237	0.4765	0.3559	0.5487
Supervised evaluation, strict
ICD	0.6132	0.8182	0.6202	0.8169	0.6415	0.8459
ICD+UMLS sumonyms	0.5326	0.7382	0.5358	0.7318	0.4984	0.7253
RuCCoN	0.3591	0.5345	0.3598	0.5732	0.3643	0.5313
RuCCoN+ICD	0.3952	0.5732	0.3888	0.6570	0.3817	0.5983
NEREL-BIO	0.3443	0.4913	0.3378	0.5274	0.3353	0.5113
NEREL-BIO+ICD	0.3804	0.5596	0.3804	0.6325	0.3598	0.5525
Zero-shot evaluation, relaxed
ICD dict	0.4842	0.6886	0.3752	0.6190	0.5035	0.7286
ICD dict+UMLS synonyms	0.5551	0.6867	0.5055	0.6293	0.5603	0.7073
Supervised evaluation, relaxed
ICD	0.7763	0.8839	0.7872	0.8743	0.7917	0.8943
ICD+UMLS sumonyms	0.7788	0.8616	0.7714	0.8860	0.7449	0.8738
RuCCoN	0.5235	0.6531	0.5429	0.7208	0.5132	0.6564
RuCCoN+ICD	0.5493	0.6602	0.5770	0.7485	0.5571	0.6873
NEREL-BIO	0.4803	0.6067	0.4958	0.6634	0.4778	0.6170
NEREL-BIO+ICD	0.5455	0.6447	0.5474	0.7292	0.5384	0.6505

🔼 This table presents the results of experiments evaluating the effectiveness of transfer learning in biomedical entity linking. Four different models (SapBERT, CODER, BERGAMOT) were trained using various combinations of datasets: RuCCoD (Russian ICD Coding Dataset), RuCCON, and NEREL-BIO. Each model was tested with two evaluation methods: ‘strict’ (exact match between predicted and ground truth codes) and ‘relaxed’ (truncated codes to higher level). The results show the precision, recall, F1-score, and accuracy for each model and dataset combination under both strict and relaxed evaluation schemes. One data setting, ICD+UMLS synonyms, involves enriching the training data with disease name synonyms from the UMLS knowledge base to assess the impact of vocabulary expansion.
read the caption
Table 6: Cross-domain transfer results for biomedical linking models. Evaluation results for linking models trained on RuCOD, RuCCoN, NEREL-BIO as well as their union. ICD+UMLS synonyms stands for ICD train set with the vocabulary enriched with ICD disease name synonyms from the UMLS knowledge base. The best results for each model and set-up are highlighted in bold.

Model	Precision	Recall	F-score	Accuracy
NER
Llama3-Med42-8B, RuCCoD	0.642	0.642	0.642	0.473
Qwen2.5-7B-Instruct, RuCCoD	0.567	0.562	0.565	0.393
Phi3_5_mini, RuCCoD	0.632	0.623	0.627	0.457
Mistral-Nemo, RuCCoD	0.631	0.598	0.614	0.443
NER+Linking
Llama3-Med42-8B, ICD dict.	0.149	0.149	0.149	0.08
Llama3-Med42-8B, ICD dict. + RuCCoD	0.299	0.299	0.299	0.176
Llama3-Med42-8B, ICD dict. + BERGAMOT	0.286	0.286	0.286	0.167
Qwen2.5-7B-Instruct, ICD dict.	0.188	0.186	0.187	0.103
Qwen2.5-7B-Instruct, ICD dict. + RuCCoD	0.281	0.279	0.28	0.163
Qwen2.5-7B-Instruct, ICD dict. + BERGAMOT	0.2	0.198	0.199	0.11
Phi3_5_mini, ICD dict.	0.272	0.268	0.27	0.156
Phi3_5_mini, ICD dict. + RuCCoD	0.335	0.33	0.333	0.199
Phi3_5_mini, ICD dict. + BERGAMOT	0.322	0.317	0.32	0.19
Mistral-Nemo, ICD dict.	0.231	0.219	0.224	0.126
Mistral-Nemo, ICD dict. + RuCCoD	0.303	0.287	0.295	0.173
Mistral-Nemo, ICD dict. + BERGAMOT	0.267	0.253	0.26	0.149
Code assignment
Llama3-Med42-8B, ICD dict.	0.229	0.231	0.23	0.13
Llama3-Med42-8B, ICD dict. + RuCCoD	0.434	0.435	0.435	0.278
Llama3-Med42-8B, ICD dict. + BERGAMOT	0.403	0.405	0.404	0.253
Qwen2.5-7B-Instruct, ICD dict.	0.296	0.295	0.295	0.173
Qwen2.5-7B-Instruct, ICD dict. + RuCCoD	0.456	0.449	0.452	0.292
Qwen2.5-7B-Instruct, ICD dict. + BERGAMOT	0.305	0.303	0.304	0.179
Phi3_5_mini, ICD dict.	0.394	0.39	0.392	0.244
Phi3_5_mini, ICD dict. + RuCCoD	0.483	0.477	0.48	0.316
Phi3_5_mini, ICD dict. + BERGAMOT	0.454	0.448	0.451	0.291
Mistral-Nemo, ICD dict.	0.326	0.311	0.319	0.189
Mistral-Nemo, ICD dict. + RuCCoD	0.458	0.435	0.446	0.287
Mistral-Nemo, ICD dict. + BERGAMOT	0.394	0.372	0.383	0.237

🔼 This table presents the performance of several fine-tuned large language models (LLMs) on the RuCCoD dataset for the task of ICD (International Classification of Diseases) coding. The models were evaluated using micro-averaged precision, recall, F1-score, and accuracy. The table shows the results broken down by the model used and the different corpora employed during training (ICD dict., ICD dict.+RuCCoD, ICD dict.+BERGAMOT). The best-performing model and corpus combination for each metric are highlighted in bold, allowing for a direct comparison across various LLMs and training data configurations.
read the caption
Table 7: ICD coding results for finetuned LLMs on RuCCoD. The best results are highlighted in bold.

NER
Model	Precision	Recall	F-score	Accuracy
BioBERT, Biosyn, RuCCoD	0.649	0.655	0.653	0.485
BioBERT, RuCCoD	0.721	0.769	0.744	0.592
BioBERT, NEREL-BIO	0.588	0.675	0.628	0.458
BioBERT, NEREL-BIO, RuCCoD	0.689	0.713	0.701	0.54
BioBERT, RuCCoN	0.637	0.613	0.625	0.454
BioBERT, RuCCoN + RuCCoD	0.609	0.709	0.655	0.487
NER+Linking
BioBERT, Biosyn, RuCCoD	0.392	0.396	0.394	0.245
BioBERT, RuCCoD	0.427	0.455	0.441	0.283
BioBERT, NEREL-BIO	0.353	0.406	0.377	0.233
BioBERT, NEREL-BIO, RuCCoD	0.406	0.42	0.413	0.26
BioBERT, RuCCoN	0.387	0.372	0.379	0.234
BioBERT, RuCCoN + RuCCoD	0.351	0.409	0.378	0.233
Code assignment
BioBERT, Biosyn, RuCCoD	0.507	0.508	0.507	0.340
BioBERT, RuCCoD	0.51	0.542	0.525	0.356
BioBERT, NEREL-BIO	0.466	0.531	0.497	0.33
BioBERT, NEREL-BIO, RuCCoD	0.512	0.529	0.52	0.352
BioBERT, RuCCoN	0.508	0.485	0.496	0.33
BioBERT, RuCCoN + RuCCoD	0.471	0.543	0.504	0.337

🔼 This table presents the performance of a BERT-based information extraction (IE) pipeline on the RuCCoD corpus for three entity-level tasks: Named Entity Recognition (NER), NER + Entity Linking, and ICD Code Assignment. The pipeline uses various combinations of pre-trained models and training corpora (RuCCoD, NEREL-BIO, RuCCON, and BioSyn) for NER and entity linking. The results are shown as precision, recall, F1-score, and accuracy metrics, highlighting the best-performing configurations for each task. This table demonstrates the impact of different model choices and training data on the accuracy of extracting and linking disease-related entities to ICD codes.
read the caption
Table 8: Evaluation results for entity-level tasks for BERT-based IE pipeline on RuCCoD corpus. The best results are highlighted in bold.

NER: ICD dict.
Model	Precision	Recall	F-score	Accuracy
Llama3.1:8b-instruct	0.208	0.088	0.124	0.066
Llama3-Med42-8B	0.202	0.084	0.118	0.063
Phi-3.5-mini-instruct	0.211	0.093	0.129	0.069
Mistral-Nemo-Instruct-2407	0.198	0.072	0.105	0.055
Qwen2.5-7B-Instruct	0.206	0.087	0.122	0.065
NER: ICD dict. + RuCCoD
Llama3.1:8b-instruct	0.581	0.456	0.511	0.343
Llama3-Med42-8B	0.556	0.441	0.492	0.326
Phi-3.5-mini-instruct	0.543	0.450	0.492	0.326
Mistral-Nemo-Instruct-2407	0.541	0.372	0.441	0.283
Qwen2.5-7B-Instruct	0.566	0.440	0.495	0.329
NER+Linking: ICD dict.
Llama3.1:8b-instruct	0.071	0.067	0.069	0.036
Llama3-Med42-8B	0.058	0.063	0.060	0.031
Phi-3.5-mini-instruct	0.062	0.069	0.065	0.034
Mistral-Nemo-Instruct-2407	0.066	0.056	0.060	0.031
Qwen2.5-7B-Instruct	0.065	0.065	0.065	0.033
NER+Linking: ICD dict. + RuCCoD
Llama3.1:8b-instruct	0.272	0.264	0.268	0.155
Llama3-Med42-8B	0.235	0.261	0.247	0.141
Phi-3.5-mini-instruct	0.228	0.257	0.242	0.137
Mistral-Nemo-Instruct-2407	0.247	0.215	0.230	0.130
Qwen2.5-7B-Instruct	0.244	0.246	0.245	0.140
Code assignment: ICD dict.
Llama3.1:8b-instruct	0.379	0.363	0.371	0.228
Llama3-Med42-8B	0.310	0.345	0.327	0.195
Phi-3.5-mini-instruct	0.260	0.294	0.276	0.160
Mistral-Nemo-Instruct-2407	0.413	0.360	0.385	0.238
Qwen2.5-7B-Instruct	0.401	0.411	0.406	0.255
Code assignment: ICD dict. + RuCCoD
Llama3.1:8b-instruct	0.465	0.451	0.458	0.297
Llama3-Med42-8B	0.434	0.483	0.457	0.296
Phi-3.5-mini-instruct	0.409	0.458	0.432	0.276
Mistral-Nemo-Instruct-2407	0.462	0.401	0.429	0.273
Qwen2.5-7B-Instruct	0.461	0.465	0.463	0.301

🔼 This table presents the performance of different Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) on three tasks related to ICD coding: Named Entity Recognition (NER), linking entities to ICD codes, and end-to-end entity linking. It shows precision, recall, F1-score, and accuracy for each LLM on each task, using different knowledge sources (ICD dictionary, RuCCOD dataset) for the RAG component.
read the caption
Table 9: Evaluation results for NER, Code assignment, and end-to-end entity linking task on RuCCoD for LLM+RAG pipeline.

NER: NEREL-BIO
Model	Precision	Recall	F-score	Accuracy
Llama3.1:8b-instruct	0.100	0.042	0.059	0.030
Llama3-Med42-8B	0.104	0.043	0.060	0.031
Phi-3.5-mini-instruct	0.098	0.043	0.059	0.031
Mistral-Nemo-Instruct-2407	0.115	0.044	0.063	0.033
Qwen2.5-7B-Instruct	0.099	0.043	0.060	0.031
NER: RuCCoN
Llama3.1:8b-instruct	0.188	0.088	0.120	0.064
Llama3-Med42-8B	0.174	0.079	0.108	0.057
Phi-3.5-mini-instruct	0.172	0.085	0.114	0.060
Mistral-Nemo-Instruct-2407	0.197	0.082	0.116	0.061
Qwen2.5-7B-Instruct	0.185	0.091	0.122	0.065
NER+Linking: NEREL-BIO
Llama3.1:8b-instruct	0.023	0.020	0.021	0.011
Llama3-Med42-8B	0.018	0.019	0.018	0.009
Phi-3.5-mini-instruct	0.019	0.020	0.019	0.010
Mistral-Nemo-Instruct-2407	0.025	0.020	0.022	0.011
Qwen2.5-7B-Instruct	0.021	0.020	0.020	0.010
NER+Linking: RuCCoN
Llama3.1:8b-instruct	0.050	0.046	0.048	0.025
Llama3-Med42-8B	0.042	0.044	0.043	0.022
Phi-3.5-mini-instruct	0.038	0.041	0.040	0.020
Mistral-Nemo-Instruct-2407	0.053	0.044	0.048	0.025
Qwen2.5-7B-Instruct	0.048	0.046	0.047	0.024
Code assignment: NEREL-BIO
Llama3.1:8b-instruct	0.059	0.053	0.056	0.029
Llama3-Med42-8B	0.045	0.047	0.046	0.024
Phi-3.5-mini-instruct	0.046	0.049	0.047	0.024
Mistral-Nemo-Instruct-2407	0.062	0.051	0.056	0.029
Qwen2.5-7B-Instruct	0.058	0.056	0.057	0.029
Code assignment: RuCCoN
Llama3.1:8b-instruct	0.164	0.150	0.157	0.085
Llama3-Med42-8B	0.125	0.131	0.128	0.068
Phi-3.5-mini-instruct	0.125	0.134	0.129	0.069
Mistral-Nemo-Instruct-2407	0.156	0.129	0.141	0.076
Qwen2.5-7B-Instruct	0.156	0.152	0.154	0.084

🔼 This table presents the performance of different Large Language Models (LLMs) with Retrieval Augmented Generation (RAG) on three tasks related to ICD coding: Named Entity Recognition (NER), NER+linking, and code assignment. The evaluation was performed on the RuCCoD dataset, using either NEREL-BIO or RuCCoN as vectorstores. The results show precision, recall, F-score, and accuracy for each model and task. This helps assess the efficacy of different LLMs and approaches in automating various stages of the ICD coding process.
read the caption
Table 10: Evaluation results for NER, Code assignment, and end-to-end entity linking task on RuCCoD for LLM+RAG pipeline using NEREL-BIO and RuCCoN for vectorstore.

NER: ICD dict.
Model	Precision	Recall	F-score	Accuracy
Llama3.1:8b-instruct	0.208	0.088	0.124	0.066
Llama3-Med42-8B	0.202	0.084	0.118	0.063
Phi-3.5-mini-instruct	0.211	0.093	0.129	0.069
Mistral-Nemo-Instruct-2407	0.198	0.072	0.105	0.055
Qwen2.5-7B-Instruct	0.206	0.087	0.122	0.065
NER: ICD dict. + RuCCoD
Llama3.1:8b-instruct	0.581	0.456	0.511	0.343
Llama3-Med42-8B	0.556	0.441	0.492	0.326
Phi-3.5-mini-instruct	0.543	0.450	0.492	0.326
Mistral-Nemo-Instruct-2407	0.541	0.372	0.441	0.283
Qwen2.5-7B-Instruct	0.566	0.440	0.495	0.329
NER+Linking: ICD dict.
Llama3.1:8b-instruct	0.095	0.088	0.091	0.048
Llama3-Med42-8B	0.077	0.083	0.080	0.042
Phi-3.5-mini-instruct	0.083	0.092	0.087	0.046
Mistral-Nemo-Instruct-2407	0.083	0.070	0.076	0.040
Qwen2.5-7B-Instruct	0.087	0.086	0.087	0.045
NER+Linking: ICD dict. + RuCCoD
Llama3.1:8b-instruct	0.378	0.362	0.369	0.227
Llama3-Med42-8B	0.324	0.354	0.338	0.203
Phi-3.5-mini-instruct	0.323	0.357	0.339	0.204
Mistral-Nemo-Instruct-2407	0.342	0.295	0.317	0.188
Qwen2.5-7B-Instruct	0.343	0.340	0.342	0.206
Code assignment: ICD dict.
Llama3.1:8b-instruct	0.575	0.561	0.568	0.396
Llama3-Med42-8B	0.523	0.594	0.556	0.385
Phi-3.5-mini-instruct	0.437	0.510	0.471	0.308
Mistral-Nemo-Instruct-2407	0.598	0.533	0.564	0.392
Qwen2.5-7B-Instruct	0.595	0.618	0.607	0.435
Code assignment: ICD dict. + RuCCoD
Llama3.1:8b-instruct	0.701	0.684	0.692	0.529
Llama3-Med42-8B	0.644	0.720	0.680	0.515
Phi-3.5-mini-instruct	0.627	0.703	0.663	0.496
Mistral-Nemo-Instruct-2407	0.691	0.605	0.645	0.476
Qwen2.5-7B-Instruct	0.700	0.704	0.702	0.541

🔼 This table presents the results of experiments evaluating the performance of different Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) on the RuCCoD dataset. The evaluation used a relaxed scoring approach, focusing on the NER (Named Entity Recognition), code assignment, and end-to-end entity linking tasks. The models are compared based on their precision, recall, F1-score, and accuracy, with results shown for various configurations, including the use of different dictionaries and datasets in the RAG pipeline. The results are analyzed to understand the effectiveness of each model in the relaxed setting.
read the caption
Table 11: Relaxed evaluation results for NER, Code assignment, and end-to-end entity linking task on RuCCoD for LLM+RAG pipeline.

NER: NEREL-BIO
Model	Precision	Recall	F-score	Accuracy
Llama3.1:8b-instruct-fp16	0.100	0.042	0.059	0.030
Llama3-Med42-8B	0.104	0.043	0.060	0.031
Phi-3.5-mini-instruct	0.098	0.043	0.059	0.031
Mistral-Nemo-Instruct-2407	0.115	0.044	0.063	0.033
Qwen2.5-7B-Instruct	0.099	0.043	0.060	0.031
NER: RuCCoN
Llama3.1:8b-instruct-fp16	0.188	0.088	0.120	0.064
Llama3-Med42-8B	0.174	0.079	0.108	0.057
Phi-3.5-mini-instruct	0.172	0.085	0.114	0.060
Mistral-Nemo-Instruct-2407	0.197	0.082	0.116	0.061
Qwen2.5-7B-Instruct	0.185	0.091	0.122	0.065
NER+Linking: NEREL-BIO
Llama3.1:8b-instruct	0.033	0.029	0.031	0.016
Llama3-Med42-8B	0.024	0.025	0.025	0.013
Phi-3.5-mini-instruct	0.026	0.028	0.027	0.014
Mistral-Nemo-Instruct-2407	0.033	0.027	0.030	0.015
Qwen2.5-7B-Instruct	0.030	0.029	0.030	0.015
NER+Linking: RuCCoN
Llama3.1:8b-instruct	0.076	0.069	0.072	0.038
Llama3-Med42-8B	0.061	0.063	0.062	0.032
Phi-3.5-mini-instruct	0.060	0.064	0.062	0.032
Mistral-Nemo-Instruct-2407	0.076	0.062	0.068	0.035
Qwen2.5-7B-Instruct	0.073	0.070	0.072	0.037
Code assignment: NEREL-BIO
Llama3.1:8b-instruct	0.114	0.107	0.110	0.058
Llama3-Med42-8B	0.088	0.096	0.092	0.048
Phi-3.5-mini-instruct	0.098	0.110	0.104	0.055
Mistral-Nemo-Instruct-2407	0.121	0.105	0.112	0.059
Qwen2.5-7B-Instruct	0.125	0.126	0.125	0.067
Code assignment: RuCCoN
Llama3.1:8b-instruct	0.295	0.282	0.288	0.168
Llama3-Med42-8B	0.254	0.275	0.264	0.152
Phi-3.5-mini-instruct	0.248	0.273	0.260	0.149
Mistral-Nemo-Instruct-2407	0.284	0.244	0.263	0.151
Qwen2.5-7B-Instruct	0.292	0.294	0.293	0.172

🔼 Table 12 presents the relaxed evaluation metrics for three tasks: Named Entity Recognition (NER), ICD code assignment, and end-to-end entity linking. The evaluation is performed on the RuCCoD dataset using the LLM+RAG pipeline. Specifically, the results showcase the performance of several large language models (LLMs) in these tasks when using either the NEREL-BIO or RuCCoN dataset as a vector store for retrieval augmented generation (RAG). The metrics presented include precision, recall, F-score, and accuracy, offering a comprehensive view of the models’ performance under relaxed evaluation conditions.
read the caption
Table 12: Relaxed evaluation results for NER, Code assignment, and end-to-end entity linking task on RuCCoD for LLM+RAG pipeline using NEREL-BIO and RuCCoN for vectorstore.

TL;DR#

Key Takeaways#

Why does it matter?#

Visual Insights#

In-depth insights#

ICD Auto-Coding#

RuCCoD Dataset#

BERT vs LLAMA#

EHR Improvement#

AI-Driven Assist.#

More visual insights#

Full paper#