Objective and Contribution

Proposed a novel multi-modal machine learning (ML) model for predicting ICD-10 codes. The model is an ensemble that combines three different ML models, each developed to handle one of three data types: unstructured, semi-structured, and structured data. Our model outperformed all the baseline models and offers a level of interpretability not far from that of physicians.

The Clinical ICD Landscape

ICD is a medical classification list of codes for diagnoses and procedures. These codes are widely used for reimbursement and for the storage and retrieval of diagnostic information. Assigning ICD codes is time-consuming, as clinical coders must extract key information from Electronic Medical Records (EMRs) and assign the correct codes. Coding errors are common and can be costly. EMRs usually store data in three different modalities:

  1. Unstructured text. Nursing notes, lab reports, test reports, and discharge summaries

  2. Semi-structured text. Lists of structured phrases, written by physicians, that describe the diagnoses

  3. Structured tabular data. Contains prescriptions and clinical measurements such as numerical test results


The evaluation dataset is the Medical Information Mart for Intensive Care III (MIMIC-III), which contains 44,659 admissions in total. The diagnostic codes are mapped from the original ICD-9 to ICD-10 (one-to-one). The dataset covers the 32 ICD codes that rank among the top 50 by frequency in both MIMIC-III and a national hospital in the US. There are 6 tables:

  1. Admissions. All information on patient admissions

  2. Labevents. All laboratory measurements

  3. Prescriptions. Medications related to order entries

  4. Microbiologyevents. Microbiology information

  5. Chartevents. All charted data of the patients’ routine signs and other related health information

  6. Noteevents. All the notes including nursing and physician notes, discharge summaries, and echocardiography reports


The figure above shows our ensemble-based model, which combines the following three ML models:

  1. Text-CNN. Used for multi-label classification on unstructured text

  2. Char-CNN + BiLSTM. Used to analyse the semantic similarity between diagnosis descriptions and ICD code descriptions

  3. Decision Tree. Transforms structured numeric features into binary features to classify ICD codes

During inference, our model combines the three ML models' predictions, and key evidence is retrieved from the raw data for examination to improve explainability.


For the unstructured data, we use the Noteevents table. Processing involves two steps:

  1. Data pre-processing

  2. Text-CNN classification

Data pre-processing is a simple cleaning and standardisation of the input for step 2. In step 2, we use Text-CNN for multi-label classification. We also modified Text-CNN into Text-TFIDF-CNN, shown below. This model adds TFIDF vectors of keywords and phrases extracted from unstructured clinical guidelines, mimicking the real-world situation where guidelines are often used to guide diagnoses. The additional TFIDF inputs are fed into the fully connected layers of Text-CNN.
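As a rough sketch of the guideline-derived features (using scikit-learn's TfidfVectorizer; the guideline snippets below are invented for illustration, not taken from the paper):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical guideline snippets; the real model extracts keywords and
# phrases from full clinical guideline documents.
guideline_texts = [
    "chest pain shortness of breath troponin elevation",
    "fasting glucose hba1c polyuria polydipsia",
]

# Fit the TFIDF vocabulary on guideline keywords/phrases (uni- and bigrams)
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
vectorizer.fit(guideline_texts)

# At classification time, a clinical note is projected onto the guideline
# vocabulary; this vector is what gets concatenated with the Text-CNN
# features at the fully connected layers.
note = ["patient reports chest pain and shortness of breath"]
tfidf_vec = vectorizer.transform(note)
```

The sparse `tfidf_vec` is dense-ified and concatenated with the convolutional features before the final classification layers.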

The dataset exhibits class imbalance, which could reduce the performance of our ML models, so we use Label Smoothing Regularisation (LSR), which prevents the classifier from becoming too certain about labels during training.
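A minimal numpy sketch of the standard LSR target transformation (the smoothing factor `eps=0.1` is an assumed value, not one reported in the paper):

```python
import numpy as np

def smooth_labels(y, eps=0.1):
    """Label Smoothing Regularisation: soften hard 0/1 targets so the
    classifier is not pushed to be fully certain about any label."""
    k = y.shape[-1]                      # number of classes
    return y * (1.0 - eps) + eps / k

y = np.array([[1.0, 0.0, 0.0, 0.0]])
print(smooth_labels(y))  # [[0.925 0.025 0.025 0.025]]
```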


Clinical coders often extract key phrases and sentences from clinical notes and assign them the appropriate ICD codes. Most often, there is close semantic similarity between the code descriptions and the diagnosis descriptions. We formulate this process as a Diagnosis-based Ranking (DR) problem in which all code descriptions are represented in a low-dimensional dense vector space. During inference, the diagnosis descriptions are mapped into the same vector space, and ICD codes are ranked by the distance between the diagnosis vector and each code vector. For this we use the architecture shown below.

We used both a character-level CNN and pre-trained word embeddings to encode the diagnoses and ICD code descriptions into the same space. The word embeddings are pre-trained on PubMed, which contains over 550,000 biomedical papers. The encoded embeddings are then fed into a BiLSTM and a max-pooling layer to generate the final feature vector.
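A PyTorch sketch of this encoder, with all dimensions invented for illustration (the paper's exact sizes are not given here; in practice the word embedding table would be initialised from the PubMed-pretrained vectors):

```python
import torch
import torch.nn as nn

class DescriptionEncoder(nn.Module):
    """Char-CNN + word embeddings -> BiLSTM -> max-pool feature vector."""

    def __init__(self, n_chars=64, char_dim=16, word_vocab=1000,
                 word_dim=200, hidden=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # Character-level CNN: convolve over each word's characters,
        # then max-pool to one vector per word.
        self.char_cnn = nn.Conv1d(char_dim, char_dim, kernel_size=3, padding=1)
        # Would be initialised from PubMed-pretrained embeddings in practice.
        self.word_emb = nn.Embedding(word_vocab, word_dim, padding_idx=0)
        self.bilstm = nn.LSTM(word_dim + char_dim, hidden,
                              bidirectional=True, batch_first=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (B, T); char_ids: (B, T, C)
        B, T, C = char_ids.shape
        ch = self.char_emb(char_ids.view(B * T, C)).transpose(1, 2)
        ch = torch.relu(self.char_cnn(ch)).max(dim=2).values.view(B, T, -1)
        x = torch.cat([self.word_emb(word_ids), ch], dim=-1)
        h, _ = self.bilstm(x)            # (B, T, 2 * hidden)
        return h.max(dim=1).values       # max-pool over time -> final vector
```

Both diagnosis descriptions and ICD code descriptions pass through this same encoder, so their vectors are directly comparable.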

The loss function captures the relative similarity between instances by minimising the distance between the diagnosis example and a positive example (positive pairs) and maximising the distance between the diagnosis example and a negative example (negative pairs). Distances are Euclidean. The MIMIC-III dataset does not have a one-to-one mapping of ICD codes to diagnoses, so we crawled the web to extract synonyms of the ICD-10 codes. All the synonyms for each ICD code serve as positive examples. Negative examples are created from n-grams that are similar to the code description.
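A common way to express this kind of objective is a triplet-style margin loss; the numpy sketch below is one plausible formulation under that assumption (the margin value is invented), not necessarily the paper's exact loss:

```python
import numpy as np

def triplet_loss(anchor, pos, neg, margin=1.0):
    """Pull the diagnosis vector toward its positive code vector and push
    it away from the negative one, using Euclidean distances."""
    d_pos = np.linalg.norm(anchor - pos, axis=-1)   # positive-pair distance
    d_neg = np.linalg.norm(anchor - neg, axis=-1)   # negative-pair distance
    return np.maximum(0.0, d_pos - d_neg + margin).mean()

a = np.array([[0.0, 0.0]])
far = np.array([[10.0, 0.0]])
print(triplet_loss(a, a, far))   # 0.0  (positive close, negative far)
print(triplet_loss(a, far, far)) # 1.0  (margin violated by the full margin)
```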

Decision Tree

Tables 2–5 contain tabular data. Our approach applies a decision tree over the binary features in the tables and leverages the one-vs-all strategy for multi-label classification. To counter class imbalance, higher weights were given to samples from minority classes.
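A minimal scikit-learn sketch of this setup on toy data (the features and labels below are synthetic; `class_weight="balanced"` is one standard way to up-weight minority-class samples):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))               # binary tabular features
# Two toy "ICD labels": one driven by feature 0, one by features 1 AND 2
Y = np.stack([X[:, 0], X[:, 1] & X[:, 2]], axis=1)

# One decision tree per label (one-vs-all); balanced class weights counter
# the class imbalance by up-weighting minority-class samples.
clf = OneVsRestClassifier(
    DecisionTreeClassifier(class_weight="balanced", max_depth=4))
clf.fit(X, Y)
pred = clf.predict(X)                                 # (200, 2) multi-label
```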

Model Ensemble

During inference time, our ensemble model takes the weighted sum of probabilities predicted from the three individual models to compute the final predicted probability for each class.
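The weighted combination is straightforward; in the sketch below both the per-model probabilities and the weights are invented (the paper presumably tunes the weights on validation data):

```python
import numpy as np

# Per-class probabilities from the three models (hypothetical values)
p_text = np.array([0.9, 0.2, 0.1])   # Text-CNN on unstructured notes
p_rank = np.array([0.8, 0.3, 0.2])   # diagnosis-based ranking
p_tree = np.array([0.7, 0.1, 0.4])   # decision tree on tabular data

w = np.array([0.5, 0.3, 0.2])        # assumed ensemble weights, sum to 1

# Final per-class probability is the weighted sum of the three models
p_final = w[0] * p_text + w[1] * p_rank + w[2] * p_tree
print(p_final)  # [0.83 0.21 0.19]
```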

Interpretability Methods

To identify the key phrases that lead to a predicted ICD code, we capture the association strength between a word w and an ICD code y. We do so by extracting all the paths connecting w and y in our neural network and computing an influence score for each; the scores of all paths are then summed to measure the association strength. To capture key phrases, we combine consecutive words that have non-zero scores and rank the phrases by score. Top-ranked phrases are considered important signals for the specific ICD code prediction.
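The phrase-merging step can be sketched as follows (plain Python; the scoring of individual words is assumed to have already happened upstream, and the example words and scores are invented):

```python
def extract_key_phrases(words, scores, top_k=3):
    """Merge consecutive words with non-zero influence scores into phrases,
    score each phrase by the sum of its word scores, return the top-k."""
    phrases, cur, cur_score = [], [], 0.0
    for w, s in zip(words, scores):
        if s != 0:
            cur.append(w)
            cur_score += s
        elif cur:                         # a zero-score word ends the phrase
            phrases.append((" ".join(cur), cur_score))
            cur, cur_score = [], 0.0
    if cur:
        phrases.append((" ".join(cur), cur_score))
    return sorted(phrases, key=lambda p: -p[1])[:top_k]

words = "patient has acute renal failure today".split()
scores = [0, 0, 0.9, 0.8, 0.7, 0]        # hypothetical influence scores
print(extract_key_phrases(words, scores))
```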

For each tabular feature, we use Local Interpretable Model-Agnostic Explanations (LIME) to compute how much the feature contributes to the model's final prediction.

Experimental Setup and Results

We used two evaluation metrics to measure the classification performance and interpretability of our models:

  1. Classification. F1 and AUC, measuring precision and recall and summarising performance across different thresholds

  2. Interpretability. Jaccard Similarity Coefficient (JSC) to measure the overlap between the extracted evidence and physicians’ annotations
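The JSC itself is simple to compute; a small sketch over invented phrase sets:

```python
def jaccard(a, b):
    """Jaccard Similarity Coefficient between two sets of extracted evidence."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical extracted phrases vs. a physician's annotations
model_phrases = {"chest pain", "troponin elevation", "st depression"}
physician_phrases = {"chest pain", "troponin elevation"}
print(jaccard(model_phrases, physician_phrases))  # 2/3 -> 0.666...
```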


We used different variations of TFIDF as baselines; most of our models are CNN-based. The vanilla Text-CNN and DenseNet performed similarly to the baseline models in F1 score and outperformed them in AUC. Label smoothing was effective in alleviating the class imbalance problem, as shown by strong improvements in F1 and AUC scores. Diagnosis-based ranking showed similar improvements.

Our F1 and AUC scores continue to increase as we move through the different variations of our ensemble models. Text-CNN + LS + DR + TD shows a 7% increase in macro-F1 score over the vanilla Text-CNN, with similar improvements across other metrics, showcasing the effectiveness of our ensemble method.

Finally, we see a further strong improvement in performance by incorporating clinical guidelines as TFIDF feature vectors: an increase of 7% in macro-F1 score over the best-performing ensemble model without external clinical guidelines. Our Text-TFIDF-CNN + LS + DR + TD outperformed all baseline and ensemble models, indicating that incorporating external knowledge into the classification task can significantly improve the model's performance.

For interpretability evaluation, we collected a test set of 25 samples spanning 5 ICD-10 codes, annotated by 3 experienced physicians. We compared the top-k phrases extracted by our model with the human annotations and measured the overlap between them. The results are shown in Table 2 above. On average, our model obtains a JSC of 0.1806 on text data. It is able to capture phrases that are directly related to a specific disease, as well as phrases indirectly related to the final prediction.

For tabular data, we selected the k most important features found by LIME as evidence for the model's predictions. Again, we computed the overlap between these features and the human annotations; results are displayed in Table 2. The average JSC between human annotators is 0.5, higher than the average JSC of 0.31 between our model and the human annotators. Overall, our model captured more features than the human annotators, and some of those features are useful for diagnosis yet were not picked up by the annotators.

Conclusion and Future Work

Overall, our ensemble model outperformed all the baseline methods, and we saw further improvements in performance by incorporating human knowledge into the models. In addition, our model's predictions are much more explainable. Potential future work includes enlarging the code list, reducing feature dimensions for the tabular data, and further investigating different methods of adding human knowledge.


