Objective and Contribution

We propose Convolutional Attention for Multi-Label classification (CAML), an attentional CNN for multi-label document classification that takes in clinical notes and predicts the medical codes that best describe the diagnoses and procedures. The attention mechanism selects the most relevant segments of the note for each possible code. Our approach beats the previous SOTA results on both MIMIC-II and MIMIC-III, achieving precision@8 of 0.71 and micro-F1 of 0.54. In addition, the attention mechanism gives us transparency into which parts of the clinical note are used to arrive at each code assignment.


We treat ICD-9 code prediction as a multi-label text classification problem: for each clinical note, we want to predict which ICD-9 codes apply to it. CAML has 3 components:

  1. CNN

  2. Attention Mechanism

  3. Classification


Here, our CNN takes in the matrix of word embeddings for the document and, using a convolutional filter, computes a hidden state at each time step. This yields a hidden-state matrix H.
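The convolutional step can be sketched as follows in plain numpy. All sizes and names (`N`, `d_e`, `k`, `d_c`, `W`) are illustrative assumptions, not the authors' code; the point is only the shape bookkeeping: one hidden state per token position.

```python
import numpy as np

# Illustrative sizes: N tokens, embedding dim d_e, filter width k, d_c filters.
rng = np.random.default_rng(0)
N, d_e, k, d_c = 10, 8, 3, 4

X = rng.normal(size=(N, d_e))        # word embedding matrix of the document
W = rng.normal(size=(d_c, k * d_e))  # convolutional filter weights (flattened window)
b = np.zeros(d_c)

pad = k // 2
Xp = np.vstack([np.zeros((pad, d_e)), X, np.zeros((pad, d_e))])  # pad to keep length N

# One hidden state per time step: h_n = tanh(W . x_{n:n+k} + b)
H = np.tanh(np.stack([W @ Xp[n:n + k].reshape(-1) + b for n in range(N)]))
print(H.shape)  # (10, 4): one d_c-dimensional hidden state per token
```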

Attention Mechanism

Normally, we would apply a pooling operation across the matrix H. However, since our goal is to assign multiple labels to each document, and different parts of the document are linked to different labels, we apply a per-label attention mechanism. For each label, we compute a matrix-vector product between H and the corresponding label vector, and feed the result into a softmax layer to obtain an attention vector specific to that label. We repeat this process for every label in the dataset. Each label's attention vector is then used to compute a label-specific vector representation of the document.
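The per-label attention described above can be sketched like this, assuming the hidden-state matrix H from the CNN and one learned parameter vector per label (the names `U`, `alphas`, `V` are ours, chosen for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
N, d_c, L = 10, 4, 3               # tokens, hidden size, number of labels
H = rng.normal(size=(N, d_c))      # hidden states from the CNN
U = rng.normal(size=(L, d_c))      # one attention parameter vector per label

# For each label l: alpha_l = softmax(H u_l), then v_l = H^T alpha_l
alphas = np.stack([softmax(H @ U[l]) for l in range(L)])  # (L, N) attention weights
V = alphas @ H                                            # (L, d_c) per-label document vectors
```

Each row of `alphas` sums to 1, so `v_l` is an attention-weighted average of the hidden states, focused on the tokens most relevant to label l.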


Given the label-specific document representation, we compute the probability of label l using a fully connected layer with a sigmoid activation.
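A minimal sketch of this final layer, assuming the per-label document vectors `V` from the attention step (weight names `beta`, `b` are illustrative): each label gets its own weight vector, and each probability is an independent sigmoid, which is what makes the prediction multi-label.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_c = 3, 4
V = rng.normal(size=(L, d_c))        # per-label document representations
beta = rng.normal(size=(L, d_c))     # one classifier weight vector per label
b = np.zeros(L)

logits = np.einsum("ld,ld->l", beta, V) + b   # dot product per label
probs = 1.0 / (1.0 + np.exp(-logits))         # independent sigmoid per label
preds = probs > 0.5                           # each code predicted independently
```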

Note that many codes are rarely seen in the labelled data, so we also use the text description of each code, as shown in the table below. We build a separate module that learns to encode these descriptions into vectors. These vectors act as a regulariser: if a particular code is rarely observed in the training data, the regulariser keeps its parameters close to those of codes with similar descriptions.
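The effect of this description regulariser can be sketched as an extra penalty term added to the training loss. Here we assume the description encoder has already produced a vector `Z[l]` for each code's text description; the encoder itself and all names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_c = 3, 4
beta = rng.normal(size=(L, d_c))   # per-label classifier weights
Z = rng.normal(size=(L, d_c))      # encoded code descriptions (assumed precomputed)
lam = 0.01                         # regularisation strength (illustrative value)

# Squared-distance penalty pulling each beta_l toward its description vector z_l;
# this term is added to the usual binary cross-entropy loss during training.
penalty = lam * np.sum((beta - Z) ** 2)
```

For a rarely observed code, the data term barely updates `beta_l`, so the penalty dominates and keeps its weights near codes with similar descriptions.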


We evaluate on two datasets, MIMIC-II and MIMIC-III, focusing on discharge summaries. There are around 9,000 unique ICD-9 codes, of which roughly 7,000 are diagnosis codes and 2,000 are procedure codes. Additionally, we create a subset called MIMIC-III 50, where we train and evaluate on the 50 most frequent labels for ease of comparison with previous work.

Our comparison baselines are a single-layer CNN, logistic regression, and a bi-GRU. For evaluation metrics, we use micro-averaged and macro-averaged F1 and AUC. Micro-averaged metrics treat each (text, code) pair as a separate prediction, whereas macro-averaged metrics average the per-label scores. We also compute precision at n (P@n), the fraction of the n highest-scored labels that appear in the ground truth.
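Precision@n is simple enough to state in a few lines of code; this is an illustrative implementation of the definition above, not the authors' evaluation script:

```python
import numpy as np

def precision_at_n(scores, true_labels, n):
    """Fraction of the n highest-scored labels that are in the ground truth."""
    top_n = np.argsort(scores)[::-1][:n]   # indices of the n highest scores
    return len(set(top_n) & set(true_labels)) / n

scores = np.array([0.9, 0.1, 0.8, 0.4, 0.7])  # model scores for 5 labels
true_labels = {0, 2, 3}                       # indices of the gold codes
print(precision_at_n(scores, true_labels, 3)) # top-3 are {0, 2, 4} -> 2/3
```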


The main results are displayed below. CAML outperformed the three baselines on all macro- and micro-averaged metrics. The attention mechanism yielded strong improvements over the vanilla CNN, and logistic regression performed the worst of the baselines. We believe precision@8 gives the most informative results, as it measures how confidently the model can return a small subset of codes.

We also evaluated our model on the 50 most common codes in MIMIC-III and on MIMIC-II; the results for both are displayed below. On MIMIC-III 50, DR-CAML performed best on all metrics except P@5, where the CNN baseline outperformed our model; we believe the high value of k = 10 in CAML makes it better suited to larger datasets. On MIMIC-II, both CAML and DR-CAML outperformed all baseline models.


We also wanted to evaluate the explanations generated by our models. We provided physicians with explanations from four different methods (including CAML) for 100 randomly sampled predicted codes: for each method, the most important k-gram was extracted along with a window of five words on either side, as shown in the figure below. Given each code and its description, we asked the physicians to select the text spans that explain the code and, optionally, to mark spans as "highly informative".

The results of the interpretability evaluation are displayed below. CAML produced the most "highly informative" explanations and more "informative" explanations than logistic regression and the CNN. The cosine-similarity method also performed well, but in 12% of cases it was unable to provide any explanation at all, because no k-gram in the note had non-zero similarity to the label's description.

Conclusion and Future Work

Potential future work could involve incorporating discharge summaries in MIMIC-III, developing better methods for handling out-of-vocabulary tokens and non-standard phrases, and investigating the use of the ICD code hierarchy for better predictions.
