Objective and Contribution
We proposed Convolutional Attention for Multi-Label classification (CAML), an attentional CNN for multi-label document classification that takes in clinical notes and predicts the medical codes that best describe the diagnoses and procedures. The attention mechanism selects the most relevant segments of the note for each possible code. Our approach beats the previous SOTA results on both MIMIC-II and MIMIC-III, achieving 0.71 precision@8 and 0.54 Micro-F1. In addition, the attention mechanism gives us transparency into which parts of the clinical note are used to arrive at each code assignment.
Methodology
We treat ICD-9 code prediction as a multi-label text classification problem: for each clinical note, we want to predict which ICD-9 codes apply to it. CAML has three components:

CNN

Attention Mechanism

Classification
CNN
Here, our CNN takes in a matrix of word embeddings for the document and, using a convolutional filter, computes a hidden state at each time step. This results in a hidden-state matrix H.
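As a minimal NumPy sketch of this step (shapes, the padding scheme, and the tanh nonlinearity are illustrative assumptions; the paper uses trained filters and a learned embedding layer):

```python
import numpy as np

def conv1d_hidden_states(X, W, b):
    """Compute CNN hidden states H from a document's embedding matrix.

    X: (N, d_e) word embeddings for a note of N tokens
    W: (k, d_e, d_c) convolutional filter of width k with d_c output channels
    b: (d_c,) bias
    Returns H: (N, d_c), one hidden state per time step (zero padding
    keeps the output length equal to the input length).
    """
    N, d_e = X.shape
    k, _, d_c = W.shape
    pad = k // 2
    Xp = np.vstack([np.zeros((pad, d_e)), X, np.zeros((pad, d_e))])
    H = np.zeros((N, d_c))
    for t in range(N):
        window = Xp[t:t + k]                         # (k, d_e) local context
        # contract the window against the filter, then apply a nonlinearity
        H[t] = np.tanh(np.tensordot(window, W) + b)  # (d_c,)
    return H
```

Each row of H summarises a width-k window of the note centred on one token, which is what the attention step below weights.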
Attention Mechanism
Normally, we would apply a pooling operation across the matrix H. However, since our goal is to assign multiple labels to each document, and different parts of the document are linked to different labels, we apply a per-label attention mechanism. For each label, we compute a matrix-vector product between H and the corresponding label vector; the result is fed into a softmax layer to produce an attention vector specific to that label. Repeating this for every label in our dataset gives one attention vector per label, which is then used to compute a label-specific vector representation of the document.
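The per-label attention step above can be sketched as follows (variable names and shapes are assumptions for illustration; in the paper the label vectors are learned parameters):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def per_label_attention(H, U):
    """Per-label attention over CNN hidden states.

    H: (N, d_c) hidden state per token from the CNN
    U: (L, d_c) one attention parameter vector per label
    Returns V: (L, d_c), one attended document vector per label.
    """
    L = U.shape[0]
    V = np.zeros((L, H.shape[1]))
    for l in range(L):
        alpha = softmax(H @ U[l])  # (N,) attention distribution over tokens
        V[l] = alpha @ H           # attention-weighted sum of hidden states
    return V
```

Because alpha sums to one, each V[l] is a convex combination of the rows of H, concentrated on the tokens most relevant to label l.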
Classification
Given the label-specific document representations, we compute the probability of each label l using a fully connected layer followed by a sigmoid function.
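A minimal sketch of this output layer, assuming one weight vector and bias per label (names are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def label_probabilities(V, B, b):
    """Independent per-label probabilities for multi-label classification.

    V: (L, d_c) per-label attended document vectors
    B: (L, d_c) per-label output weight vectors
    b: (L,) per-label biases
    Returns y_hat: (L,) with y_hat[l] = sigmoid(B[l] . V[l] + b[l]).
    """
    return sigmoid(np.einsum('ld,ld->l', V, B) + b)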
Note that many codes are rarely seen in the labelled data, so we also use the text description of each code, as shown in the table below. We build a separate module that learns to encode these descriptions into vectors. These vectors act as a regulariser: if a particular code is rarely observed in the training data, the regulariser keeps its parameters similar to those of codes with similar descriptions.
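One simple way to realise such a regulariser, sketched here under the assumption that it is a squared-distance penalty added to the training loss (the weighting term lam and the encoder producing Z are illustrative):

```python
import numpy as np

def description_regularizer(B, Z, lam):
    """Penalty pulling each label's output weights toward its encoded
    text description.

    B: (L, d_c) per-label output weight vectors
    Z: (L, d_c) encoded code descriptions from a separate module
    lam: regularisation strength (a tuning hyperparameter)
    Returns a scalar added to the classification loss. Rare codes, whose
    weights receive few gradient updates from data, stay anchored near
    the encodings of their descriptions.
    """
    return lam * np.sum((B - Z) ** 2)
```

Because codes with similar descriptions get similar encodings Z, their weight vectors are pulled toward each other even when one of the codes is almost absent from the training set.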
Experiments
We have two evaluation datasets, MIMIC-II and MIMIC-III, where we focus on discharge summaries. There are around 9,000 unique ICD-9 codes, of which roughly 7,000 are diagnosis codes and 2,000 are procedure codes. Additionally, we created a subset called MIMIC-III 50, where we train and evaluate on only the 50 most frequent labels, for ease of comparison with previous work.
Our comparison baselines are a single-layer CNN, logistic regression, and a bi-GRU. For evaluation metrics, we use micro-averaged and macro-averaged F1 and AUC. Micro-averages treat each (text, code) pair as a separate prediction, whereas macro-averages are computed per label and then averaged. We also compute precision at n (P@n), which reports the fraction of the n highest-scored labels that appear in the ground truth.
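The three metrics can be sketched concretely as follows (a NumPy illustration of the standard definitions, not the paper's evaluation code):

```python
import numpy as np

def micro_macro_f1(y_true, y_pred):
    """y_true, y_pred: (n_docs, n_labels) binary matrices."""
    tp = (y_true * y_pred).sum(axis=0)
    fp = ((1 - y_true) * y_pred).sum(axis=0)
    fn = (y_true * (1 - y_pred)).sum(axis=0)
    # micro: pool every (doc, code) decision before computing F1
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    # macro: F1 per label, then average, so rare codes count equally
    per_label = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-12)
    return micro, per_label.mean()

def precision_at_n(y_true, scores, n):
    """Fraction of each document's n highest-scored labels that are in
    the ground truth, averaged over documents."""
    top = np.argsort(-scores, axis=1)[:, :n]           # top-n label indices
    hits = np.take_along_axis(y_true, top, axis=1)     # 1 where correct
    return hits.mean()
```

The micro/macro gap is informative here: with ~9,000 labels, most of them rare, macro-F1 punishes models that only get the frequent codes right.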
Results
The main results are displayed below. CAML outperformed the three baseline models on both the macro- and micro-averaged versions of all metrics, and the attention mechanism yielded strong improvements over the vanilla CNN. Logistic regression performed worse than all of the neural methods. We believe precision@8 gives the most informative results, as it measures how reliably the model can return a small, high-confidence subset of codes.
We also evaluated our model on the 50 most common codes in MIMIC-III, and on MIMIC-II; both sets of results are displayed below. On MIMIC-III 50, DR-CAML performed best on all metrics except P@5, where the CNN baseline outperformed our model. We believe CAML's large filter width of k = 10 makes it better suited to larger datasets. On MIMIC-II, both CAML and DR-CAML outperformed all baseline models.
Interpretability
We also wanted to evaluate the explanations generated by our models. We collected explanations from four different methods (including our CAML) for 100 randomly sampled predicted codes, extracting from each method the most important k-gram together with a window of five words on either side; this is shown in the figure below. Given each code and its description, we asked physicians to select the text spans that explain the code, optionally marking spans as "highly informative".
The results of the interpretability evaluation are displayed below. Our CAML model produced the most "highly informative" explanations and more "informative" explanations than logistic regression and the CNN. The cosine-similarity method also performs well, but 12% of the time it was unable to provide any explanation at all, because no k-gram in the note had nonzero similarity to the label's description.
Conclusion and Future Work
Potential future work could involve incorporating discharge summaries in MIMIC-III, developing a better methodology for handling out-of-vocabulary tokens and non-standard phrases, and investigating the use of the ICD code hierarchy for better predictions.