Objective and Contribution

Proposed the Multi-Filter Residual CNN (MultiResCNN) for ICD coding from discharge summaries. The model is innovative in that a) it has a multi-filter convolutional layer that captures text patterns of different lengths and b) it has a residual convolutional layer that enlarges the receptive field. MultiResCNN outperformed previous baseline models in 4 out of 6 evaluation metrics on full-code MIMIC-III, and outperformed all baseline models on MIMIC-II and top-50-code MIMIC-III.


Our MultiResCNN has five layers:

  1. Input layer

  2. Multi-filter convolutional layer

  3. Residual convolutional layer

  4. Attention layer

  5. Output layer

Input layer

Here, we encode the input sequence using pre-trained word embeddings, producing the input embedding matrix E, where each row of E is the embedding vector of one word in the sequence.
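A minimal sketch of this lookup, using toy sizes and a random weight matrix in place of the paper's pre-trained embeddings:

```python
import numpy as np

vocab_size, embed_dim = 1000, 100           # toy sizes, not the paper's settings
W = np.random.randn(vocab_size, embed_dim)  # in practice: pre-trained word vectors

token_ids = np.array([5, 42, 7, 42])        # a document as word indices
E = W[token_ids]                            # (seq_len, embed_dim): one row per word
```

Identical tokens map to identical rows of E, so `E[1]` and `E[3]` are the same vector.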

Multi-filter convolutional layer

Here, in order to capture text patterns of different lengths in the input sequence, we use a multi-filter CNN in which each filter has a different kernel size. We therefore apply m 1D convolutions (each sliding along the sequence dimension) to the input embedding matrix, where m is the number of filters. Padding and a stride of 1 ensure that the sequence length is unchanged after convolution. The overview of this process is illustrated below.
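This layer can be sketched in PyTorch as follows (toy channel and batch sizes are assumptions; the kernel sizes are the six the paper settles on):

```python
import torch
import torch.nn as nn

embed_dim, out_channels, seq_len = 100, 50, 8  # toy sizes (assumptions)
kernel_sizes = [3, 5, 9, 15, 19, 25]           # the six sizes chosen in the paper

# One 1D convolution per kernel size; stride 1 with padding k // 2
# keeps the sequence length unchanged (all kernel sizes are odd).
convs = nn.ModuleList(
    nn.Conv1d(embed_dim, out_channels, k, stride=1, padding=k // 2)
    for k in kernel_sizes
)

E = torch.randn(1, embed_dim, seq_len)      # (batch, channels, length)
outputs = [conv(E) for conv in convs]       # each is (1, out_channels, seq_len)
```

Because every filter preserves the sequence length, the m outputs stay aligned word-by-word for the layers that follow.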

Residual convolutional layer

On top of each filter, there is a residual convolutional layer consisting of p residual blocks. Each residual block contains three convolutional filters, as shown in the figure below. Since we have m filters, this process is repeated for each of them, producing m output matrices. The final output of the residual convolutional layer is the concatenation of these m outputs into a matrix H.
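A sketch of one such block is below. The exact arrangement of the three convolutions is an assumption based on the figure: here two stacked convolutions form the main path and a third 1x1 convolution projects the input for the shortcut.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of a residual block with three convolutions (assumed layout:
    two stacked convs on the main path, one 1x1 conv on the shortcut)."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        pad = kernel_size // 2  # keep sequence length unchanged
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.shortcut = nn.Conv1d(channels, channels, 1)  # projects the input
        self.act = nn.Tanh()

    def forward(self, x):
        out = self.act(self.conv1(x))
        out = self.conv2(out)
        return self.act(out + self.shortcut(x))  # residual connection

x = torch.randn(1, 50, 8)          # toy (batch, channels, length)
y = ResidualBlock(50, 3)(x)        # same shape as the input
```

Stacking p such blocks enlarges the receptive field while the shortcut keeps gradients flowing to the earlier layers.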

Attention layer

Here, we employed a per-label attention mechanism that allows the model to attend to different parts of the document representation H for each ICD code. The matrix A in the architecture figure above holds the attention weight for each pair of ICD code and word. The matrix V, the output of the attention layer, is computed by matrix multiplication between the attention matrix and the document representation H.
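A minimal sketch of CAML-style per-label attention, assuming A = softmax(HU) with one learned attention vector per label and V = AᵀH (toy sizes throughout):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n_words, d, n_labels = 8, 6, 4       # toy sizes (assumptions)
H = np.random.randn(n_words, d)      # document representation, one row per word
U = np.random.randn(d, n_labels)     # one learned attention vector per label

A = softmax(H @ U, axis=0)           # (n_words, n_labels): per-label weights over words
V = A.T @ H                          # (n_labels, d): label-specific document vectors
```

Each column of A sums to 1 over the words, so each row of V is a weighted average of word representations tailored to one ICD code.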

Output layer

The matrix V is fed into a fully connected linear layer with a sum-pooling function to obtain the score vector over all ICD codes. Because this is a multi-label classification task, the final probability vector is computed by feeding the score vector into a sigmoid function, giving an independent probability per code.
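This step can be sketched as a per-label dot product (element-wise product followed by sum pooling) plus a bias, passed through a sigmoid; sizes are toy assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_labels, d = 4, 6                   # toy sizes (assumptions)
V = np.random.randn(n_labels, d)     # label-specific vectors from the attention layer
W = np.random.randn(n_labels, d)     # per-label weight vectors of the linear layer
b = np.zeros(n_labels)

scores = (W * V).sum(axis=1) + b     # element-wise product, then sum-pooling
probs = sigmoid(scores)              # one independent probability per ICD code
```

Unlike a softmax, the sigmoid lets multiple codes be predicted for the same document, which is what multi-label ICD coding requires.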

Hyperparameter tuning

How many filters should we use? What kernel sizes should we consider? How many residual blocks? These are all hyperparameters of MultiResCNN. In the end, we use 6 filters with kernel sizes 3, 5, 9, 15, 19, and 25, and one residual block. We derived these settings from experiments with three model variants:

  1. CNN. Only one convolutional filter

  2. MultiCNN. Has multi-filter convolutional layer

  3. ResCNN. Only has the residual layer

We explored different configurations for each variant and evaluated performance on the dev sets of the full-code and top-50-code MIMIC-III datasets. The results are displayed below. As shown, both MultiCNN and ResCNN outperformed the vanilla CNN. MultiCNN's performance peaked with 6 filters, and ResCNN's was highest with one residual block. Combining the best of both yields MultiResCNN, which outperformed each of them individually.


Our evaluation datasets are MIMIC-II and MIMIC-III. For MIMIC-III, we evaluated on both the full code set and the top-50 most frequent codes. The MIMIC-III full-code setting has 8,921 ICD-9 codes, with 47,719, 1,631, and 3,372 discharge summaries for train, dev, and test. The MIMIC-III top-50 setting has 8,067, 1,574, and 1,730 discharge summaries for train, dev, and test.

Our evaluation metrics are macro-averaged and micro-averaged AUC and F1 scores, plus precision at 8 and 15 (precision at 5 for the top-50 setting). Our baseline models are:

  1. CAML and DR-CAML

  2. C-MemNN. Condensed Memory Neural Network

  3. C-LSTM-Att. Character-aware LSTM with attention mechanism

  4. SVM. Both flat and hierarchical SVM

  5. HA-GRU. Hierarchical Attention GRU
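The precision@k metric used above can be sketched as follows (toy arrays, not the paper's data):

```python
import numpy as np

def precision_at_k(probs, gold, k):
    """Fraction of the k highest-scoring codes that are correct,
    averaged over documents."""
    topk = np.argsort(-probs, axis=1)[:, :k]        # indices of the k top codes
    hits = np.take_along_axis(gold, topk, axis=1)   # 1 where a top code is gold
    return hits.mean()

probs = np.array([[0.9, 0.1, 0.8, 0.2]])  # predicted probabilities, one document
gold = np.array([[1, 0, 0, 1]])           # binary gold labels
print(precision_at_k(probs, gold, 2))     # top-2 codes are 0 and 2; only 0 is gold -> 0.5
```

Precision@k suits this task because a clinical coder would realistically review only a short list of suggested codes per document.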


The table below shows the results on MIMIC-III full codes. MultiResCNN outperformed CAML in 4 out of 6 evaluation metrics and was competitive on the other two. Our model also obtains stable results, as evidenced by the low standard deviations.

The table below shows the results on MIMIC-III top-50 codes. MultiResCNN outperformed all baseline models on all metrics, improving over the best-performing baseline by 0.015, 0.012, 0.03, 0.037, and 0.023 in macro-AUC, micro-AUC, macro-F1, micro-F1, and precision@5 respectively.

The table below shows the results on MIMIC-II. Again, MultiResCNN outperformed all baseline models on all metrics.


Computational Cost Analysis

We analysed the computational cost of CAML versus MultiResCNN in four aspects: parameter count, training time, number of training epochs, and inference speed, as shown in the table below. MultiResCNN has 1.9x more parameters than CAML and takes 2.3x longer to train. However, we believe the performance gain justifies this increase in computational cost.

Effect of Truncating Data

We truncate any discharge summary longer than 2,500 tokens, and we wanted to assess the effect of this truncation on model performance. We experimented with truncation lengths from 3,500 to 6,500 tokens in increments of 1,000 and found the performance differences to be negligible.

Conclusion and Future Work

Potential future work could explore how to incorporate BERT into this task effectively. We tested BERT and it did not perform well, owing to hardware constraints and its fixed-length context limitation. We could experiment with recurrent Transformers or hierarchical BERT. Lastly, we could explore how to choose the number of kernels and the kernel sizes in a more principled way than pure empirical search.


