Objective and Contribution
Proposed Multi-Filter Residual CNN (MultiResCNN) for ICD coding from discharge summaries. Our model is innovative in that a) it has a multi-filter convolutional layer to capture text patterns of different lengths and b) it has a residual convolutional layer to enlarge the receptive field. Our model outperformed previous baseline models on 4 of 6 evaluation metrics on the full-code MIMIC-III dataset. On MIMIC-II and the top-50-code MIMIC-III dataset, it outperformed all baseline models.
MultiResCNN
Our MultiResCNN has five layers:

Input layer

Multi-filter convolutional layer

Residual convolutional layer

Attention layer

Output layer
Input layer
Here, we encode the input sequence using pre-trained word embeddings, producing the input embedding matrix E, where each row of E is the embedding vector of one word.
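The embedding lookup can be sketched as follows. This is a minimal illustration with toy sizes and a random table standing in for the pre-trained embeddings; the vocabulary size, the token ids, and the embedding dimension here are all illustrative assumptions.

```python
import numpy as np

# Toy vocabulary size and embedding dimension (assumptions for illustration;
# the real model uses pre-trained word embeddings).
vocab_size, d_e = 1000, 100
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_e))  # stands in for pre-trained vectors

token_ids = np.array([5, 42, 7, 99])   # a toy 4-token "discharge summary"
E = embedding_table[token_ids]          # input embedding matrix: one row per word
assert E.shape == (4, 100)
```

Each document of n tokens thus becomes an n x d_e matrix E that the convolutional layers operate on.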
Multi-filter convolutional layer
Here, in order to capture text patterns of different lengths in the input sequence, we use a multi-filter CNN in which each filter has a different kernel size. We therefore apply m 1D convolutions (which slide along the sequence while spanning the full embedding dimension) to the input embedding matrix, where m is the number of filters. We choose the padding and stride so that the sequence length is unchanged after convolution. The overview of this process is illustrated below.
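The "same-length" behaviour of the multi-filter layer can be sketched with a naive 1D convolution. This is a minimal sketch, not the paper's implementation: the helper `conv1d_same`, the channel count of 50, and the 30-token input are illustrative assumptions, and only three of the six kernel sizes are shown.

```python
import numpy as np

def conv1d_same(E, W):
    """Naive 1D convolution over the sequence axis with 'same' padding and
    stride 1, so output length equals input length.
    E: (n, d_in) input matrix; W: (k, d_in, d_out) kernel of width k."""
    k, d_in, d_out = W.shape
    pad = (k - 1) // 2                    # assumes odd k, as in sizes 3,5,9,15,19,25
    Ep = np.pad(E, ((pad, pad), (0, 0)))  # zero-pad the sequence ends
    n = E.shape[0]
    out = np.zeros((n, d_out))
    for i in range(n):
        window = Ep[i:i + k]              # (k, d_in) slice under the kernel
        out[i] = np.tensordot(window, W, axes=([0, 1], [0, 1]))
    return out

rng = np.random.default_rng(0)
E = rng.normal(size=(30, 100))            # toy 30-token input, embedding dim 100
kernel_sizes = [3, 5, 9]                  # a subset of the six sizes used
feature_maps = [conv1d_same(E, rng.normal(size=(k, 100, 50))) for k in kernel_sizes]
# Every filter preserves the sequence length, so branch outputs stay aligned.
assert all(f.shape == (30, 50) for f in feature_maps)
```

Because every branch outputs a matrix with the same number of rows as the input, the branches can later be processed in parallel and concatenated.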
Residual convolutional layer
On top of each filter, there is a residual convolutional layer consisting of p residual blocks. Each residual block consists of three convolutional filters, as shown in the figure below. Since we have m filters, the same process as in the figure is applied to each filter branch, resulting in m output matrices. The final output of the residual convolutional layer is the concatenation of these m outputs into a matrix H.
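The skip-connection structure can be sketched as below. This is a simplified sketch, not the paper's exact block: for brevity the three convolutions are modelled as per-position (1x1) linear maps, and all sizes are toy assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(X, W1, W2, W_skip):
    """Sketch of one residual block: two transformations on the main path
    plus a projected shortcut, so gradients and features can bypass the
    stacked convolutions (this is what enlarges the receptive field safely)."""
    out = relu(X @ W1)       # first transformation
    out = out @ W2           # second transformation
    shortcut = X @ W_skip    # shortcut projection of the input
    return relu(out + shortcut)

rng = np.random.default_rng(0)
m, n, d = 3, 30, 50                       # m filter branches, n tokens, d channels
branch_outputs = [rng.normal(size=(n, d)) for _ in range(m)]
weights = [(rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d)))
           for _ in range(m)]

# One residual block per branch; concatenate the m outputs to form H.
H = np.concatenate([residual_block(X, *w) for X, w in zip(branch_outputs, weights)],
                   axis=1)
assert H.shape == (n, m * d)              # (30, 150)
```

Concatenating along the channel axis keeps one row per token, so the attention layer can still attend over word positions.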
Attention layer
Here, we employ a per-label attention mechanism that allows the model to attend to different parts of the document representation H for each code. The matrix A in the architecture figure above holds the attention weight for each pair of ICD code and word. The matrix V, the output of the attention layer, is computed by multiplying the transposed attention matrix with the document representation H, giving one attended document vector per ICD code.
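The per-label attention can be sketched as follows, in the spirit of CAML-style label attention. The sizes (n tokens, hidden dim d_h, L labels) are toy assumptions, and U stands in for the learned per-label attention parameters.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_h, L = 30, 150, 8                    # toy sizes: n tokens, hidden dim, L ICD codes
H = rng.normal(size=(n, d_h))             # document representation from the conv layers
U = rng.normal(size=(d_h, L))             # one learned attention vector per label

A = softmax(H @ U, axis=0)                # (n, L): word-level weights per ICD code
V = A.T @ H                               # (L, d_h): one attended representation per code
assert np.allclose(A.sum(axis=0), 1.0)    # weights over words sum to 1 for each code
assert V.shape == (L, d_h)
```

Because each column of A is normalised over word positions, each row of V is a weighted average of word representations tailored to one ICD code.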
Output layer
The matrix V is fed into a fully connected linear layer with a sum-pooling function to obtain a score for every ICD code. The final probability vector is computed by passing the score vector through a sigmoid function, since this is a multi-label classification task.
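The scoring step can be sketched as a per-label weighted sum followed by an element-wise sigmoid. This is a minimal sketch with toy sizes; W and b stand in for the learned output-layer parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
L, d_h = 8, 150                            # toy sizes: L ICD codes, hidden dim
V = rng.normal(size=(L, d_h))              # attention-layer output, one row per code
W = rng.normal(size=(L, d_h))              # per-label weight vectors of the linear layer
b = rng.normal(size=L)                     # per-label bias

scores = (V * W).sum(axis=1) + b           # linear score via sum-pooling per label
probs = sigmoid(scores)                    # independent probability for each ICD code
assert probs.shape == (L,)
```

A sigmoid (rather than a softmax) is used because each code is predicted independently: a summary can carry many ICD codes at once.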
Hyperparameter tuning
How many filters should we use? What filter sizes should we consider? How many residual blocks? These are all hyperparameters of our MultiResCNN. In the end, we used 6 filters of sizes 3, 5, 9, 15, 19, and 25, and one residual block. We derived these settings from experiments with three model variants:

CNN. Only one convolutional filter

MultiCNN. Has the multi-filter convolutional layer

ResCNN. Has only the residual convolutional layer
We explored different configurations for each variant and evaluated performance on the dev sets of the full-code and top-50-code MIMIC-III datasets. The results are displayed below. As shown, both MultiCNN and ResCNN outperformed the vanilla CNN model. MultiCNN's performance peaked with 6 filters, and ResCNN's was highest with one residual block. Combining the best of MultiCNN and ResCNN gives MultiResCNN, which outperformed both models individually.
Experiments
Our evaluation datasets are MIMIC-II and MIMIC-III. For MIMIC-III, we explored both the full code set and the top-50 most frequent codes. The full-code MIMIC-III set has 8,921 ICD-9 codes with 47,719, 1,631, and 3,372 discharge summaries for train, dev, and test respectively. The top-50-code MIMIC-III set has 8,067, 1,574, and 1,730 discharge summaries for train, dev, and test.
Our evaluation metrics are macro-averaged and micro-averaged AUC and F1 scores, and precision at 8 and 15. Our baseline models are:

CAML and DR-CAML

C-MemNN. Condensed Memory Neural Network

C-LSTM-Att. Character-aware LSTM with attention mechanism

SVM. Both flat and hierarchical SVM

HA-GRU. Hierarchical Attention GRU
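The precision@k metric used above can be sketched for a single document as follows; the helper name `precision_at_k` and the toy scores are illustrative assumptions (as a metric it is averaged over all test documents).

```python
import numpy as np

def precision_at_k(scores, true_labels, k):
    """Fraction of the k highest-scoring predicted codes that are gold codes."""
    top_k = np.argsort(scores)[::-1][:k]       # indices of the k largest scores
    return len(set(top_k) & set(true_labels)) / k

scores = np.array([0.9, 0.1, 0.8, 0.4, 0.7])   # toy per-code probabilities
true_labels = [0, 2, 3]                         # indices of gold ICD codes
# Top 3 predictions are codes 0, 2, 4; two of them are correct.
assert precision_at_k(scores, true_labels, 3) == 2 / 3
```

Precision@k reflects the clinical use case of surfacing a short ranked list of candidate codes for a human coder to verify.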
Results
The table below shows the results on MIMIC-III full codes. MultiResCNN outperformed CAML in 4 of the 6 evaluation metrics and was competitive on the other two. Our model obtains stable results, as evidenced by the low standard deviations.
The table below shows the results on MIMIC-III top-50 codes. MultiResCNN outperformed all our baseline models on all metrics, improving over the best-performing baseline by 0.015, 0.012, 0.03, 0.037, and 0.023 in macro-AUC, micro-AUC, macro-F1, micro-F1, and precision@5 respectively.
The table below shows the results on MIMIC-II. Again, our MultiResCNN outperformed all the baseline models on all metrics.
Discussion
Computational Cost Analysis
We analysed the computational cost of CAML and MultiResCNN in four aspects: parameter count, training time, number of training epochs, and inference speed, as shown in the table below. The table shows that MultiResCNN has 1.9x more parameters than CAML and takes 2.3x longer to train. However, we believe the performance gain justifies this increase in computational cost.
Effect of Truncating Data
We truncated any discharge summaries longer than 2,500 tokens, and we wanted to assess the effect of this truncation on model performance. We experimented with truncation lengths from 3,500 to 6,500 in increments of 1,000 and found the performance differences to be negligible.
Conclusion and Future Work
Potential future work could explore how to incorporate BERT into this task effectively. We tested BERT, and it did not perform well due to hardware constraints and its fixed-length context limitation. We could potentially experiment with recurrent Transformers and hierarchical BERT. Lastly, we could explore how to choose the number of kernels and the kernel sizes in a more principled way than purely empirically.