1. Installation of SciSpacy Model

In [1]:
# Installation of scispacy large model --> 785K vocab and 600K word vectors
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_lg-0.2.5.tar.gz

2. Import dependencies and load our SciSpacy Model and Pipeline

In [1]:
import pandas as pd

import spacy
from scispacy.abbreviation import AbbreviationDetector
from scispacy.umls_linking import UmlsEntityLinker # Not EntityLinker (see UMLS Entity Linker section)
In [2]:
nlp = spacy.load("en_core_sci_lg")
In [3]:
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)
In [4]:
linker = UmlsEntityLinker(resolve_abbreviations=True)
nlp.add_pipe(linker)
/Users/rong2/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:334: UserWarning: Trying to unpickle estimator TfidfTransformer from version 0.20.3 when using version 0.23.1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
/Users/rong2/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py:334: UserWarning: Trying to unpickle estimator TfidfVectorizer from version 0.20.3 when using version 0.23.1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
In [5]:
doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

3. Abbreviation Detector

Detects abbreviations within the text, but only if the long form also appears in the text, because the abbreviation detector is rule-based.

In [6]:
print("Abbreviation", "\t", "Definition")
for abrv in doc._.abbreviations:
	print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")
Abbreviation 	 Definition
SBMA 	 (33, 34) Spinal and bulbar muscular atrophy
SBMA 	 (6, 7) Spinal and bulbar muscular atrophy
AR 	 (29, 30) androgen receptor
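
To make the "rule-based" point concrete, here is a minimal sketch of one such rule: a short form in parentheses is paired with the words just before it whose initials spell it out. This is a toy stand-in written for illustration, not the algorithm SciSpacy actually implements; the function name find_abbreviations, the STOP_WORDS set and the max_extra_words parameter are all assumptions of this sketch.

```python
import re

# Hypothetical list of words whose initials are ignored when matching
STOP_WORDS = {"and", "of", "the", "in", "for"}


def find_abbreviations(text, max_extra_words=3):
    """Pair each parenthesised short form with the preceding words
    whose initials (ignoring stop words) spell it out."""
    pairs = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        short = match.group(1)
        words = text[:match.start()].split()
        # Try the shortest window first, then allow a few extra words
        for size in range(len(short), len(short) + max_extra_words + 1):
            if size > len(words):
                break
            window = words[-size:]
            initials = "".join(
                w[0] for w in window if w.lower() not in STOP_WORDS
            )
            if initials.lower() == short.lower():
                pairs[short] = " ".join(window)
                break
    return pairs


text = (
    "Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron "
    "disease caused by the expansion of a polyglutamine tract within the "
    "androgen receptor (AR). SBMA can be caused by this easily."
)
abbreviations = find_abbreviations(text)
print(abbreviations)

# Without the long form in the text, the rule has nothing to match
print(find_abbreviations("SBMA can be caused by this easily."))
```

The second call returns an empty result, which is the limitation described above: a rule of this kind can only resolve an abbreviation that is defined somewhere in the same text.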

4. UMLS Entity Linker

Note that the linker class name differs between SciSpacy versions: the version used here provides UmlsEntityLinker rather than EntityLinker. I also changed ‘kb_ents’ to ‘umls_ents’ and ‘linker.kb’ to ‘linker.umls’ for the script to work 🙂

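Each entity is mapped to a list of candidate UMLS concepts (where any match is found), and each candidate comes with a similarity score. Looking at the first entity below, the candidates are listed from highest to lowest score, so we select the first one as the most relevant.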
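
As a small sketch of that selection step, the snippet below picks the highest-scoring candidate from a list of (CUI, score) pairs. The scores are the ones the notebook prints for the first entity; the helper best_candidate and its min_score cut-off are hypothetical additions for illustration and are not part of SciSpacy.

```python
# (CUI, score) pairs as printed by the notebook for the entity "Spinal"
candidates = [
    ("C0521329", 0.9999999403953552),
    ("C0037922", 0.8044087886810303),
    ("C3887662", 0.7648879885673523),
    ("C1334264", 0.7204329967498779),
    ("C0037925", 0.7181439399719238),
]


def best_candidate(candidates, min_score=0.0):
    """Return the (cui, score) pair with the highest score,
    or None if there are no candidates or the best score is too low."""
    if not candidates:
        return None
    cui, score = max(candidates, key=lambda pair: pair[1])
    return (cui, score) if score >= min_score else None


best = best_candidate(candidates)
print(best)
```

Using max() rather than indexing the first element means the helper does not depend on the list being sorted, and the None return covers entities with no candidates at all.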

In [10]:
# Let's look at an entity!
entity = doc.ents[0]

print("Name: ", entity)

# Each entity is linked to UMLS with a score
# (currently just char-3gram matching).
for umls_ent in entity._.umls_ents:
    print(umls_ent)
Name:  Spinal
('C0521329', 0.9999999403953552)
('C0037922', 0.8044087886810303)
('C3887662', 0.7648879885673523)
('C1334264', 0.7204329967498779)
('C0037925', 0.7181439399719238)
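
The code comment above mentions char-3gram matching. To give a feel for what that means, here is a simplified sketch that scores two strings by the overlap of their character 3-grams. It uses plain Jaccard overlap as a stand-in: the TfidfVectorizer warnings shown earlier suggest the real linker works on TF-IDF weighted vectors, so the numbers here will not reproduce the scores above. The function names are made up for this example.

```python
def char_ngrams(text, n=3):
    """Return the set of lower-cased character n-grams in text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def trigram_similarity(a, b):
    """Jaccard overlap of character 3-grams, between 0.0 and 1.0."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


same = trigram_similarity("Spinal", "spinal")
partial = trigram_similarity("spinal", "spinal cord")
unrelated = trigram_similarity("spinal", "kidney")
print(same, partial, unrelated)
```

Because the comparison is purely on surface characters, a mention and an alias that are spelled alike score highly regardless of meaning, which is worth keeping in mind when reading the linked concepts further down.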
In [42]:
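concept_entity = []  # (CUI, entity span) pairs
all_umls_data = []   # full UMLS record for each linked entity
for entity in doc.ents:
    # Skip entities that have no UMLS candidates at all
    if not entity._.umls_ents:
        continue
    print("Name: ", entity)
    # Candidates are listed best-first, so take the first one
    highest_umls_ent = entity._.umls_ents[0]

    concept_entity.append((highest_umls_ent[0], entity))

    # Look up the full UMLS record (name, definition, TUIs, aliases) by CUI
    umls_data = linker.umls.cui_to_entity[highest_umls_ent[0]]
    print(umls_data)
    all_umls_data.append(umls_data)
    print('\n')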
Name:  Spinal
CUI: C0521329, Name: spinal
Definition: Of or relating to the spine or spinal cord.
TUI(s): T082
Aliases: (total: 3): 
	 Spinal, Spinal, Spinal


Name:  bulbar muscular atrophy
CUI: C1839259, Name: Bulbo-Spinal Atrophy, X-Linked
Definition: An X-linked recessive form of spinal muscular atrophy. It is due to a mutation of the gene encoding the ANDROGEN RECEPTOR.
TUI(s): T047
Aliases (abbreviated, total: 50): 
	 Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal Atrophy, X-Linked, Atrophies, X-Linked Bulbo-Spinal, Bulbo Spinal Atrophy, X Linked, Bulbo-Spinal Atrophies, X-Linked, X-Linked Bulbo-Spinal Atrophies, Atrophy, X-Linked Bulbo-Spinal, X Linked Bulbo Spinal Atrophy, X-Linked Bulbo-Spinal Atrophy, X-Linked Bulbo-Spinal Atrophy


Name:  SBMA
CUI: C1839259, Name: Bulbo-Spinal Atrophy, X-Linked
Definition: An X-linked recessive form of spinal muscular atrophy. It is due to a mutation of the gene encoding the ANDROGEN RECEPTOR.
TUI(s): T047
Aliases (abbreviated, total: 50): 
	 Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal Atrophy, X-Linked, Atrophies, X-Linked Bulbo-Spinal, Bulbo Spinal Atrophy, X Linked, Bulbo-Spinal Atrophies, X-Linked, X-Linked Bulbo-Spinal Atrophies, Atrophy, X-Linked Bulbo-Spinal, X Linked Bulbo Spinal Atrophy, X-Linked Bulbo-Spinal Atrophy, X-Linked Bulbo-Spinal Atrophy


Name:  inherited
CUI: C0439660, Name: Hereditary
Definition: Transmitted from parent to child by information contained in the genes.
TUI(s): T169
Aliases (abbreviated, total: 14): 
	 Hereditary, Hereditary, Hereditary, HEREDITARY, hereditary, hereditary, Inherited, INHERITED, inherited, inherited


Name:  motor neuron disease
CUI: C0085084, Name: Motor Neuron Disease
Definition: Diseases characterized by a selective degeneration of the motor neurons of the spinal cord, brainstem, or motor cortex. Clinical subtypes are distinguished by the major site of degeneration. In AMYOTROPHIC LATERAL SCLEROSIS there is involvement of upper, lower, and brainstem motor neurons. In progressive muscular atrophy and related syndromes (see MUSCULAR ATROPHY, SPINAL) the motor neurons in the spinal cord are primarily affected. With progressive bulbar palsy (BULBAR PALSY, PROGRESSIVE), the initial degeneration occurs in the brainstem. In primary lateral sclerosis, the cortical neurons are affected in isolation. (Adams et al., Principles of Neurology, 6th ed, p1089)
TUI(s): T047
Aliases (abbreviated, total: 32): 
	 Motor Neuron Disease, disease motor neuron, MOTOR NEURON DISEASE, Motor neuron disease, Motor neuron disease, Motor neuron disease, motor neuron disease, motor neuron disease, Motor Neuron Diseases, Neuron Diseases, Motor


Name:  expansion
CUI: C0007595, Name: cell growth
Definition: The process in which a cell irreversibly increases in size over time by accretion and biosynthetic production of matter similar to that already present. [GOC:ai]
TUI(s): T043
Aliases (abbreviated, total: 14): 
	 cell growth, cell growth, cell growth, cell growth, Cell Growth, cell growths, cells growth, Cells--Growth, growth of cell, growth cell


Name:  polyglutamine tract
CUI: C0032500, Name: Polyglutamic Acid
Definition: A peptide that is a homopolymer of glutamic acid.
TUI(s): T116
Aliases: (total: 5): 
	 Polyglutamic Acid, polyglutamic acid, L-Glutamic acid, homopolymer, L-Glutamic acid, homopolymer, Polyglutamic Acid [Chemical/Ingredient]


Name:  androgen receptor
CUI: C1367578, Name: AR gene
Definition: This gene plays a role in the transcriptional activation of androgen responsive genes.
TUI(s): T028
Aliases (abbreviated, total: 21): 
	 AR gene, AR gene, AR Gene, ANDROGEN RECEPTOR, androgen receptor, androgen receptor, testicular feminization, DIHYDROTESTOSTERONE RECEPTOR, AR, AR


Name:  AR
CUI: C1367578, Name: AR gene
Definition: This gene plays a role in the transcriptional activation of androgen responsive genes.
TUI(s): T028
Aliases (abbreviated, total: 21): 
	 AR gene, AR gene, AR Gene, ANDROGEN RECEPTOR, androgen receptor, androgen receptor, testicular feminization, DIHYDROTESTOSTERONE RECEPTOR, AR, AR


Name:  SBMA
CUI: C1839259, Name: Bulbo-Spinal Atrophy, X-Linked
Definition: An X-linked recessive form of spinal muscular atrophy. It is due to a mutation of the gene encoding the ANDROGEN RECEPTOR.
TUI(s): T047
Aliases (abbreviated, total: 50): 
	 Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal Atrophy, X-Linked, Atrophies, X-Linked Bulbo-Spinal, Bulbo Spinal Atrophy, X Linked, Bulbo-Spinal Atrophies, X-Linked, X-Linked Bulbo-Spinal Atrophies, Atrophy, X-Linked Bulbo-Spinal, X Linked Bulbo Spinal Atrophy, X-Linked Bulbo-Spinal Atrophy, X-Linked Bulbo-Spinal Atrophy
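
Several mentions in the output above resolve to the same CUI (for example "bulbar muscular atrophy" and "SBMA"). A minimal sketch of grouping mentions by concept is shown below. Plain strings stand in for the spaCy spans stored in concept_entity, and the name mentions_by_cui is an assumption of this example.

```python
from collections import defaultdict

# (CUI, mention text) pairs copied from the output above;
# in the notebook the second item is a spaCy span rather than a string
concept_entity = [
    ("C0521329", "Spinal"),
    ("C1839259", "bulbar muscular atrophy"),
    ("C1839259", "SBMA"),
    ("C0439660", "inherited"),
    ("C0085084", "motor neuron disease"),
    ("C0007595", "expansion"),
    ("C0032500", "polyglutamine tract"),
    ("C1367578", "androgen receptor"),
    ("C1367578", "AR"),
    ("C1839259", "SBMA"),
]

mentions_by_cui = defaultdict(list)
for cui, mention in concept_entity:
    # Keep each distinct mention once, in order of first appearance
    if mention not in mentions_by_cui[cui]:
        mentions_by_cui[cui].append(mention)

for cui, mentions in mentions_by_cui.items():
    print(cui, mentions)
```

Grouping this way shows at a glance that the abbreviation and its long form were linked to the same concept.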

5. Visualisation

In [21]:
from spacy import displacy  # not imported in the cells above

displacy.render(doc, style="ent", jupyter = True)
In [47]:
umls_df = pd.DataFrame(all_umls_data)
In [52]:
umls_df
Out[52]:
   concept_id  canonical_name                  aliases                                          types   definition
0  C0521329    spinal                          [Spinal, Spinal, Spinal]                         [T082]  Of or relating to the spine or spinal cord.
1  C1839259    Bulbo-Spinal Atrophy, X-Linked  [Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal …  [T047]  An X-linked recessive form of spinal muscular …
2  C1839259    Bulbo-Spinal Atrophy, X-Linked  [Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal …  [T047]  An X-linked recessive form of spinal muscular …
3  C0439660    Hereditary                      [Hereditary, Hereditary, Hereditary, HEREDITAR…  [T169]  Transmitted from parent to child by informatio…
4  C0085084    Motor Neuron Disease            [Motor Neuron Disease, disease motor neuron, M…  [T047]  Diseases characterized by a selective degenera…
5  C0007595    cell growth                     [cell growth, cell growth, cell growth, cell g…  [T043]  The process in which a cell irreversibly incre…
6  C0032500    Polyglutamic Acid               [Polyglutamic Acid, polyglutamic acid, L-Gluta…  [T116]  A peptide that is a homopolymer of glutamic acid.
7  C1367578    AR gene                         [AR gene, AR gene, AR Gene, ANDROGEN RECEPTOR,…  [T028]  This gene plays a role in the transcriptional …
8  C1367578    AR gene                         [AR gene, AR gene, AR Gene, ANDROGEN RECEPTOR,…  [T028]  This gene plays a role in the transcriptional …
9  C1839259    Bulbo-Spinal Atrophy, X-Linked  [Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal …  [T047]  An X-linked recessive form of spinal muscular …
In [44]:
concept_entity_df = pd.DataFrame(concept_entity, columns = ['concept_id', 'entity'])
In [53]:
concept_entity_df
Out[53]:
   concept_id  entity
0  C0521329    (Spinal)
1  C1839259    (bulbar, muscular, atrophy)
2  C1839259    (SBMA)
3  C0439660    (inherited)
4  C0085084    (motor, neuron, disease)
5  C0007595    (expansion)
6  C0032500    (polyglutamine, tract)
7  C1367578    (androgen, receptor)
8  C1367578    (AR)
9  C1839259    (SBMA)
In [65]:
overall_df = pd.concat([concept_entity_df['entity'], umls_df], axis = 1)
In [66]:
overall_df
Out[66]:
   entity                       concept_id  canonical_name                  aliases                                          types   definition
0  (Spinal)                     C0521329    spinal                          [Spinal, Spinal, Spinal]                         [T082]  Of or relating to the spine or spinal cord.
1  (bulbar, muscular, atrophy)  C1839259    Bulbo-Spinal Atrophy, X-Linked  [Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal …  [T047]  An X-linked recessive form of spinal muscular …
2  (SBMA)                       C1839259    Bulbo-Spinal Atrophy, X-Linked  [Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal …  [T047]  An X-linked recessive form of spinal muscular …
3  (inherited)                  C0439660    Hereditary                      [Hereditary, Hereditary, Hereditary, HEREDITAR…  [T169]  Transmitted from parent to child by informatio…
4  (motor, neuron, disease)     C0085084    Motor Neuron Disease            [Motor Neuron Disease, disease motor neuron, M…  [T047]  Diseases characterized by a selective degenera…
5  (expansion)                  C0007595    cell growth                     [cell growth, cell growth, cell growth, cell g…  [T043]  The process in which a cell irreversibly incre…
6  (polyglutamine, tract)       C0032500    Polyglutamic Acid               [Polyglutamic Acid, polyglutamic acid, L-Gluta…  [T116]  A peptide that is a homopolymer of glutamic acid.
7  (androgen, receptor)         C1367578    AR gene                         [AR gene, AR gene, AR Gene, ANDROGEN RECEPTOR,…  [T028]  This gene plays a role in the transcriptional …
8  (AR)                         C1367578    AR gene                         [AR gene, AR gene, AR Gene, ANDROGEN RECEPTOR,…  [T028]  This gene plays a role in the transcriptional …
9  (SBMA)                       C1839259    Bulbo-Spinal Atrophy, X-Linked  [Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal …  [T047]  An X-linked recessive form of spinal muscular …
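
One thing worth knowing about the pd.concat call above is that axis=1 places the frames side by side and aligns rows by index label, so it relies on both frames having been built in the same entity order. The sketch below reproduces the idea on three hand-made rows and adds a sanity check; the values are copied from the tables above, while the variable aligned is an addition for illustration.

```python
import pandas as pd

# Three rows copied from the tables above, in the same order in both frames
concept_entity_df = pd.DataFrame(
    [("C0521329", "Spinal"), ("C1839259", "SBMA"), ("C1367578", "AR")],
    columns=["concept_id", "entity"],
)
umls_df = pd.DataFrame(
    [
        ("C0521329", "spinal"),
        ("C1839259", "Bulbo-Spinal Atrophy, X-Linked"),
        ("C1367578", "AR gene"),
    ],
    columns=["concept_id", "canonical_name"],
)

# Side-by-side join on the shared 0..n-1 index
overall_df = pd.concat([concept_entity_df["entity"], umls_df], axis=1)

# Sanity check: every row's entity came from the same CUI as its UMLS record
aligned = bool((concept_entity_df["concept_id"] == overall_df["concept_id"]).all())
print(overall_df)
print("rows aligned:", aligned)
```

If the two lists were ever filled in different orders, this check would fail, which is a cheap way to catch a mismatch before reading the combined table.
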
Ryan

Data Scientist
