A Short K-Means Clustering Example using scikit-learn and NLTK

K-means clustering of text using TF-IDF features from scikit-learn and NLTK. The process is as follows:

  1. Import dependencies and read in data files
  2. Process text
  3. TF-IDF vectorisation
  4. KMeans clustering using scikit-learn
  5. Inference

1. Import dependencies + Read Data Files

In [1]:
import pandas as pd

import string
import collections
 
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer  
from nltk.corpus import stopwords
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from pprint import pprint
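Note: word_tokenize, the WordNetLemmatizer, and the stop word list each depend on NLTK data packages. If these aren't already installed, a one-off download along these lines should work (package names are the standard ones; very recent NLTK releases also want 'punkt_tab' for the tokeniser):

import nltk

nltk.download('punkt')      # tokeniser models used by word_tokenize
nltk.download('wordnet')    # lexical database behind WordNetLemmatizer
nltk.download('stopwords')  # English stop word list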
1a. Read data.csv
In [2]:
data = pd.read_csv('data.csv')
In [3]:
data.head()
Out[3]:
   Class                               Name
0      1                   E. D. Abbott Ltd
1      1                     Schwan-Stabilo
2      1                         Q-workshop
3      1  Marvell Software Solutions Israel
4      1        Bergan Mercy Medical Center
1b. Read classes.txt
In [5]:
classes = []
with open('classes.txt', 'r') as f:
    # classes.txt holds one class name per line; pair each name
    # with a 1-based id to match the 'Class' column in data.csv
    for i, line in enumerate(f, start=1):
        classes.append((i, line.rstrip()))
In [6]:
classes_df = pd.DataFrame(classes, columns=['Class', 'Class_Names'])
In [7]:
classes_df
Out[7]:
    Class             Class_Names
0       1                 Company
1       2  EducationalInstitution
2       3                  Artist
3       4                 Athlete
4       5            OfficeHolder
5       6    MeanOfTransportation
6       7                Building
7       8            NaturalPlace
8       9                 Village
9      10                  Animal
10     11                   Plant
11     12                   Album
12     13                    Film
13     14             WrittenWork

2. Process Text

In [31]:
def process_text(text, lemmatisation=True):
    """Tokenise text, strip punctuation, and optionally lemmatise the tokens."""
    # remove all punctuation characters before tokenising
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)

    if lemmatisation:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(t) for t in tokens]

    return tokens
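As a quick sanity check, here is what the tokeniser does to an illustrative lowercased string (the vectoriser below lowercases before tokenising; the lemmatiser uses its default noun part-of-speech, so plural nouns collapse to their singular form):

process_text('bergan mercy medical centers')
# expected: ['bergan', 'mercy', 'medical', 'center']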

3–4. TF-IDF Vectorisation and KMeans Clustering

In [54]:
texts = data['Name']
clusters = 14
In [55]:
# lowercase=True runs before the custom tokenizer; max_df=0.5
# drops any term that appears in more than half of the documents
vectorizer = TfidfVectorizer(tokenizer=process_text,
                             stop_words=stopwords.words('english'),
                             max_df=0.5,
                             lowercase=True)

tfidf_model = vectorizer.fit_transform(texts)
km_model = KMeans(n_clusters=clusters)
km_model.fit(tfidf_model)

# map each cluster label to the row indices assigned to it
clustering = collections.defaultdict(list)

for idx, label in enumerate(km_model.labels_):
    clustering[label].append(idx)
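Before counting cluster sizes, it can help to inspect which terms carry the most weight in each centroid. A minimal sketch, assuming scikit-learn >= 1.0 (older versions spell the vocabulary accessor get_feature_names):

terms = vectorizer.get_feature_names_out()
# each centroid row holds one TF-IDF weight per vocabulary term;
# sorting descending surfaces the most characteristic terms first
order = km_model.cluster_centers_.argsort()[:, ::-1]
for c in range(clusters):
    print(c, [terms[i] for i in order[c, :5]])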
In [69]:
from collections import Counter
In [70]:
Counter(km_model.labels_.tolist())
Out[70]:
Counter({4: 1921,
         9: 440811,
         0: 23428,
         12: 12752,
         5: 1642,
         6: 6064,
         3: 20112,
         2: 4327,
         11: 2059,
         13: 1034,
         1: 6526,
         10: 13969,
         8: 12402,
         7: 1740})

5. Inference

In [65]:
lines_for_predicting = ["The Worst Band in the Universe", "Bergan Mercy Medical Center", "Marvell Software Solutions Israel"]
km_model.predict(vectorizer.transform(lines_for_predicting))
Out[65]:
array([9, 9, 9], dtype=int32)
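All three strings fall into cluster 9, which is consistent with the counts above: cluster 9 absorbed 440,811 of the roughly 549k rows (about 80%), so the clusters are heavily imbalanced and do not map neatly onto the 14 labelled classes.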