Authorship Attribution of WhatsApp Users through Lexical Analysis
Imagine that you’re being messaged by a stranger. Your gut says it’s someone you already know. How do you figure out which one of your diabolical friends is pulling this off?
Problem #
Let’s generalize the problem. Given the chat histories of two users and a large enough corpus from an anonymous user, we want to find which user’s writing style is closer to that of the anonymous user.
It’s a good ol’ binary classification problem.
To recognize the person, we can analyze different features of their writing style.
Possible approaches #
- Lexical features
- Syntactic features
- Bag of words model
Lexical analysis is what we’ll be using here. We can analyze properties like sentence length variation, number of words per sentence, use of punctuation, emoticons, etc.
Syntactically, we can analyze the use of nouns, pronouns, adverbs, singular/plural forms, and so on.
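For instance, NLTK’s part-of-speech tagger can turn a text into a per-user profile of word classes. Here’s a minimal sketch of the idea (not part of the pipeline below; it assumes the relevant NLTK models have been downloaded):

import nltk
from collections import Counter

def pos_profile(text):
    # relative frequency of each part-of-speech tag in a text;
    # assumes nltk's 'punkt' and 'averaged_perceptron_tagger'
    # data have been downloaded
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    total = float(len(tags))
    return {tag: n / total for tag, n in Counter(tags).items()}

print(pos_profile("I really think we should go there tomorrow."))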
In the bag of words model, we disregard the structure of the sentences altogether and look only at which words are used, and how often.
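As a rough sketch of that idea (again hypothetical, not used below), we could simply count word frequencies per user and compare the resulting distributions:

from collections import Counter

def bag_of_words(messages):
    # word frequencies over a user's messages, ignoring order
    # and sentence structure
    return Counter(' '.join(messages).lower().split())

print(bag_of_words(["where are you", "are you coming"]).most_common(2))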
Solution #
So here’s what we’re going to do:
- Extract text from the chat histories
- Analyze the text
- Prepare the model
- Classify
Step 1:
Following is the code I used to extract the messages from a given user.
import glob
import os

source = os.path.expanduser("~/projects/python/nlp/attribution/corpus/chats")
files = sorted(glob.glob(os.path.join(source, "archive.txt")))

hasanga = []
pasan = []

def extractMessage(user, msg):
    msg = msg.replace('\n', ' ')
    msg = msg.replace('<Media omitted>', '')
    if user == "Pasan Missaka Ucsc":
        pasan.append(msg)
    elif user == "Hasanga":
        hasanga.append(msg)

for fn in files:
    with open(fn) as f:
        for line in f:
            # lines without an "AM -"/"PM -" timestamp are
            # continuations of multi-line messages; skip them
            parts = line.split('M -', 1)
            if len(parts) != 2:
                continue
            time, msg = parts
            # split on the first colon only, so messages that
            # themselves contain colons stay intact
            if ':' in msg:
                user, msg = msg.split(':', 1)
                extractMessage(user.strip(), msg)
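For reference, the parser above assumes WhatsApp’s plain-text chat export format, where each message line starts with a timestamp ending in “AM -” or “PM -” (the exact layout varies by locale and app version). An illustrative line:

12/31/15, 9:45 PM - Hasanga: see you tomorrow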
Steps 2 and 3:
Then we extract the features from the text. We’ll look at the average number of words per sentence, the standard deviation of sentence length, and lexical diversity (the ratio of unique words to total words). Computing them is pretty easy with an NLP library like NLTK.
import nltk
import numpy as np
from scipy.cluster.vq import whiten

# `unknown` is assumed to hold the anonymous user's messages,
# extracted the same way as `pasan` and `hasanga`
txt = [pasan, hasanga, unknown]
feature_matrix = np.zeros((len(txt), 3), np.float64)

word_tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

for i, messages in enumerate(txt):
    # each entry is a list of messages; join into one string
    ch_text = ' '.join(messages)
    lowercased = ch_text.lower()
    # tokenize into words (punctuation stripped) and sentences
    words = word_tokenizer.tokenize(lowercased)
    sentences = nltk.sent_tokenize(ch_text)
    # get the unique words
    vocab = set(words)
    # words per sentence
    wps = np.array([len(word_tokenizer.tokenize(s)) for s in sentences])
    # average number of words per sentence
    feature_matrix[i, 0] = wps.mean()
    # standard deviation of sentence length
    feature_matrix[i, 1] = wps.std()
    # lexical diversity (type-token ratio)
    feature_matrix[i, 2] = len(vocab) / float(len(words))

# normalize the features
feature_matrix = whiten(feature_matrix)
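If SciPy’s whiten is unfamiliar: it divides each feature (column) by its standard deviation so that no single feature dominates the distance computation in the clustering step. Written out by hand, it’s just:

import numpy as np

def whiten_by_hand(features):
    # equivalent of scipy.cluster.vq.whiten: scale each column
    # (feature) to unit variance
    return features / features.std(axis=0)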
Step 4:
Let’s use scikit-learn’s implementation of the k-means algorithm to split the texts into two clusters, which gives us our binary classification.
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, init='k-means++', n_init=10, verbose=1)
print(km.fit(feature_matrix).labels_)
And you’ll get something like [1 1 0].
So the first two texts belong to the same user.
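To turn those cluster labels into an attribution, check which known user landed in the same cluster as the anonymous text. A sketch, assuming the ordering txt = [pasan, hasanga, unknown] from above:

labels = km.fit(feature_matrix).labels_
# labels[2] is the anonymous text; whichever known text shares
# its cluster is the closest stylistic match
suspect = "Pasan" if labels[0] == labels[2] else "Hasanga"
print("The anonymous text clusters with %s" % suspect)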
Limitations #
For the above approach to give good results, the corpus should be fairly large. WhatsApp being a short messaging service, chances are we only have a limited number of messages.
This is where things get hazy. We’ll probably have to consider multiple features, including sample-specific ones (say, emoticons), rather than relying on lexical features alone.
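For example, the relative frequency of a few common emoticons could be appended as an extra column of the feature matrix. A hypothetical sketch; the emoticon list would need tuning for the chats at hand:

import re

# hypothetical extra feature: emoticons per word, over a
# hand-picked list of common emoticons
EMOTICON = re.compile(r':\)|:\(|:D|:P|<3|xD')

def emoticon_rate(messages):
    text = ' '.join(messages)
    return len(EMOTICON.findall(text)) / float(max(len(text.split()), 1))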