Authorship Attribution of WhatsApp Users through Lexical Analysis

Imagine that you’re being messaged by a stranger. Your gut says it’s someone you already know. How do you figure out which one of your diabolical friends is pulling this off?

Problem #

Let’s generalize the problem. Given the chat histories of two users and a large enough corpus from an anonymous user, we want to find which user’s writing style is closer to that of the anonymous user.
It’s a good ol’ binary classification problem.

To recognize the person, we can analyze different features of their writing style.

Possible approaches #

Lexical features
Syntactic features
Bag of words model

Lexical analysis is what we’ll be using here. We can analyze properties like sentence length variation, the number of words per sentence, the use of punctuation, emoticons, etc.
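
A feature like punctuation use, for instance, takes only a few lines to measure. This is just a sketch; punctuation_rate is a hypothetical helper, not part of the pipeline below.

import string

def punctuation_rate(messages):
    # fraction of characters in a user's messages that are punctuation
    text = ' '.join(messages)
    return sum(1 for ch in text if ch in string.punctuation) / float(len(text))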

Syntactically, we can analyze the use of nouns, pronouns, adverbs, singular/plural words and so on.
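
A rough sketch of how that might look with NLTK’s part-of-speech tagger (pos_frequencies is an assumed helper, not something used later in this post):

import nltk

def pos_frequencies(messages):
    # relative frequency of nouns, pronouns and adverbs, using Penn Treebank tags
    tokens = nltk.word_tokenize(' '.join(messages).lower())
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    total = float(len(tags))
    return {
        'nouns': sum(t.startswith('NN') for t in tags) / total,
        'pronouns': sum(t.startswith('PRP') for t in tags) / total,
        'adverbs': sum(t.startswith('RB') for t in tags) / total,
    }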

In the bag of words model, we disregard the structure of the sentences and look only at which words are used.
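
For example, with scikit-learn’s CountVectorizer (the two strings below just stand in for each user’s joined chat history; we won’t use this approach in the rest of the post):

from sklearn.feature_extraction.text import CountVectorizer

corpora = ["where are you right now", "where did you go today"]  # placeholder texts
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpora)  # one row of word counts per user
print(bow.toarray())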

Solution #

So here’s what we’re going to do:

  1. Extract text from the chat histories.
  2. Analyze the text.
  3. Prepare the model.
  4. Classify.

Step 1:
Here’s the code I used to extract each user’s messages from the exported chat files.

import glob
import os

# glob does not expand "~", so expand the home directory explicitly
source = os.path.expanduser("~/projects/python/nlp/attribution/corpus/chats")
files = sorted(glob.glob(os.path.join(source, "archive.txt")))

hasanga = []
pasan = []

def extractMessage(user, msg):
    global hasanga, pasan
    msg = msg.replace('\n', ' ')
    msg = msg.replace('<Media omitted>', '')
    if user == "Pasan Missaka Ucsc":
        pasan.append(msg)
    elif user == "Hasanga":
        hasanga.append(msg)


for fn in files:
    with open(fn) as f:
        for line in f:
            # exported lines look like "date, time AM/PM - User: message"
            if 'M -' not in line:
                continue  # skip continuation lines of multi-line messages
            time, msg = line.split('M -', 1)
            if msg.count(':') == 1:
                user, msg = msg.split(':')
                user = user.strip()
                extractMessage(user, msg)


Steps 2 and 3:

Then we extract features from the text. We’ll look at the average number of words per sentence, the standard deviation of sentence length, and the diversity of words. Computing these is easy with an NLP library like NLTK.

import nltk
import numpy as np
from scipy.cluster.vq import whiten

word_tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
# requires the punkt data (nltk.download('punkt'))
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# unknown holds the anonymous user's messages, extracted the same way as above
txt = [pasan, hasanga, unknown]
feature_matrix = np.zeros((len(txt), 3), np.float64)

for i, messages in enumerate(txt):
    # join each user's messages into one text
    ch_text = ' '.join(messages)
    lowercased = ch_text.lower()

    # tokenize into words (punctuation stripped) and sentences
    words = word_tokenizer.tokenize(lowercased)
    sentences = sentence_tokenizer.tokenize(ch_text)

    # get the unique words
    vocab = set(words)

    # words per sentence
    wps = np.array([len(word_tokenizer.tokenize(s)) for s in sentences])

    # average number of words per sentence
    feature_matrix[i, 0] = wps.mean()
    # standard deviation of sentence length
    feature_matrix[i, 1] = wps.std()
    # lexical diversity: unique words / total words
    feature_matrix[i, 2] = len(vocab) / float(len(words))

# normalize the features so each column has unit variance
feature_matrix = whiten(feature_matrix)

Step 4:
Let’s use scikit-learn’s implementation of the k-means algorithm to split the three texts into two clusters.

from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, init='k-means++', n_init=10, verbose=1)
print(km.fit(feature_matrix).labels_)

And you’ll get something like 1 1 0

So the first two texts belong to the same user.

Limitations #

For the above approach to give good results, the corpus should be fairly large. WhatsApp being a short messaging service, there’s a chance we only have a limited number of messages.

This is where things get hazy. We’ll probably have to combine multiple features, including sample-specific ones (say, emoticons), rather than relying on lexical features alone. One way to bolt on such a feature is sketched below.
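
For example, an emoticon-frequency column could be stacked next to the three lexical features from steps 2 and 3 before normalizing. This is only a sketch; the emoticon list and the emoticon_rate helper are my own assumptions, not something the code above already has.

def emoticon_rate(messages):
    # emoticons per word; the list below is just an example, extend as needed
    emoticons = [':)', ':(', ':D', ':P', ';)', 'xD']
    text = ' '.join(messages)
    return sum(text.count(e) for e in emoticons) / float(len(text.split()))

# stack the new column onto the lexical features, then normalize everything again
extra = np.array([[emoticon_rate(messages)] for messages in txt])
feature_matrix = whiten(np.hstack([feature_matrix, extra]))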

 