Authorship Attribution of WhatsApp Users through Lexical Analysis
Imagine that you’re being messaged by a stranger. Your gut says that it’s someone you already know. How do you find out which one of your diabolical friends is pulling this off?
Problem
Let’s generalize the problem. Given the chat histories of two users and a large enough corpus from an anonymous user, we want to find which of the two users has the writing style most similar to that of the anonymous user.
It’s a good ol’ binary classification problem.
To recognize the person, we can analyze different features of their writing style.
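To make the binary framing concrete, here is a minimal sketch, assuming each user has already been summarized as a vector of style features. The function names, the feature choice, and the numbers are all made up for illustration, and picking the nearer profile by Euclidean distance is only one simple way to decide, not necessarily the method used later.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def attribute(anon, profile_a, profile_b):
    """Return "A" or "B", whichever known profile is closer to the anonymous one."""
    if euclidean(anon, profile_a) <= euclidean(anon, profile_b):
        return "A"
    return "B"

# Hypothetical style profiles (invented numbers):
# [average words per message, punctuation marks per message, emoticons per message]
user_a = [6.2, 1.1, 0.8]
user_b = [14.5, 2.9, 0.1]
anonymous = [7.0, 1.3, 0.6]

print(attribute(anonymous, user_a, user_b))  # prints "A"
```

The features here are left unscaled, so the one with the largest numeric range dominates the distance. A real comparison would likely normalize each feature first.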
Possible approaches
Lexical features
Syntactic features
Bag of words model
Lexical analysis is what we’ll be using here. We can analyze properties such as sentence length variation, number of words per sentence, and the use of punctuation and emoticons.
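The lexical properties listed above could be computed roughly as follows. This is a sketch under stated assumptions: the function name `lexical_features`, the emoticon pattern, the punctuation set, and the naive sentence splitting are illustrative choices, not a definitive feature set.

```python
import re
import statistics

# Assumption: a tiny pattern for text emoticons such as :) ;-) :D :P.
# Real WhatsApp chats would also need Unicode emoji handling.
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")
SENTENCE_SPLIT_RE = re.compile(r"[.!?]+")  # naive sentence boundary
WORD_RE = re.compile(r"[A-Za-z']+")

def lexical_features(messages):
    """Summarize a list of message strings as a dict of lexical statistics."""
    sentence_lengths = []
    punctuation = 0
    emoticons = 0
    for msg in messages:
        emoticons += len(EMOTICON_RE.findall(msg))
        # Remove emoticons first so ":" and ")" are not counted as punctuation.
        cleaned = EMOTICON_RE.sub(" ", msg)
        punctuation += sum(cleaned.count(ch) for ch in ".,!?;:")
        for chunk in SENTENCE_SPLIT_RE.split(cleaned):
            words = WORD_RE.findall(chunk)
            if words:
                sentence_lengths.append(len(words))
    n = len(messages)
    return {
        "words_per_sentence": statistics.mean(sentence_lengths) if sentence_lengths else 0.0,
        "sentence_length_stdev": statistics.pstdev(sentence_lengths) if len(sentence_lengths) > 1 else 0.0,
        "punctuation_per_message": punctuation / n if n else 0.0,
        "emoticons_per_message": emoticons / n if n else 0.0,
    }

features = lexical_features(["Hey! How are you?", "ok :)"])
print(features["emoticons_per_message"])    # 0.5
print(features["punctuation_per_message"])  # 1.0
```

Running this once per user and once on the anonymous corpus gives three comparable feature dictionaries.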
Syntactically, we can analyze the use of nouns, pronouns, adverbs, singular and plural words, and so on.
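Proper syntactic features need a part-of-speech tagger, which is outside the standard library. As a rough stand-in, the sketch below counts two crude proxies: words from a small hand-picked pronoun list, and words ending in "ly" as a guess at adverbs. Both the list and the suffix heuristic are assumptions for illustration. The heuristic will mislabel words like "family", and contractions such as "I'm" are not counted as pronouns.

```python
import re

# Assumption: a short, non-exhaustive list of English personal pronouns.
PRONOUNS = {
    "i", "me", "my", "mine", "you", "your", "yours",
    "he", "him", "his", "she", "her", "hers", "it", "its",
    "we", "us", "our", "ours", "they", "them", "their", "theirs",
}
WORD_RE = re.compile(r"[a-z']+")

def syntactic_proxies(messages):
    """Return pronoun and "-ly" word rates as fractions of all words."""
    words = [w for msg in messages for w in WORD_RE.findall(msg.lower())]
    total = len(words)
    if total == 0:
        return {"pronoun_rate": 0.0, "ly_adverb_rate": 0.0}
    pronouns = sum(1 for w in words if w in PRONOUNS)
    # Crude adverb guess: longer words ending in "ly" (not a real POS tag).
    ly_words = sum(1 for w in words if len(w) > 3 and w.endswith("ly"))
    return {
        "pronoun_rate": pronouns / total,
        "ly_adverb_rate": ly_words / total,
    }

rates = syntactic_proxies(["I really like your dog", "They usually walk slowly"])
print(rates)  # both rates are 3/9 for this toy input
```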
Bag of words model is...