Pre-trained fastai LSTM vs uClassify.com predicting MBTI from Reddit texts

Mattias Östmar
Nov 7, 2019


Personality type theory is an elusive field of study. The notion that different people can be categorised into a number of psychological types seems to have been around as long as there have been humans. Some theories have been proven wrong. Some theories have even been dangerous.

Photo by Crawford Jolly on Unsplash

Luckily, today we tend to rely on a more scientific approach to studying personality, and the current state of affairs seems to favor the Five-Factor Model of personality, a.k.a. O.C.E.A.N or Big Five. Its origin is actually a meta-study of common variables across a large number of earlier theories.

Today, many would — in line with the promise of data science — also like to see similar findings derived from empirical clusterings of emergent traits in raw human behavioural data, i.e. data-driven theory.

Personality type theory is, I would say, a perfect example of the most hyped philosophy-of-science problem of our day, one that sits at the root of all data-science work: hypothesis first or data first?

Training AI to Predict Myers-Briggs Personality Types From Texts

A while ago a friend of mine, also interested in mind and machine learning, sent me a link to Viridiana Romero Martinez’s post Training AI to Predict Myers-Briggs Personality Types From Texts, where she describes a simple way to use self-typed Reddit users’ posts to predict their type with the easy-to-use fastai library. I wanted to compare that neural network approach with a Naive Bayesian one, since I know from experience that Naive Bayesian classifiers often do the job very well when it comes to classifying text.
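To make the comparison concrete, here is a minimal sketch of what a multinomial Naive Bayesian text classifier does under the hood; it is the same family of model that uClassify.com builds on. The class name, toy data and whitespace tokenization are my own illustration, not uClassify’s actual implementation:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Minimal multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.word_counts = defaultdict(Counter)      # word counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        self.totals = {c: sum(wc.values()) for c, wc in self.word_counts.items()}
        self.n_docs = len(labels)
        return self

    def predict(self, text):
        words = text.lower().split()
        v = len(self.vocab)
        best, best_score = None, float("-inf")
        for c in self.class_counts:
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.class_counts[c] / self.n_docs)
            for w in words:
                score += math.log((self.word_counts[c][w] + 1) /
                                  (self.totals[c] + v))
            if score > best_score:
                best, best_score = c, score
        return best

# Toy usage with made-up training sentences:
train = [
    ("i love planning ahead", "J"),
    ("structure and order", "J"),
    ("spontaneous fun party", "P"),
    ("party with friends", "P"),
]
clf = NaiveBayesText().fit(*zip(*train))
print(clf.predict("spontaneous party"))  # → "P"
```

Despite its simplicity — counting words and multiplying probabilities — this kind of model is a surprisingly strong baseline for text classification.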

My primary aim was not to reproduce Viridiana’s experiment but to compare a neural network approach with a Naive Bayesian one. I also wanted to see if staying truer to psychologist Carl Jung’s original type theory, by grouping the 16 MBTI categories into 8 cognitive functions, improved the results. For more balanced training of each category I also balanced the dataset, putting the same number of texts into each class. This differs from Viridiana’s experiment, but you can of course skip those steps in order to benchmark against her original setup.
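The grouping step can be sketched under one common convention: collapsing each four-letter MBTI type to its Jungian dominant cognitive function (Ni, Ne, Si, Se, Ti, Te, Fi, Fe), then downsampling for a balanced dataset. The function names and the (text, label) tuple shape here are my assumptions for illustration, not necessarily the repo’s actual code:

```python
import random
from collections import defaultdict

def dominant_function(mbti):
    """Map a four-letter MBTI code to its Jungian dominant cognitive function.
    The J/P letter tells which function is extraverted (J -> T/F, P -> S/N);
    extraverts lead with that function, introverts with the other one."""
    ei, sn, tf, jp = mbti.upper()
    extraverted = tf if jp == "J" else sn
    introverted = sn if jp == "J" else tf
    return extraverted + "e" if ei == "E" else introverted + "i"

def balance(samples):
    """Downsample every class to the size of the smallest one."""
    by_class = defaultdict(list)
    for text, label in samples:
        by_class[label].append(text)
    n = min(len(texts) for texts in by_class.values())
    return [(text, label)
            for label, texts in by_class.items()
            for text in random.sample(texts, n)]

print(dominant_function("INTJ"))  # → "Ni"
print(dominant_function("ENFP"))  # → "Ne"
```

Balancing by downsampling throws away data from the larger classes, which is part of the trade-off mentioned above against benchmarking with the original, unbalanced experiment.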

The full code used for this experiment can be found on GitHub, though it might be easier to follow if you look into each folder separately. I left out the CSV files containing the texts I downloaded, due to privacy concerns, and also the models produced by fastai, due to their large file size. You’ll have to run the code yourself to fetch the texts and produce the models.

  1. Download texts from subreddits (I got 9272 texts; Viridiana’s run resulted in 8675 texts).
  2. Sanitize the texts and create balanced and unbalanced training sets for comparison.
  3. Train and evaluate fastai’s neural network.
  4. Train and evaluate uClassify’s Naive Bayesian classifier.

So how did they compare?

Pre-trained fastai LSTM with transfer learning from Reddit texts

With 8 balanced categories (one for each cognitive function), fastai was the winner with an accuracy of 0.29, with uClassify.com not far behind at 0.25.

uClassify.com evaluation

So what do we make of all this?

Since fastai was doped with a pre-trained, state-of-the-art deep learning language model and also fine-tuned on the Reddit texts from the downloaded dataset, maybe it wasn’t a fair competition after all. Training a neural network with fastai also takes a lot more time and training data. “More data, better results” might be the simple explanation for the improvement, but it costs computing resources.

Using uClassify.com is of course dead simple compared with using a complex and versatile library such as fastai, where you have to write all the code yourself. uClassify.com even comes with a nifty web GUI for all the necessary steps!

Finally, with a 0.125 chance of guessing the right category at random, neither of the two approaches would be of much use for personality-prediction tasks anyhow, but hopefully we learned some methodological tricks here.
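For reference, the chance-level arithmetic behind that 0.125 figure: with 8 balanced classes a random guesser is right 1/8 of the time, so both classifiers do beat chance (roughly 2.3x and 2x), even though the absolute accuracies are low:

```python
# Chance level with 8 balanced classes, vs. the two observed accuracies.
n_classes = 8
chance = 1 / n_classes  # 0.125
for name, acc in [("fastai", 0.29), ("uClassify", 0.25)]:
    print(f"{name}: {acc:.2f} accuracy, {acc / chance:.2f}x chance")
```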

Special thanks to:

  • Viridiana Romero Martinez for sharing an interesting approach to studying personality type using publicly available data.
  • Jon Maiga for building uClassify.com to enable easy access to machine learning for a broader audience long before the tech giants did just that.
  • Mikael Huss for generously giving of your time and always pointing me in a fruitful direction in the world of machine learning and data science.


Written by Mattias Östmar

Technology, philosophy and nature. https://www.linkedin.com/in/mattiasostmar/
