Predicting Myers-Briggs Personality Types with Natural Language Processing

Oct 26, 2020

In recent years, social media has become one of the dominant channels of human expression and communication, serving individuals and mass audiences alike. Because these platforms connect people from every walk of life, they offer a wealth of insight into how people think and present themselves.

The Myers-Briggs personality test is based largely on Jung’s theory of psychological types. It consists of an introspective self-report questionnaire which serves to indicate differing psychological preferences in how people perceive the world and make decisions. It sorts people into sixteen possible types, each formed by combining one preference from each of four pairs:

Extroversion (E) or Introversion (I): refers to how people respond to and interact with the world around them. Extroverts tend to be social and interactive, whereas introverts tend to be more thought-oriented and prefer to recharge by spending time alone.

Sensing (S) or Intuition (N): refers to how people gather information. Sensing types tend to rely on their senses to gather information and often pay more attention to detail and reality. Intuitive types tend towards creativity, abstract theory and pattern recognition.

Thinking (T) or Feeling (F): refers to how people make decisions. Thinkers tend to evaluate the data and use logic to come to an objective conclusion, whereas Feelers tend to place more weight and consideration on people and the emotional implications of the decisions.

Judging (J) or Perceiving (P): refers to how people deal with the world around them and the decisions they make. Those who tend towards judging prefer decisive outcomes and structure/order. Those who tend towards perceiving tend to be more open-minded and adaptable.

The language that we use to express ourselves and communicate our thoughts to others can reveal a lot about how our minds work and the motives and emotions behind our words. Word choice is incredibly powerful and nuanced. The truth of this plays out all around us: advertising and marketing, political ad campaigns and speeches, academic writing, and music are a few of the most prominent examples.

Natural language processing is a subfield of linguistics, computer science and artificial intelligence which aims to let computers process and interpret natural language. There are many challenges in this field, given the vibrancy and variation of language. The complexity and ambiguity of language make it extremely difficult to turn into features and to predict from.

This project explores a narrower slice of the challenges of NLP by analyzing social media posts in an attempt to predict a user’s personality type as laid out by the Myers-Briggs test. There are sixteen possible outcomes for the target feature, which are being predicted based on over 8,000 posts on social media. The target feature was chosen for its limited number of possible outcomes as well as for the psychological element, which lends itself to classification based on verbal expression.

The primary challenge here is pre-processing the data. Many of the posts include links, emojis, slang, misspelt words, and undefined abbreviations. There are many techniques available to us as data scientists for addressing all of these issues, but it can be challenging to determine the “order of operations”.

After much consideration, I began by exploring the distribution of the sixteen personality types and then binarizing them, so that each of the four letter pairs became its own binary target. Next, the HTML tags and URLs were removed (although, with more time, perhaps these links could have been scraped and classified based on the content being shared). When processing language it is generally best to have everything in one case, especially when dealing with social media as opposed to an academic text, so the entire text was converted to lowercase.
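The steps in this paragraph can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the notebook's actual code: the function names, the regular expressions, and the choice of which letter maps to 1 are all assumptions made for the example.

```python
import re

# The four letter pairs, in the order they appear in a type string such as "INTP".
# Assumption: the first letter of each pair is encoded as 1, the second as 0.
AXES = ("IE", "NS", "FT", "JP")

TAG_RE = re.compile(r"<[^>]+>")                 # simple HTML tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # http(s) and bare www links


def binarize_type(mbti_type):
    """Turn a four-letter type into four 0/1 labels, one per axis."""
    letters = mbti_type.upper()
    return [1 if letter == axis[0] else 0 for letter, axis in zip(letters, AXES)]


def debinarize_type(labels):
    """Inverse of binarize_type: rebuild the four-letter type string."""
    return "".join(axis[0] if label == 1 else axis[1]
                   for label, axis in zip(labels, AXES))


def clean_post(text):
    """Strip HTML tags and URLs, lowercase, and collapse whitespace."""
    text = TAG_RE.sub(" ", text)   # tags first, so a tag glued to a URL is not swallowed
    text = URL_RE.sub(" ", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(binarize_type("INTP"))
    print(clean_post("Check <b>THIS</b> out: https://example.com/a?b=1 LOL"))
```

Encoding each axis separately is what allows four independent binary classifiers to be trained later, one per letter pair.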

The text was cleaned further by removing symbols and punctuation, keeping only the words. It was then lemmatized and stop words were removed for further clarity. Stop words are like fillers: very common words such as ‘a’, ‘and’, ‘the’, and so on, which lend nothing to an analysis of personality and word choice. Lemmatization reduces each word to its dictionary form (its lemma), taking the word’s inflection into account, which preserves the meaning and intention behind the word more completely than stemming, which simply trims suffixes.
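To make the difference between lemmatization and stemming concrete, here is a toy sketch. Everything in it is hypothetical: the stop-word set is a tiny hand-picked sample, `LEMMAS` is a small lookup table standing in for a dictionary-backed lemmatizer, and `naive_stem` is a deliberately crude suffix stripper included only for contrast.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a much fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "i"}

# Hypothetical lemma lookup standing in for a dictionary-backed lemmatizer.
LEMMAS = {
    "was": "be",
    "were": "be",
    "mice": "mouse",
    "better": "good",
    "studies": "study",
}


def tokenize(text):
    """Lowercase the text and keep only runs of letters (drops punctuation and symbols)."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Crude suffix stripping, shown only to contrast with lemmatization."""
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, then map each remaining word to its lemma if known."""
    kept = [token for token in tokenize(text) if token not in STOP_WORDS]
    return [LEMMAS.get(token, token) for token in kept]


if __name__ == "__main__":
    print(preprocess("The mice were studying, and I was better!"))
    print(naive_stem("studies"), "vs", LEMMAS["studies"])
```

The stemmer turns "studies" into the non-word "stud", while the lemma lookup returns "study"; that is the sense in which lemmatization keeps more of the word's meaning.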

CountVectorizer was then used to count the frequency of each word, and TfidfTransformer to reweight those counts into TF-IDF vectors, which down-weight words that appear in most posts and make the content easier to interpret and analyze. For example, after first completing this step, I realized that some of the most frequently used words were actually Myers-Briggs personality types (‘INTP’, ‘ESFP’, etc.), and it was necessary to go back and remove those before proceeding.
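A minimal sketch of this step with scikit-learn, assuming three made-up posts and a hypothetical helper `strip_type_mentions` that removes the sixteen type names (and their plurals) before vectorizing. The vectorizer settings here are illustrative, not the ones used in the notebook.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# All sixteen type names, lowercase: "intj", "intp", ..., "esfp".
MBTI_TYPES = [a + b + c + d for a in "ie" for b in "ns" for c in "tf" for d in "jp"]
TYPE_RE = re.compile(r"\b(?:" + "|".join(MBTI_TYPES) + r")s?\b")


def strip_type_mentions(text):
    """Lowercase the text and blank out type names such as 'intp' or 'esfps'."""
    return TYPE_RE.sub(" ", text.lower())


# Made-up example posts, not taken from the dataset.
posts = [
    "As an INTP I enjoy abstract theory and quiet evenings",
    "Party tonight with friends, ESFPs love a crowd",
    "Quiet evenings reading theory with friends",
]
cleaned = [strip_type_mentions(post) for post in posts]

# Step 1: raw word counts (scikit-learn's built-in English stop words are dropped).
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(cleaned)

# Step 2: reweight the counts into TF-IDF scores.
tfidf = TfidfTransformer().fit_transform(counts)

if __name__ == "__main__":
    print(sorted(vectorizer.vocabulary_))
    print(tfidf.shape)
```

Removing the type names matters because users often state their own type in their posts, which would otherwise leak the label straight into the features.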

XGBoost was used as the classifier, with an evaluation metric and early stopping monitored on the test set. I set the evaluation metric to logloss and early stopping rounds to 10, which helps to avoid overfitting by halting training once performance on the test dataset has stopped improving for 10 consecutive rounds. I then examined the feature importances for the first indicator, Extrovert/Introvert, to get an idea of which words were affecting that prediction and why that might be. The top features appear to correspond well with that indicator.
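XGBoost applies the early-stopping rule internally, but the idea is simple enough to sketch in plain Python. The two functions below are illustrative assumptions, not XGBoost's API: `log_loss` computes the binary log loss, and `best_round_with_early_stopping` walks through a list of per-round evaluation losses and stops once `patience` rounds pass without a new best.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def best_round_with_early_stopping(eval_losses, patience=10):
    """Return (best_round, best_loss), stopping after `patience` rounds without improvement."""
    best_round, best_loss = 0, float("inf")
    for round_index, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_round, best_loss = round_index, loss
        elif round_index - best_round >= patience:
            break  # no new best for `patience` rounds: stop training here
    return best_round, best_loss


if __name__ == "__main__":
    # Made-up loss curve: improves, then drifts upward as the model overfits.
    losses = [0.69, 0.60, 0.55, 0.56, 0.57, 0.58, 0.54]
    print(best_round_with_early_stopping(losses, patience=3))
```

With a patience of 3, training stops at round 5 and round 2 is kept as the best, so the later value of 0.54 is never reached. That is the trade-off behind choosing the number of early stopping rounds.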

I then tuned the gradient boosting hyperparameters using GridSearchCV with StratifiedKFold cross-validation to find the best configuration for my model. The resulting test scores were:
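A grid search of this kind might look like the sketch below. It rests on several assumptions: scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost to keep the example dependency-light, the data is synthetic (`make_classification`), and the parameter grid is an arbitrary small one rather than the grid used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced binary problem standing in for one personality indicator.
X, y = make_classification(
    n_samples=200, n_features=10, weights=[0.75, 0.25], random_state=0
)

# Arbitrary illustrative grid: 2 x 2 x 2 = 8 candidate configurations.
param_grid = {
    "n_estimators": [20, 40],
    "max_depth": [2, 3],
    "learning_rate": [0.1, 0.3],
}

# Stratified folds keep the class ratio roughly the same in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=cv,
)
search.fit(X, y)

if __name__ == "__main__":
    print(search.best_params_)
    print(round(search.best_score_, 3))
```

In the project itself this search would be repeated once per indicator, since each of the four letter pairs has its own binary model.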

* IE: Introversion (I) / Extroversion (E) Accuracy: 78.66%
* NS: Intuition (N) / Sensing (S) Accuracy: 86.03%
* FT: Feeling (F) / Thinking (T) Accuracy: 72.23%
* JP: Judging (J) / Perceiving (P) Accuracy: 66.12%

Validation scores were:

* IE: Introversion (I) / Extroversion (E) Accuracy: 78.83%
* NS: Intuition (N) / Sensing (S) Accuracy: 86.03%
* FT: Feeling (F) / Thinking (T) Accuracy: 72.23%
* JP: Judging (J) / Perceiving (P) Accuracy: 66.12%

Finally, I took real text from the posts of an Instagram account and placed it in a data frame. I set the parameters for XGBoost, trained a model for each personality indicator individually, and made a prediction. Interestingly enough, the result matched what I would have expected, though given that the test is self-reported, I suppose the best way to judge its accuracy would be to feed text from my own posts into the model. Myers-Briggs outlines each personality type in depth, and if you ever want to evaluate your own, I would recommend going here to read up on your results.

Notebook


Written by Kvinne Anc
