Dear fellow Data Scientists,
Understanding how people feel about products, services, or events is a challenge many businesses face. Sentiment analysis offers a way to process text data — such as social media posts, reviews, and comments — and turn it into useful insights. It helps organizations measure public opinion and adapt to customer needs.
In this article, written by 👋 Ryan Jewik, we’ll explore sentiment analysis methods, including rule-based systems like VADER and machine learning models like Roberta. We’ll also look at common challenges, such as sarcasm and idioms, and discuss strategies to address them for better results.
Fundamentals of Sentiment Analysis
Extracting customer opinions or public sentiment from text used to be complex due to the intricacies of language. Sentiment analysis addresses this by transforming unstructured inputs, like product reviews or social media posts, into measurable insights. Tools such as VADER assist with straightforward tasks, while advanced models like Roberta handle subtle interpretations.
Your social media posts, comments, replies, and reviews all count as valuable data. Businesses rely on this information to make decisions and maximize profits. Even video and audio data can be processed for insights, but analyzing human language brings its own challenges. Language evolves constantly, varies by dialect and spelling, and is filled with subtle nuances that can confuse machines.
This is where Natural Language Processing (NLP) becomes indispensable. NLP processes raw text into a structured format that machines can interpret. Key steps include:
- Removing stop words: Eliminating common, non-informative words like “a,” “the,” or “to.”
- Tokenization: Breaking down text into smaller units like words or phrases.
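The two steps above can be sketched in plain Python. This is a simplified illustration, not how any particular library does it: the regular expression and the tiny stop-word list are assumptions made for the example, and NLTK ships far more complete versions of both.

```python
import re

# Small illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "to", "is", "and", "of"}

def tokenize(text):
    """Split text into lowercase word tokens and standalone punctuation marks."""
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("The update is great, and easy to use!")
filtered = remove_stop_words(tokens)
print(tokens)    # ['the', 'update', 'is', 'great', ',', 'and', 'easy', 'to', 'use', '!']
print(filtered)  # ['update', 'great', ',', 'easy', 'use', '!']
```

Tokenizing first and filtering second keeps the stop-word check a simple set lookup on each token.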
Sentiment analysis identifies emotional tone by analyzing patterns in the text. It assigns a polarity score on a bounded scale (VADER's compound score, for example, runs from -1 to +1, while some tools report -100 to +100) and typically labels the text as positive, negative, or neutral.
More sophisticated applications can expand classifications to detect emotions, such as joy or anger, through techniques like tone analysis. These often rely on lexicons — dictionaries that link specific words to corresponding sentiments or emotions.
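To make the idea of a lexicon concrete, here is a minimal sketch of a lexicon-based polarity scorer. The word list, the valence values, and the 0.1 labeling threshold are all invented for illustration; real lexicons contain thousands of entries with carefully rated scores.

```python
# Hypothetical mini-lexicon: word -> valence in [-1, 1].
LEXICON = {"love": 0.8, "great": 0.7, "good": 0.5,
           "slow": -0.4, "bad": -0.6, "crash": -0.8}

def polarity(text):
    """Average the valence of the words found in the lexicon (0.0 if none)."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

def label(score, threshold=0.1):
    """Map a polarity score to a coarse sentiment label."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

for review in ["I love this app!", "The app is slow.", "I opened the app."]:
    print(review, "->", label(polarity(review)))
```

Note that a word like "crashing" would be missed here because only "crash" is in the lexicon, which hints at why preprocessing and lexicon coverage matter.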
So how is this applied? Companies frequently analyze sentiment to understand how customers feel about their products or updates. An app might study reviews of its latest version, while businesses could evaluate tweets or comments for insights. These findings help shape upcoming changes or strategies.
But human language is difficult for a model to learn. Problems often arise when a text includes:
- Idioms: phrases like “break a leg” or “beating around the bush” can confuse a model because their meaning differs from the literal words.
- Sarcasm: a sarcastic review or tweet is likely to be classified incorrectly, especially by lexicon-based methods.
- Negation: a phrase like “not bad” can be misinterpreted, especially if the text is not tokenized properly.
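The negation problem is easy to reproduce. The sketch below compares a naive word-by-word lexicon score with one that flips the valence of a word directly preceded by a negation. The lexicon values and the negation list are assumptions for the example, and a plain sign flip is cruder than what production tools do.

```python
# Hypothetical lexicon and negation words, for illustration only.
LEXICON = {"good": 0.5, "bad": -0.6, "happy": 0.7}
NEGATIONS = {"not", "never", "no"}

def score_naive(tokens):
    """Sum word valences, ignoring context entirely."""
    return sum(LEXICON.get(t, 0.0) for t in tokens)

def score_with_negation(tokens):
    """Sum word valences, flipping the sign of a word right after a negation."""
    total = 0.0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0.0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        total += value
    return total

tokens = ["not", "bad"]
print(score_naive(tokens))          # -0.6: misread as negative
print(score_with_negation(tokens))  # 0.6: read as mildly positive
```

This only works if "not" and "bad" survive as separate, adjacent tokens, which is why tokenization quality affects negation handling.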
How are these drawbacks dealt with? Recall that lexicons are essentially dictionaries that tie words to sentiment classifications; a lexicon-based system scores a text by looking up the words it contains.
Addressing these challenges requires thoughtful approaches. Lexicon-based (rule-based) systems, which rely on predefined dictionaries, work well for simple cases but often miss contextual subtleties. Machine learning models, on the other hand, adapt better to complexity but demand large, high-quality datasets for effective training. Models such as Logistic Regression, Naive Bayes, and Support Vector Machines can all reduce the impact of these drawbacks, provided they are trained on quality data. Data quality is key: it can make the difference between a highly adaptive model that even picks up on some sarcasm and a model that is narrow in scope and classifies text poorly.
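As a rough illustration of the machine learning route, here is a tiny Naive Bayes text classifier written from scratch with add-one (Laplace) smoothing. The four training sentences are invented for the example and far too few for real use; in practice you would reach for a library implementation and a much larger labeled dataset.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label in label_counts:
        logp = math.log(label_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one smoothing avoids zero probabilities.
            logp += math.log((word_counts[label][word] + 1)
                             / (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Toy training data, invented for illustration.
TRAIN = [
    ("great app love it", "positive"),
    ("really helpful and fun", "positive"),
    ("terrible update keeps crashing", "negative"),
    ("slow and full of bugs", "negative"),
]
model = train_naive_bayes(TRAIN)
print(predict("love this helpful app", model))  # positive
print(predict("crashing and slow", model))      # negative
```

Unlike a lexicon, nothing here is hand-scored: the model's notion of which words are positive or negative comes entirely from the labeled examples, which is why training data quality matters so much.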
Hands-On Case Study
Let’s explore sentiment analysis techniques using a practical example: an app review for the Khan Academy Kids app from the Apple App Store. We’ll compare a rule-based approach (VADER) with a machine learning model (Roberta) to understand their strengths and weaknesses.
Preparing the Text
Let’s start with NLTK (Natural Language Toolkit), a popular Python library for Natural Language Processing. Using the app review shown below, we can walk through basic NLP techniques.
Tokenization
Tokenization is the process of splitting text into smaller units, such as words or phrases. This step helps us prepare the text for further analysis.
The review is broken into individual words and punctuation, which the model can now process.
Part-of-Speech (POS) Tagging
NLTK assigns each token a part of speech based on its role in the sentence.
The tags identify nouns, verbs, pronouns, and other grammatical components. For example:
- “Khan” is labeled as a proper noun (NNP).
- “saved” is recognized as a past-tense verb (VBD).
Chunking
Chunking builds upon Part-of-Speech (POS) tagging by grouping related parts of speech into meaningful units, such as noun phrases, verb phrases, or named entities. While POS tagging assigns a grammatical category to each word (e.g., noun or verb), chunking combines these tags to form higher-level structures. For example, a POS tag might label “Khan” as a proper noun (NNP) and “academy” as a verb (VB), but chunking can group them together as “Khan Academy” and label it as an organization.
In this example, chunking identifies “Khan Academy” as part of a named entity. This process helps simplify text by recognizing important entities or groups of words, which can then be analyzed for context or sentiment.
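The grouping idea can be sketched without any library by merging runs of consecutive proper nouns. The tagged tokens below are hand-written for illustration (they are not NLTK output or the actual review), and this single rule is a big simplification of NLTK's named-entity chunker, which relies on a trained model rather than one pattern.

```python
def chunk_proper_nouns(tagged):
    """Group runs of consecutive NNP-tagged tokens into single chunks."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag == "NNP":
            current.append(word)
        elif current:
            # A non-NNP token ends the current run.
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

# Hand-tagged tokens, invented for illustration.
tagged = [("Khan", "NNP"), ("Academy", "NNP"), ("saved", "VBD"),
          ("my", "PRP$"), ("summer", "NN")]
print(chunk_proper_nouns(tagged))  # ['Khan Academy']
```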
📌VADER — A Rule-Based Approach
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis tool designed specifically for social media posts and other short texts. Unlike the NLTK preprocessing steps above, which only tokenize and tag the text, VADER provides sentiment scores directly. It’s ideal for quick analyses where speed and simplicity are key.
VADER uses a predefined lexicon of words that are assigned positive, negative, or neutral sentiment scores. It also accounts for factors like punctuation, capitalization, and modifiers (e.g., “very,” “so”) to fine-tune its results. This makes it more adaptable to informal language often found in social media or customer reviews.
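To give a feel for how such rules interact, here is a toy scorer that adjusts a word's valence for a preceding intensifier, for ALL-CAPS emphasis, and for exclamation marks. Every number in it is made up for the example; VADER's actual lexicon, constants, and rules differ and are more elaborate.

```python
# Hypothetical values for illustration; not VADER's lexicon or constants.
LEXICON = {"happy": 0.6, "angry": -0.6}
BOOSTERS = {"very": 0.2, "so": 0.2}
CAPS_BONUS = 0.3
BANG_BONUS = 0.1

def score(text):
    """Lexicon score adjusted by intensifiers, capitalization, and '!'."""
    total = 0.0
    words = text.replace("!", " ").split()
    for i, word in enumerate(words):
        valence = LEXICON.get(word.lower(), 0.0)
        if valence == 0.0:
            continue
        sign = 1 if valence > 0 else -1
        # An intensifier right before the word pushes it away from zero.
        if i > 0:
            valence += sign * BOOSTERS.get(words[i - 1].lower(), 0.0)
        # ALL-CAPS sentiment words count as extra emphasis.
        if word.isupper():
            valence += sign * CAPS_BONUS
        total += valence
    # Up to three exclamation marks amplify the overall direction.
    bangs = min(text.count("!"), 3)
    if total > 0:
        total += BANG_BONUS * bangs
    elif total < 0:
        total -= BANG_BONUS * bangs
    return round(total, 2)

for text in ["I am happy", "I am so happy", "I am so HAPPY!", "I am very angry!"]:
    print(text, "->", score(text))
```

The point of the sketch is the ordering of the results: each added cue moves the score further from zero in the direction the sentiment word already points.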
Let’s analyze two sample sentences using VADER’s polarity_scores method:
The first sentence, “I am so so happy!”, has a high positive score (pos: 0.642) and a strong compound score (0.7492), indicating a clear positive sentiment. In contrast, the second sentence, “I am very angry!”, shows a high negative score (neg: 0.66) and a strongly negative compound score (-0.5974), reflecting intense negative sentiment. This demonstrates VADER's ability to capture emotional intensity through punctuation and modifiers like “so” or “very.”
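It helps to know why the compound score always stays between -1 and +1. To the best of my understanding of the open-source vaderSentiment code, the summed word valences are squashed with x / sqrt(x² + alpha), using alpha = 15. The sketch below reproduces that normalization on made-up raw sums; treat the constant as an assumption to check against the library source.

```python
import math

def compound(raw_sum, alpha=15.0):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Clamp to guard against floating-point overshoot.
    return max(-1.0, min(1.0, score))

# Made-up raw sums, for illustration only.
for raw in [0.0, 1.0, 4.0, -4.0, 100.0]:
    print(raw, "->", round(compound(raw), 4))
```

Because the curve flattens as the raw sum grows, piling on more sentiment words moves the compound score closer to the limit without ever exceeding it.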
📌Roberta — A Machine Learning Approach
Roberta (usually written RoBERTa), short for Robustly Optimized BERT Pretraining Approach, is a transformer-based machine learning model that excels at context-aware sentiment analysis. It is pre-trained on vast amounts of text, and the variant referenced in this article was additionally trained on Twitter data. This allows it to better understand nuances, slang, and context, making it highly effective for complex or informal text analysis.
To use Roberta for sentiment analysis, the text is tokenized and passed through a pre-trained Roberta model. The model outputs raw scores (logits) for negative, neutral, and positive sentiment, which are then converted to probabilities using a softmax function. The setup for Roberta is straightforward.
The helper function, polarity_scores_roberta, tokenizes the input text, processes it through the model, and retrieves the sentiment scores.
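The softmax step mentioned above is worth seeing in isolation. This sketch does not load Roberta; it applies softmax to three made-up raw scores and pairs them with labels in the negative, neutral, positive order described above.

```python
import math

LABELS = ["negative", "neutral", "positive"]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw model outputs, for illustration only.
logits = [-2.0, 0.5, 3.0]
probs = softmax(logits)
scores = dict(zip(LABELS, probs))
print({name: round(p, 3) for name, p in scores.items()})
print(max(scores, key=scores.get))  # positive
```

Softmax preserves the ranking of the raw scores, so the predicted label is the same before and after; what it adds is a set of values that can be read as probabilities.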
We applied both VADER and Roberta to analyze the sentiment of the same review of the Khan Academy Kids app from the Apple App Store.
The VADER results show a strong positive sentiment with a compound score of 0.8588. The relatively high neutral score (0.725) is not a measure of uncertainty: it reflects the share of the review that VADER's lexicon treats as neutral, so a review with only a few sentiment-bearing words can still score this way. In contrast, Roberta’s results, with an overwhelmingly high positive score (0.985) and a low neutral score (0.012), show the model assigning nearly all of its probability to the positive class, a more decisive classification of the same review.
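To turn a compound score into a label, the vaderSentiment documentation suggests cutoffs of +0.05 and -0.05. Here is a small sketch applying those conventional thresholds to the compound scores reported in this article; the function name is my own.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a label using the conventional cutoffs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores quoted earlier in the article.
print(label_from_compound(0.8588))   # positive (the app review)
print(label_from_compound(0.7492))   # positive ("I am so so happy!")
print(label_from_compound(-0.5974))  # negative ("I am very angry!")
```

The threshold is a tunable choice: widening the neutral band makes the classifier more conservative about calling a text positive or negative.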
References
- Natural Language Toolkit (NLTK):
Learn more about tokenization, POS tagging, and chunking with NLTK.
https://www.nltk.org/
- Guru99, POS Tagging and Chunking with NLTK:
A beginner-friendly guide to understanding POS tagging and chunking.
https://www.guru99.com/pos-tagging-chunking-nltk.html
- VADER Sentiment Analysis:
A detailed overview of the VADER sentiment analysis tool.
https://github.com/cjhutto/vaderSentiment
- Hugging Face, Roberta Model:
Explore Roberta models pre-trained for sentiment analysis.
https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment
- Towards Data Science, Tokenization in NLP:
An article discussing different tokenization techniques in NLP.
https://towardsdatascience.com/the-evolution-of-tokenization-in-nlp-byte-pair-encoding-in-nlp-d7621b9c1186
- AltexSoft, Sentiment Analysis Tools and Use Cases:
Overview of sentiment analysis applications in business.
https://www.altexsoft.com/blog/sentiment-analysis-types-tools-and-use-cases/