
14. NLP (Natural Language Processing)

TF-IDF, Word2Vec, Sentiment Analysis


Learning Objectives

After completing this tutorial, you will be able to:

  • Understand and apply text preprocessing techniques
  • Implement text vectorization methods (BoW, TF-IDF)
  • Perform traditional ML sentiment analysis (Naive Bayes, SVM, Logistic Regression)
  • Understand concepts of deep learning sentiment analysis (LSTM, Transformer)
  • Build practical review analysis pipelines

NLP Basic Concepts

Text Data Characteristics

Characteristic    | Description
Unstructured data | No fixed structure
High-dimensional  | Dimensions equal to vocabulary size
Sparsity          | Mostly zeros (sparse)
Order matters     | Word order carries meaning

NLP Pipeline

Raw Text → Preprocessing → Tokenization → Vectorization → Model Training → Prediction

1. Text Preprocessing

Text preprocessing is the most important first step in NLP. It transforms raw text into a form that models can process.

Basic Preprocessing Steps

Step                 | Description              | Example
Lowercase            | Unify case               | "Movie" → "movie"
Remove special chars | Remove punctuation       | "great!" → "great"
Handle numbers       | Remove or convert        | "2024" → ""
Remove stopwords     | Remove meaningless words | "the", "is", "a", etc.
Stemming             | Extract stem             | "running" → "run"
Lemmatization        | Extract lemma            | "better" → "good"
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import re

# One-time download of the required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Basic preprocessing function
def preprocess_text(text):
    # Lowercase
    text = text.lower()
 
    # Remove special characters
    text = re.sub(r'[^a-z\s]', '', text)
 
    # Multiple spaces to single
    text = re.sub(r'\s+', ' ', text).strip()
 
    return text
 
# Tokenization
tokens = word_tokenize("This is a sample sentence.")
 
# Stop words removal
stop_words = set(stopwords.words('english'))
filtered = [w for w in tokens if w.lower() not in stop_words]
 
# Stemming
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in filtered]
 
# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w) for w in filtered]
⚠️ Be careful when removing stopwords for sentiment analysis! Negation words like "not" and "no" reverse sentiment, and removing them carelessly can degrade performance.
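One common workaround is to start from NLTK's stopword list but keep the negation words. A minimal sketch (the negations set here is illustrative, not exhaustive):

# Keep negation words out of the stopword list
negations = {'not', 'no', 'nor', 'never'}
custom_stopwords = set(stopwords.words('english')) - negations
filtered = [w for w in tokens if w.lower() not in custom_stopwords]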


2. Bag of Words (BoW)

The most basic vectorization method: each document is represented by the counts of its words.

"I love this movie"  →  [1, 1, 1, 1, 0, 0, ...]
"I hate this movie"  →  [1, 0, 1, 1, 1, 0, ...]

Words: [I, love, this, movie, hate, ...]
from sklearn.feature_extraction.text import CountVectorizer

# documents: a list of (preprocessed) text strings, e.g. the two example sentences above
documents = ["I love this movie", "I hate this movie"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
 
print(f'BoW matrix shape: {X.shape}')
print(f'Vocabulary size: {len(vectorizer.vocabulary_)}')
print(vectorizer.get_feature_names_out())

3. TF-IDF

TF-IDF calculates word importance more meaningfully than simple frequency.

Formula

TF (Term Frequency): Word frequency in document

TF(t, d) = \frac{\text{times word } t \text{ appears in document } d}{\text{total words in document } d}

IDF (Inverse Document Frequency): Inverse document frequency

IDF(t) = \log\frac{\text{total documents}}{\text{documents containing word } t}

TF-IDF(t, d) = TF(t, d) × IDF(t)

The core idea of TF-IDF: common words ("the", "is") appear in almost every document and get low scores, while words concentrated in a few documents get high scores. This highlights the words that distinguish documents.
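A small worked example using the textbook formula above (note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalization, so its numbers differ slightly):

import math

docs = [["i", "love", "this", "movie"],
        ["i", "hate", "this", "movie"],
        ["great", "movie"]]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

# "movie" appears in every doc -> IDF = log(3/3) = 0       -> TF-IDF = 0
# "love" appears in one doc    -> IDF = log(3/1) ≈ 1.10    -> TF-IDF ≈ 0.25 * 1.10 ≈ 0.27
print(tf("movie", docs[0]) * idf("movie", docs))  # 0.0
print(tf("love", docs[0]) * idf("love", docs))    # ≈ 0.275

scikit-learn's TfidfVectorizer automates this for an entire corpus: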

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(documents)

# Check average word importance across documents
tfidf_means = pd.DataFrame({
    'word': tfidf.get_feature_names_out(),
    'tfidf_mean': np.array(X.mean(axis=0)).flatten()
}).sort_values('tfidf_mean', ascending=False)

4. Word Embedding

Represents words as dense vectors reflecting semantic similarity.

BoW/TF-IDF Limitations

  • High-dimensional, sparse vectors
  • Cannot reflect semantic similarity between words

Word2Vec Advantages

  • Low-dimensional dense vectors (100~300 dimensions)
  • Reflects semantic similarity
  • Famous example: king - man + woman ≈ queen
from gensim.models import Word2Vec
 
# Training
sentences = [doc.split() for doc in documents]
model = Word2Vec(sentences, vector_size=100, window=5,
                 min_count=2, workers=4)
 
# Similar words (requires 'king' to appear in the training corpus)
model.wv.most_similar('king')
 
# Word vector
vector = model.wv['king']

Pre-trained Embeddings

# Using pre-trained embeddings like GloVe, FastText
import gensim.downloader as api
 
glove = api.load('glove-wiki-gigaword-100')
vector = glove['computer']
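The famous "king - man + woman ≈ queen" analogy can be checked directly on the pre-trained vectors loaded above:

# Vector arithmetic on GloVe vectors: king - man + woman
print(glove.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# 'queen' is typically the top result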

5. Sentiment Analysis Models (Traditional ML)

Naive Bayes

A powerful baseline model for text classification.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
 
# Pipeline combining preprocessing and model
nb_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', MultinomialNB())
])
 
nb_pipeline.fit(X_train, y_train)
y_pred = nb_pipeline.predict(X_test)
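Evaluating the predictions, assuming y_test holds the true test labels and the classes are encoded as 0 = negative, 1 = positive:

from sklearn.metrics import accuracy_score, classification_report

print(f'Accuracy: {accuracy_score(y_test, y_pred):.4f}')
print(classification_report(y_test, y_pred, target_names=['negative', 'positive']))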

Logistic Regression

from sklearn.linear_model import LogisticRegression
 
lr_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000))
])
 
lr_pipeline.fit(X_train, y_train)
y_pred = lr_pipeline.predict(X_test)

SVM

from sklearn.svm import LinearSVC
 
svm_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', LinearSVC())
])
 
svm_pipeline.fit(X_train, y_train)
y_pred = svm_pipeline.predict(X_test)

Using N-grams captures patterns of consecutive words. ngram_range=(1, 2) uses both single words (unigrams) and two-word sequences (bigrams); for example, the bigram "not good" captures the negative meaning that the separate unigrams "not" and "good" miss, as the quick check below shows.
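A quick check of which features ngram_range=(1, 2) extracts from the example sentence:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(["this movie is not good"])
print(vec.get_feature_names_out())
# ['good' 'is' 'is not' 'movie' 'movie is' 'not' 'not good' 'this' 'this movie']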


6. Deep Learning NLP

LSTM for Sentiment

Recurrent neural network that learns sequence information.

"This movie is not good" vs "This movie is good"
→ Meaning differs by word order
→ LSTM learns sequence information
from tensorflow.keras import models, layers
 
model = models.Sequential([
    layers.Embedding(vocab_size, 128, input_length=max_len),
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(32),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])
 
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
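The model above expects integer-encoded, padded sequences. A minimal sketch of producing them, assuming train_texts is a list of raw review strings and the vocab_size/max_len values are illustrative:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size, max_len = 10000, 200          # illustrative values
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(train_texts)       # train_texts: list of raw review strings
sequences = tokenizer.texts_to_sequences(train_texts)
X_train = pad_sequences(sequences, maxlen=max_len, padding='post')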

7. Transformer & BERT

Core technology of modern NLP.

Attention Mechanism

  • Learns relationships between all positions in input sequence
  • Parallelizable
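A minimal NumPy sketch of scaled dot-product attention, the core operation (single head, no masking or learned projections, for illustration only):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # every position attends to every other
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # weighted sum of value vectors

# Toy example: 4 tokens with 8-dimensional representations
Q = K = V = np.random.rand(4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)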

BERT (Bidirectional Encoder Representations from Transformers)

  • Pre-trained large-scale language model
  • Solves various NLP tasks through fine-tuning
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

# Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
inputs = tokenizer(texts, padding=True, truncation=True,
                   return_tensors='tf', max_length=128)

# Model
model = TFBertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2)

# Fine-tuning (the model outputs logits, so use from_logits=True)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(dict(inputs), labels, epochs=3, batch_size=16)

Using Hugging Face Pipeline

from transformers import pipeline
 
sentiment_pipeline = pipeline('sentiment-analysis')
result = sentiment_pipeline('I love this movie!')
# [{'label': 'POSITIVE', 'score': 0.9998}]

Code Summary (Sentiment Analysis Pipeline)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
 
# Data
reviews = ["This movie is great!", "Terrible waste of time", ...]
labels = [1, 0, ...]
 
# Pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('classifier', LogisticRegression(max_iter=1000))
])
 
# Cross-validation
scores = cross_val_score(pipeline, reviews, labels, cv=5)
print(f"Accuracy: {scores.mean():.4f}{scores.std():.4f})")
 
# Train & Predict
pipeline.fit(reviews, labels)
predictions = pipeline.predict(["Amazing film!", "Boring movie"])

Text Representation Comparison

Method   | Pros                       | Cons
BoW      | Simple, fast               | Sparse, ignores order
TF-IDF   | Emphasizes important words | Ignores order
Word2Vec | Captures meaning           | Ignores context
BERT     | Understands context        | Slow, resource-heavy

Selection Guide

Situation           | Recommended Method
Quick prototype     | TF-IDF + LogReg
Semantic similarity | Word2Vec
Best performance    | BERT
Sequence modeling   | LSTM

Practical Tips

Text Preprocessing Guide by Situation

Situation               | Preprocessing Method
Sentiment analysis      | Be careful removing stopwords ("not" is important!)
Document classification | TF-IDF + N-grams
Similarity measurement  | Word embeddings
Large-scale data        | BERT fine-tuning

Performance Improvement Tips

1. Get sufficient data: Recommend at least 1000+ samples, use data augmentation (back-translation, etc.)

2. Experiment with preprocessing: Adjust N-gram range, customize stopword list, compare stemming/lemmatization

3. Model ensemble: Combine multiple models (Stacking, Voting); see the sketch after this list

4. Use deep learning: Pre-trained models (BERT), Transfer Learning
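A minimal sketch of a voting ensemble over the three TF-IDF classifiers from section 5. Hard voting is used because LinearSVC has no predict_proba; X_train and y_train are the raw review texts and labels as before.

from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

ensemble = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('vote', VotingClassifier(
        estimators=[('nb', MultinomialNB()),
                    ('lr', LogisticRegression(max_iter=1000)),
                    ('svm', LinearSVC())],
        voting='hard'))   # hard voting: majority vote over predicted labels
])
ensemble.fit(X_train, y_train)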


Interview Questions Preview

  1. What is TF-IDF and how is it calculated?
  2. What's the difference between Word2Vec's CBOW and Skip-gram?
  3. How is BERT different from previous models?

Check out more interview questions at Premium Interviews.


Practice Notebook

The notebook additionally covers:

  • Practice with movie review sample data
  • BoW and TF-IDF matrix visualization (heatmap)
  • Model performance comparison (5-Fold Cross Validation)
  • Feature importance analysis (Top 15 positive/negative words)
  • N-gram (Bigram) analysis
  • WordCloud visualization
  • Real-time prediction function for new reviews
  • Word Embedding concept visualization (2D projection)



Learning Complete!

You've completed all 14 ML tutorials. For deeper learning: