14. NLP (Natural Language Processing)
TF-IDF, Word2Vec, Sentiment Analysis
Learning Objectives
After completing this tutorial, you will be able to:
- Understand and apply text preprocessing techniques
- Implement text vectorization methods (BoW, TF-IDF)
- Perform traditional ML sentiment analysis (Naive Bayes, SVM, Logistic Regression)
- Understand concepts of deep learning sentiment analysis (LSTM, Transformer)
- Build practical review analysis pipelines
NLP Basic Concepts
Text Data Characteristics
| Characteristic | Description |
|---|---|
| Unstructured data | No fixed schema; text must be converted to numbers before modeling |
| High-dimensional | Dimensions equal to vocabulary size |
| Sparsity | Most entries in the resulting vectors are zero |
| Order matters | Word order has meaning |
NLP Pipeline
Raw Text → Preprocessing → Tokenization → Vectorization → Model Training → Prediction

1. Text Preprocessing
Text preprocessing is the most important first step in NLP: it transforms raw text into a form that models can process.
Basic Preprocessing Steps
| Step | Description | Example |
|---|---|---|
| Lowercase | Unify case | "Movie" → "movie" |
| Remove special chars | Remove punctuation | "great!" → "great" |
| Handle numbers | Remove or convert | "2024" → "" |
| Remove stopwords | Remove meaningless words | "the", "is", "a", etc. |
| Stemming | Extract stem | "running" → "run" |
| Lemmatization | Extract lemma | "better" → "good" |
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import re
# Download the required NLTK resources once:
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')
# Basic preprocessing function
def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove special characters
    text = re.sub(r'[^a-z\s]', '', text)
    # Multiple spaces to single
    text = re.sub(r'\s+', ' ', text).strip()
    return text
# Tokenization
tokens = word_tokenize("This is a sample sentence.")
# Stop words removal
stop_words = set(stopwords.words('english'))
filtered = [w for w in tokens if w.lower() not in stop_words]
# Stemming
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in filtered]
# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w) for w in filtered]

Be careful when removing stopwords for sentiment analysis! Negation words such as "not" and "no" reverse sentiment and carry important signal; removing them carelessly can degrade performance.
2. Bag of Words (BoW)
The most basic vectorization method: each document is represented by a vector of its word counts.
"I love this movie" → [1, 1, 1, 1, 0, 0, ...]
"I hate this movie" → [1, 0, 1, 1, 1, 0, ...]
Words: [I, love, this, movie, hate, ...]

from sklearn.feature_extraction.text import CountVectorizer
# documents: a list of raw text strings (e.g. the two example reviews above)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
print(f'BoW matrix shape: {X.shape}')
print(f'Vocabulary size: {len(vectorizer.vocabulary_)}')
print(vectorizer.get_feature_names_out())

3. TF-IDF
TF-IDF calculates word importance more meaningfully than simple frequency.
Formula
TF (Term Frequency): how often term t appears in document d
IDF (Inverse Document Frequency): log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing t; rarer terms get higher IDF
TF-IDF(t, d) = TF(t, d) x IDF(t)
TF-IDF core idea: common words ("the", "is") appear in almost every document and get low scores, while words concentrated in a few documents get high scores. This highlights the words that distinguish documents (see the worked example below).
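A minimal worked example of the formula above, hand-rolled for illustration (scikit-learn's TfidfVectorizer additionally smooths the IDF and L2-normalizes each document vector, so its values will differ):
import math
# Toy corpus: two tokenized documents
docs = [["i", "love", "this", "movie"],
        ["i", "hate", "this", "movie"]]
N = len(docs)
def tf(term, doc):
    return doc.count(term) / len(doc)      # term frequency within one document
def idf(term):
    df = sum(term in doc for doc in docs)  # number of documents containing the term
    return math.log(N / df)
# "this" appears in both documents -> idf = log(2/2) = 0 -> TF-IDF = 0
print(tf("this", docs[0]) * idf("this"))   # 0.0
# "love" appears in only one document -> idf = log(2/1) ≈ 0.69 -> distinguishes documents
print(tf("love", docs[0]) * idf("love"))   # ≈ 0.17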
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(documents)
# Check average word importance across the corpus
tfidf_means = pd.DataFrame({
    'word': tfidf.get_feature_names_out(),
    'tfidf_mean': np.array(X.mean(axis=0)).flatten()
}).sort_values('tfidf_mean', ascending=False)

4. Word Embedding
Represents words as dense vectors reflecting semantic similarity.
BoW/TF-IDF Limitations
- High-dimensional, sparse vectors
- Cannot reflect semantic similarity between words
Word2Vec Advantages
- Low-dimensional dense vectors (100~300 dimensions)
- Reflects semantic similarity
- Famous example: king - man + woman ≈ queen (see the sketch below)
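As a rough check of the analogy above, gensim's most_similar can be queried with pretrained vectors; this sketch assumes the glove-wiki-gigaword-100 vectors also used in the pre-trained embeddings snippet below (downloaded on first use):
import gensim.downloader as api
glove = api.load('glove-wiki-gigaword-100')   # pretrained 100-dimensional GloVe vectors
# king - man + woman ≈ ?
print(glove.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))
# 'queen' is expected at or near the top of the result list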
from gensim.models import Word2Vec
# Training
sentences = [doc.split() for doc in documents]
model = Word2Vec(sentences, vector_size=100, window=5,
                 min_count=2, workers=4)
# Similar words
model.wv.most_similar('king')
# Word vector
vector = model.wv['king']

Pre-trained Embeddings
# Using pre-trained embeddings like GloVe, FastText
import gensim.downloader as api
glove = api.load('glove-wiki-gigaword-100')
vector = glove['computer']

5. Sentiment Analysis Models (Traditional ML)
Naive Bayes
A powerful baseline model for text classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Pipeline combining preprocessing and model
nb_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', MultinomialNB())
])
nb_pipeline.fit(X_train, y_train)  # X_train: raw review texts, y_train: 0/1 labels
y_pred = nb_pipeline.predict(X_test)

Logistic Regression
from sklearn.linear_model import LogisticRegression
lr_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000))
])
lr_pipeline.fit(X_train, y_train)
y_pred = lr_pipeline.predict(X_test)

SVM
from sklearn.svm import LinearSVC
svm_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', LinearSVC())
])
svm_pipeline.fit(X_train, y_train)
y_pred = svm_pipeline.predict(X_test)

Using N-grams captures consecutive word patterns. ngram_range=(1, 2) uses both single words (unigrams) and two-word sequences (bigrams); for example, the bigram "not good" captures negated sentiment that the individual words miss (see the sketch below).
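A quick sketch of the features ngram_range=(1, 2) actually produces:
from sklearn.feature_extraction.text import CountVectorizer
docs = ["this movie is good", "this movie is not good"]
vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(docs)
# The vocabulary now contains unigrams and bigrams, e.g. 'good', 'not', 'not good'
print(vec.get_feature_names_out())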
6. Deep Learning NLP
LSTM for Sentiment
Recurrent neural network that learns sequence information.
"This movie is not good" vs "This movie is good"
→ Meaning differs by word order
→ LSTM learns sequence information

from tensorflow.keras import models, layers
model = models.Sequential([
    layers.Embedding(vocab_size, 128, input_length=max_len),
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(32),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
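The LSTM model above assumes integer-encoded, padded input sequences along with vocab_size and max_len. A minimal sketch of that preparation using Keras's TextVectorization layer (the texts and parameter values here are illustrative):
import tensorflow as tf
texts = ["this movie is good", "this movie is not good"]   # raw reviews (example data)
vocab_size, max_len = 10000, 100                            # illustrative limits
# Map raw strings to zero-padded integer sequences
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size, output_sequence_length=max_len)
vectorizer.adapt(texts)              # learn the vocabulary from the training texts
X = vectorizer(tf.constant(texts))   # shape: (num_texts, max_len)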
7. Transformer & BERT
The Transformer architecture is the core technology of modern NLP.
Attention Mechanism
- Learns relationships between all positions in input sequence
- Parallelizable across positions (see the sketch below)
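For intuition, the core computation is scaled dot-product attention, softmax(QKᵀ / √d_k)·V, which weights every position by its relevance to every other position. A minimal NumPy sketch:
import numpy as np
def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices of query/key/value vectors
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # position-to-position relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # each output mixes information from all positions
# 4 tokens, 8-dimensional vectors
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)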
BERT (Bidirectional Encoder Representations from Transformers)
- Pre-trained large-scale language model
- Solves various NLP tasks through fine-tuning
from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf
# Tokenizer (texts: list of strings, labels: list of integer class ids)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
inputs = tokenizer(texts, padding=True, truncation=True,
                   return_tensors='tf', max_length=128)
# Model
model = TFBertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2)
# Fine-tuning: the model outputs logits, so use from_logits=True;
# a small learning rate (e.g. 2e-5) is standard for BERT fine-tuning
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(dict(inputs), labels, epochs=3, batch_size=16)

Using Hugging Face Pipeline
from transformers import pipeline
sentiment_pipeline = pipeline('sentiment-analysis')
result = sentiment_pipeline('I love this movie!')
# [{'label': 'POSITIVE', 'score': 0.9998}]

Code Summary (Sentiment Analysis Pipeline)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
# Data
reviews = ["This movie is great!", "Terrible waste of time", ...]
labels = [1, 0, ...]
# Pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('classifier', LogisticRegression(max_iter=1000))
])
# Cross-validation
scores = cross_val_score(pipeline, reviews, labels, cv=5)
print(f"Accuracy: {scores.mean():.4f} (±{scores.std():.4f})")
# Train & Predict
pipeline.fit(reviews, labels)
predictions = pipeline.predict(["Amazing film!", "Boring movie"])

Text Representation Comparison
| Method | Pros | Cons |
|---|---|---|
| BoW | Simple, Fast | Sparse, Ignores order |
| TF-IDF | Emphasizes important words | Ignores order |
| Word2Vec | Captures meaning | Ignores context |
| BERT | Understands context | Slow, Resource heavy |
Selection Guide
| Situation | Recommended Method |
|---|---|
| Quick prototype | TF-IDF + LogReg |
| Semantic similarity | Word2Vec |
| Best performance | BERT |
| Sequence modeling | LSTM |
Practical Tips
Text Preprocessing Guide by Situation
| Situation | Preprocessing Method |
|---|---|
| Sentiment analysis | Be careful when removing stopwords ("not" is important!) |
| Document classification | TF-IDF + N-gram |
| Similarity measurement | Word Embedding |
| Large-scale data | BERT Fine-tuning |
Performance Improvement Tips
1. Get sufficient data: Recommend at least 1000+ samples, use data augmentation (back-translation, etc.)
2. Experiment with preprocessing: Adjust N-gram range, customize stopword list, compare stemming/lemmatization
3. Model ensemble: Combine multiple models (Stacking, Voting); see the sketch after this list
4. Use deep learning: Pre-trained models (BERT), Transfer Learning
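As a sketch of tip 3, a soft-voting ensemble over shared TF-IDF features (the specific model choices here are only an example):
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
ensemble = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('vote', VotingClassifier(
        estimators=[('nb', MultinomialNB()),
                    ('lr', LogisticRegression(max_iter=1000))],
        voting='soft'))   # average the predicted probabilities of both models
])
ensemble.fit(reviews, labels)   # reviews/labels as in the code summary above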
Interview Questions Preview
- What is TF-IDF and how is it calculated?
- What's the difference between Word2Vec's CBOW and Skip-gram?
- How is BERT different from previous models?
Check out more interview questions at Premium Interviews.
Practice Notebook
The notebook additionally covers:
- Practice with movie review sample data
- BoW and TF-IDF matrix visualization (heatmap)
- Model performance comparison (5-Fold Cross Validation)
- Feature importance analysis (Top 15 positive/negative words)
- N-gram (Bigram) analysis
- WordCloud visualization
- Real-time prediction function for new reviews
- Word Embedding concept visualization (2D projection)
Learning Complete!
You've completed all 14 ML tutorials. For deeper learning:
- Premium Practice - Practice Problems
- Premium Solutions - Solutions
- Premium Interviews - Interview Prep