
14. NLP (Natural Language Processing)

TF-IDF, Word2Vec, Sentiment Analysis


Learning Objectives

After completing this tutorial, you will be able to:

  • Understand and apply text preprocessing techniques
  • Implement text vectorization methods (BoW, TF-IDF)
  • Perform traditional ML sentiment analysis (Naive Bayes, SVM, Logistic Regression)
  • Understand concepts of deep learning sentiment analysis (LSTM, Transformer)
  • Build practical review analysis pipelines

NLP Basic Concepts

Text Data Characteristics

Characteristic    | Description
Unstructured data | No fixed structure
High-dimensional  | Dimensions equal to vocabulary size
Sparsity          | Mostly zeros (sparse)
Order matters     | Word order carries meaning

NLP Pipeline

Raw Text → Preprocessing → Tokenization → Vectorization → Model Training → Prediction

1. Text Preprocessing

Text preprocessing is the most important first step in NLP. It transforms raw text into a form that models can process.

Basic Preprocessing Steps

Step                 | Description              | Example
Lowercase            | Unify case               | "Movie" → "movie"
Remove special chars | Remove punctuation       | "great!" → "great"
Handle numbers       | Remove or convert        | "2024" → ""
Remove stopwords     | Remove meaningless words | "the", "is", "a", etc.
Stemming             | Extract stem             | "running" → "run"
Lemmatization        | Extract lemma            | "better" → "good"
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import re

# One-time download of the required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Basic preprocessing function
def preprocess_text(text):
    # Lowercase
    text = text.lower()
 
    # Remove special characters
    text = re.sub(r'[^a-z\s]', '', text)
 
    # Multiple spaces to single
    text = re.sub(r'\s+', ' ', text).strip()
 
    return text
 
# Tokenization
tokens = word_tokenize("This is a sample sentence.")
 
# Stop words removal
stop_words = set(stopwords.words('english'))
filtered = [w for w in tokens if w.lower() not in stop_words]
 
# Stemming
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in filtered]
 
# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w) for w in filtered]
⚠️ Be careful when removing stopwords for sentiment analysis! Negation words like "not" and "no" reverse sentiment, and removing them carelessly can degrade performance.
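One common workaround is to start from NLTK's stopword list but keep the negation words. A minimal sketch (the negations set here is illustrative, not exhaustive):

# Keep negation words out of the stopword list
negations = {'not', 'no', 'nor', 'never'}
custom_stopwords = set(stopwords.words('english')) - negations
filtered = [w for w in tokens if w.lower() not in custom_stopwords]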


2. Bag of Words (BoW)

The most basic vectorization method: each document is represented by the counts of its words.

"I love this movie"  →  [1, 1, 1, 1, 0, 0, ...]
"I hate this movie"  →  [1, 0, 1, 1, 1, 0, ...]

Words: [I, love, this, movie, hate, ...]
from sklearn.feature_extraction.text import CountVectorizer

# documents: a list of (preprocessed) text strings, e.g. the two example sentences above
documents = ["I love this movie", "I hate this movie"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
 
print(f'BoW matrix shape: {X.shape}')
print(f'Vocabulary size: {len(vectorizer.vocabulary_)}')
print(vectorizer.get_feature_names_out())

3. TF-IDF

TF-IDF calculates word importance more meaningfully than simple frequency.

Formula

TF (Term Frequency): Word frequency in document

TF(t, d) = \frac{\text{times word } t \text{ appears in document } d}{\text{total words in document } d}

IDF (Inverse Document Frequency): Inverse document frequency

IDF(t) = \log\frac{\text{total documents}}{\text{documents containing word } t}

TF-IDF(t, d) = TF(t, d) × IDF(t)

The core idea of TF-IDF: common words ("the", "is") appear in almost every document and get low scores, while words concentrated in a few documents get high scores. This highlights the words that distinguish documents.
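A small worked example using the textbook formula above (note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalization, so its numbers differ slightly):

import math

docs = [["i", "love", "this", "movie"],
        ["i", "hate", "this", "movie"],
        ["great", "movie"]]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

# "movie" appears in every doc -> IDF = log(3/3) = 0       -> TF-IDF = 0
# "love" appears in one doc    -> IDF = log(3/1) ≈ 1.10    -> TF-IDF ≈ 0.25 * 1.10 ≈ 0.27
print(tf("movie", docs[0]) * idf("movie", docs))  # 0.0
print(tf("love", docs[0]) * idf("love", docs))    # ≈ 0.275

scikit-learn's TfidfVectorizer automates this for an entire corpus: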

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(documents)

# Check average word importance across documents
tfidf_means = pd.DataFrame({
    'word': tfidf.get_feature_names_out(),
    'tfidf_mean': np.array(X.mean(axis=0)).flatten()
}).sort_values('tfidf_mean', ascending=False)

4. Word Embedding

Represents words as dense vectors reflecting semantic similarity.

BoW/TF-IDF Limitations

  • High-dimensional, sparse vectors
  • Cannot reflect semantic similarity between words

Word2Vec Advantages

  • Low-dimensional dense vectors (100~300 dimensions)
  • Reflects semantic similarity
  • Famous example: king - man + woman ≈ queen
from gensim.models import Word2Vec
 
# Training
sentences = [doc.split() for doc in documents]
model = Word2Vec(sentences, vector_size=100, window=5,
                 min_count=2, workers=4)
 
# Similar words (requires 'king' to appear in the training corpus)
model.wv.most_similar('king')
 
# Word vector
vector = model.wv['king']

Pre-trained Embeddings

# Using pre-trained embeddings like GloVe, FastText
import gensim.downloader as api
 
glove = api.load('glove-wiki-gigaword-100')
vector = glove['computer']
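The famous "king - man + woman ≈ queen" analogy can be checked directly on the pre-trained vectors loaded above:

# Vector arithmetic on GloVe vectors: king - man + woman
print(glove.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# 'queen' is typically the top result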

5. Sentiment Analysis Models (Traditional ML)

Naive Bayes

A powerful baseline model for text classification.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
 
# Pipeline combining preprocessing and model
nb_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', MultinomialNB())
])
 
nb_pipeline.fit(X_train, y_train)
y_pred = nb_pipeline.predict(X_test)
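Evaluating the predictions, assuming y_test holds the true test labels and the classes are encoded as 0 = negative, 1 = positive:

from sklearn.metrics import accuracy_score, classification_report

print(f'Accuracy: {accuracy_score(y_test, y_pred):.4f}')
print(classification_report(y_test, y_pred, target_names=['negative', 'positive']))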

Logistic Regression

from sklearn.linear_model import LogisticRegression
 
lr_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000))
])
 
lr_pipeline.fit(X_train, y_train)
y_pred = lr_pipeline.predict(X_test)

SVM

from sklearn.svm import LinearSVC
 
svm_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', LinearSVC())
])
 
svm_pipeline.fit(X_train, y_train)
y_pred = svm_pipeline.predict(X_test)

Using N-grams captures patterns of consecutive words. ngram_range=(1, 2) uses both single words (unigrams) and two-word sequences (bigrams); for example, the bigram "not good" captures the negative meaning that the separate unigrams "not" and "good" miss, as the quick check below shows.
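A quick check of which features ngram_range=(1, 2) extracts from the example sentence:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(["this movie is not good"])
print(vec.get_feature_names_out())
# ['good' 'is' 'is not' 'movie' 'movie is' 'not' 'not good' 'this' 'this movie']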


6. Deep Learning NLP

LSTM for Sentiment

Recurrent neural network that learns sequence information.

"This movie is not good" vs "This movie is good"
→ Meaning differs by word order
→ LSTM learns sequence information
from tensorflow.keras import models, layers
 
model = models.Sequential([
    layers.Embedding(vocab_size, 128, input_length=max_len),
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(32),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])
 
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
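The model above expects integer-encoded, padded sequences. A minimal sketch of producing them, assuming train_texts is a list of raw review strings and the vocab_size/max_len values are illustrative:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size, max_len = 10000, 200          # illustrative values
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(train_texts)       # train_texts: list of raw review strings
sequences = tokenizer.texts_to_sequences(train_texts)
X_train = pad_sequences(sequences, maxlen=max_len, padding='post')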

7. Transformer & BERT

Core technology of modern NLP.

Attention Mechanism

  • Learns relationships between all positions in input sequence
  • Parallelizable
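A minimal NumPy sketch of scaled dot-product attention, the core operation (single head, no masking or learned projections, for illustration only):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # every position attends to every other
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # weighted sum of value vectors

# Toy example: 4 tokens with 8-dimensional representations
Q = K = V = np.random.rand(4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)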

BERT (Bidirectional Encoder Representations from Transformers)

  • Pre-trained large-scale language model
  • Solves various NLP tasks through fine-tuning
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

# Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
inputs = tokenizer(texts, padding=True, truncation=True,
                   return_tensors='tf', max_length=128)

# Model
model = TFBertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2)

# Fine-tuning (the model outputs logits, so use from_logits=True)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(dict(inputs), labels, epochs=3, batch_size=16)

Using Hugging Face Pipeline

from transformers import pipeline
 
sentiment_pipeline = pipeline('sentiment-analysis')
result = sentiment_pipeline('I love this movie!')
# [{'label': 'POSITIVE', 'score': 0.9998}]

Code Summary (Sentiment Analysis Pipeline)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
 
# Data
reviews = ["This movie is great!", "Terrible waste of time", ...]
labels = [1, 0, ...]
 
# Pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('classifier', LogisticRegression(max_iter=1000))
])
 
# Cross-validation
scores = cross_val_score(pipeline, reviews, labels, cv=5)
print(f"Accuracy: {scores.mean():.4f}{scores.std():.4f})")
 
# Train & Predict
pipeline.fit(reviews, labels)
predictions = pipeline.predict(["Amazing film!", "Boring movie"])

Text Representation Comparison

Method   | Pros                       | Cons
BoW      | Simple, fast               | Sparse, ignores order
TF-IDF   | Emphasizes important words | Ignores order
Word2Vec | Captures meaning           | Ignores context
BERT     | Understands context        | Slow, resource-heavy

Selection Guide

Situation           | Recommended Method
Quick prototype     | TF-IDF + LogReg
Semantic similarity | Word2Vec
Best performance    | BERT
Sequence modeling   | LSTM

Practical Tips

Text Preprocessing Guide by Situation

Situation               | Preprocessing Method
Sentiment analysis      | Be careful removing stopwords ("not" is important!)
Document classification | TF-IDF + N-grams
Similarity measurement  | Word embeddings
Large-scale data        | BERT fine-tuning

Performance Improvement Tips

1. Get sufficient data: Recommend at least 1000+ samples, use data augmentation (back-translation, etc.)

2. Experiment with preprocessing: Adjust N-gram range, customize stopword list, compare stemming/lemmatization

3. Model ensemble: Combine multiple models (Stacking, Voting); see the sketch after this list

4. Use deep learning: Pre-trained models (BERT), Transfer Learning
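A minimal sketch of a voting ensemble over the three TF-IDF classifiers from section 5. Hard voting is used because LinearSVC has no predict_proba; X_train and y_train are the raw review texts and labels as before.

from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

ensemble = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('vote', VotingClassifier(
        estimators=[('nb', MultinomialNB()),
                    ('lr', LogisticRegression(max_iter=1000)),
                    ('svm', LinearSVC())],
        voting='hard'))   # hard voting: majority vote over predicted labels
])
ensemble.fit(X_train, y_train)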


Interview Questions Preview

  1. What is TF-IDF and how is it calculated?
  2. What's the difference between Word2Vec's CBOW and Skip-gram?
  3. How is BERT different from previous models?

Check out more interview questions at Premium Interviews.


Practice Notebook

The notebook additionally covers:

  • Practice with movie review sample data
  • BoW and TF-IDF matrix visualization (heatmap)
  • Model performance comparison (5-Fold Cross Validation)
  • Feature importance analysis (Top 15 positive/negative words)
  • N-gram (Bigram) analysis
  • WordCloud visualization
  • Real-time prediction function for new reviews
  • Word Embedding concept visualization (2D projection)



Learning Complete!

You've completed all 14 ML tutorials. For deeper learning: