This project performs sentiment analysis on the IMDB Dataset of 50,000 movie reviews. The dataset is labeled with binary sentiments: positive or negative.
We go through a full machine learning pipeline using Natural Language Processing (NLP) techniques to classify review sentiment.
- File: IMDB Dataset.csv
- Columns:
  - review: Text content of a user review
  - sentiment: Label (positive or negative)
- Data Loading
  - Read and display the structure of the dataset
- Data Preprocessing
  - Remove HTML tags, punctuation, and stopwords
  - Tokenization, lowercasing, and stemming
- Exploratory Data Analysis
  - Sentiment distribution
  - Word clouds for positive and negative reviews
  - Review length analysis
- Text Vectorization
  - TF-IDF for numerical feature extraction
- Model Training
  - Trained a Logistic Regression model
  - Achieved roughly 85% accuracy on test data
- Evaluation
  - Classification report
  - Confusion matrix visualization
Visualizations:
- Sentiment distribution bar plot
- Histogram of review lengths
- Confusion matrix heatmap
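
The preprocessing step listed above (HTML removal, punctuation removal, stopword filtering, tokenization, lowercasing, stemming) might look roughly like the sketch below. It is not the project's actual code: the function name `clean_review`, the regular expressions, and the tiny `STOPWORDS` set are assumptions made for illustration. Stemming uses NLTK's `PorterStemmer`, which needs no extra data download.

```python
import re

from nltk.stem import PorterStemmer

# Small illustrative stopword set (an assumption for this sketch); a fuller
# list such as NLTK's English stopwords could be swapped in.
STOPWORDS = {"a", "an", "the", "is", "was", "this", "that", "it", "and", "of", "to", "in"}

stemmer = PorterStemmer()


def clean_review(text):
    """Return a list of stemmed, lowercased tokens for one review."""
    text = re.sub(r"<[^>]+>", " ", text)    # drop HTML tags such as <br />
    text = text.lower()                     # lowercase before filtering
    text = re.sub(r"[^a-z\s]", " ", text)   # drop punctuation and digits
    tokens = text.split()                   # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [stemmer.stem(t) for t in tokens]


tokens = clean_review("This movie was GREAT!<br />I loved the acting.")
print(tokens)
```

Joining the tokens back into a string (`" ".join(tokens)`) gives text that can be passed to a TF-IDF vectorizer.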
pip install pandas numpy matplotlib seaborn scikit-learn nltk
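
With those packages installed, the vectorization, training, and evaluation steps can be sketched as follows. This is a minimal illustration, not the project's own code: the toy `reviews` and `labels` are made-up stand-ins for the real data, and the split ratio, `random_state`, and `max_iter` values are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Toy stand-in data (hypothetical examples, not taken from the IMDB dataset).
reviews = [
    "a wonderful film with great acting",
    "loved every minute of this movie",
    "brilliant story and superb cast",
    "an excellent and moving experience",
    "great fun, would watch again",
    "beautiful, touching and well made",
    "fantastic direction and a strong script",
    "one of the best films this year",
    "a boring film with terrible acting",
    "hated every minute of this movie",
    "dull story and a weak cast",
    "an awful and tedious experience",
    "no fun at all, would not watch again",
    "ugly, clumsy and badly made",
    "poor direction and a lazy script",
    "one of the worst films this year",
]
labels = ["positive"] * 8 + ["negative"] * 8

# Hold out a stratified test split so both classes appear in it.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, stratify=labels, random_state=42
)

# Fit TF-IDF on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)
y_pred = model.predict(X_test_vec)

# Per-class precision/recall/F1, plus the raw confusion matrix.
print(classification_report(y_test, y_pred, zero_division=0))
cm = confusion_matrix(y_test, y_pred, labels=["negative", "positive"])
print(cm)
```

For the real data, the toy lists would be replaced by the `review` and `sentiment` columns read from `IMDB Dataset.csv` (for example with `pandas.read_csv`). The `cm` array can be passed to `seaborn.heatmap` to draw the confusion matrix heatmap.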