
Prototype analyzing music reviews from informal user comments, using Amazon reviews as a stand-in dataset. Compares a traditional ML pipeline (Scikit-learn + GridSearchCV) against a deep learning model (GloVe + LSTM) to assess viability, covering data loading, preprocessing, model training, evaluation, and visualization.


tfayemi/review-classifier




Review Classifier

A project for predicting whether an Amazon review is helpful. This repository implements two approaches:

  1. Traditional Machine Learning using Scikit-learn pipelines and GridSearchCV.
  2. Deep Learning using pre-trained GloVe embeddings and an LSTM-based neural network.



Table of Contents

  1. Overview
  2. Features
  3. Project Structure
  4. Installation
  5. Usage
  6. Technical Details
  7. Results
  8. Contributing
  9. License
  10. Contact

Overview

This project explores how to determine if an Amazon review is “helpful” or “not helpful” by leveraging:

  • Data Processing: Cleans and preprocesses review data, calculates a “helpfulness ratio,” and splits data for training/testing.
  • Traditional ML: Uses Scikit-learn’s pipelines (CountVectorizer, TfidfTransformer, SGDClassifier) and performs hyperparameter tuning with GridSearchCV.
  • Deep Learning: Integrates pre-trained GloVe embeddings and trains an LSTM network to classify review helpfulness.

The repository compares a classical ML method against a deep learning model on the same text classification task, so the two can be evaluated side by side.
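The helpfulness ratio and the train/test split described above can be sketched as follows. This is a minimal illustration, not the code in data_processing.py: the field names `helpful` (a `[helpful_votes, total_votes]` pair) and `reviewText`, the 0.5 threshold, and the function names are all assumptions.

```python
import random


def helpfulness_ratio(helpful_votes, total_votes):
    """Return helpful_votes / total_votes, or None when nobody voted."""
    if total_votes <= 0:
        return None
    return helpful_votes / total_votes


def label_reviews(records, threshold=0.5, min_votes=1):
    """Turn raw review dicts into (text, label) pairs; label 1 means 'helpful'.

    Assumes each record has a "helpful" field holding
    [helpful_votes, total_votes] and a "reviewText" field (hypothetical layout).
    """
    examples = []
    for rec in records:
        helpful_votes, total_votes = rec["helpful"]
        ratio = helpfulness_ratio(helpful_votes, total_votes)
        if ratio is None or total_votes < min_votes:
            continue  # too few votes to trust the ratio
        examples.append((rec["reviewText"], int(ratio >= threshold)))
    return examples


def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle deterministically and return (train, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Reviews with no votes are dropped rather than labeled "not helpful", because a ratio of 0/0 says nothing about the review's quality.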


Features

  • End-to-End Pipeline: Covers data loading, preprocessing, model training, and evaluation in a single run.
  • GridSearchCV: Automatic hyperparameter tuning for the traditional ML pipeline.
  • Pre-trained Embeddings: Incorporates GloVe vectors for improved semantic understanding in the LSTM model.
  • Visualization: Displays confusion matrices and accuracy/loss curves.
  • Extensible: Easily add or swap out new models, embeddings, or data.

Project Structure

review-classifier/
├── data/
│   └── sample_dataset.json        # Sample data for testing
├── glove/
│   └── glove.6B.100d.txt          # Pre-trained GloVe embeddings
├── models/
│   └── helpfulness_prediction_model.hdf5  # (Optional) Saved DL model
├── notebooks/
│   └── exploratory_analysis.ipynb # Jupyter notebook for initial EDA
├── src/
│   ├── __init__.py                # Marks src as a package
│   ├── data_processing.py         # Data loading & preprocessing
│   ├── deep_learning.py           # GloVe + LSTM model definitions
│   ├── traditional_ml.py          # Scikit-learn pipelines & evaluation
│   └── utils.py                   # Utility functions (e.g., cosine similarity)
├── main.py                        # Entry point to run the entire pipeline
├── requirements.txt               # Project dependencies
└── README.md                      # Project documentation (this file)

Key Modules

  • data_processing.py: Loads JSON data, computes helpfulness ratio, splits data into train/test sets.
  • traditional_ml.py: Defines and tunes an SGDClassifier pipeline, evaluates performance via confusion matrix and classification report.
  • deep_learning.py: Reads GloVe embeddings, builds an LSTM model, and provides methods for converting text to indices.
  • utils.py: Houses general-purpose helper functions (e.g., cosine_similarity).
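As an illustration of the text-to-indices step listed for deep_learning.py, the sketch below builds a vocabulary and converts a review into a fixed-length list of integer ids. The whitespace tokenizer, the `<pad>` and `<unk>` tokens, and the function names are assumptions made for this example; the module itself may do this differently.

```python
def build_vocab(texts, min_count=1):
    """Map each token seen at least min_count times to an integer id.

    Ids 0 and 1 are reserved for padding and unknown tokens (an assumption).
    """
    counts = {}
    for text in texts:
        for token in text.lower().split():
            counts[token] = counts.get(token, 0) + 1
    vocab = {"<pad>": 0, "<unk>": 1}
    for token in sorted(counts):  # sorted so ids are reproducible
        if counts[token] >= min_count:
            vocab[token] = len(vocab)
    return vocab


def text_to_indices(text, vocab, max_len):
    """Convert text to exactly max_len ids, truncating or right-padding."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in text.lower().split()]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))
```

Fixed-length sequences let a batch of reviews be stacked into one integer array, which is the input shape an embedding layer followed by an LSTM typically expects.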

Installation

  1. Clone this repository

    git clone https://github.com/tfayemi/review-classifier.git
    cd review-classifier
  2. Set up a virtual environment (recommended)

    python -m venv venv
    source venv/bin/activate   # On macOS/Linux
    # or venv\Scripts\activate # On Windows
  3. Install dependencies

    pip install -r requirements.txt
  4. Download GloVe Embeddings (if not already in ./glove/)

    • GloVe 6B Data (choose glove.6B.zip and extract the 100d file into ./glove/).
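Once extracted, the GloVe file is plain text with one word per line followed by its space-separated vector components. A minimal sketch of reading it and comparing two word vectors is shown below. The function names are illustrative and are not taken from deep_learning.py or utils.py. Plain Python lists are used to keep the example dependency-free; NumPy arrays would be the usual choice in practice.

```python
import math


def load_glove(lines):
    """Parse GloVe text lines ('word v1 v2 ...') into {word: [floats]}."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

To load the real file, pass an open file handle, for example `with open("glove/glove.6B.100d.txt", encoding="utf-8") as f: vectors = load_glove(f)`.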

Usage

  1. Prepare Data

    • Place your dataset (JSON file) in the data/ directory.
    • Update file paths in main.py if necessary.
  2. Run the Pipeline

    python main.py
  3. View Results

    • Check the console output for accuracy, confusion matrix, and classification report.
    • For the deep learning model, training/validation curves are plotted in a new window (if enabled in code).
  4. Explore the Notebooks

    • Open notebooks/exploratory_analysis.ipynb in Jupyter for a step-by-step exploration and additional insights.

Technical Details

  • Traditional ML

    • Pipelines: Combines CountVectorizer + TfidfTransformer + SGDClassifier.
    • GridSearchCV: Tunes ngram_range, use_idf, and alpha parameters.
    • Metrics: Accuracy, confusion matrix, classification report.
  • Deep Learning

    • Embedding: Utilizes pre-trained GloVe 100-dimensional vectors.
    • Model: Two-layer LSTM with dropout.
    • Training: Early stopping and model checkpointing.
    • Evaluation: Accuracy, loss curves, optional confusion matrix.
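The traditional ML configuration listed above can be sketched with scikit-learn as follows. The step names (`vect`, `tfidf`, `clf`), the candidate values in the grid, and the `build_search` helper are assumptions chosen for illustration; the values searched in traditional_ml.py may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def build_search(cv=5):
    """Return an unfitted grid search over a bag-of-words + SGD pipeline."""
    pipeline = Pipeline([
        ("vect", CountVectorizer()),             # raw text -> token counts
        ("tfidf", TfidfTransformer()),           # counts -> TF-IDF weights
        ("clf", SGDClassifier(random_state=0)),  # linear classifier trained with SGD
    ])
    # Parameters are addressed as <step name>__<parameter name>.
    param_grid = {
        "vect__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
        "tfidf__use_idf": [True, False],
        "clf__alpha": [1e-2, 1e-3],             # regularization strength
    }
    return GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
```

After `search = build_search()` and `search.fit(train_texts, train_labels)`, the selected settings are available in `search.best_params_`, and `search.predict(test_texts)` uses the pipeline refit with those settings.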

Results

Method                     Accuracy (approx.)
SGDClassifier (best)       ~85-90%
LSTM + GloVe embeddings    ~88-92%

(Note: These numbers are hypothetical. Replace them with your actual findings.)


Contributing

Contributions are welcome! If you’d like to:

  1. Add new models (e.g., random forests, logistic regression, or transformers).
  2. Improve data preprocessing (e.g., advanced text cleaning, lemmatization).
  3. Optimize hyperparameters or explore new embeddings.

Please fork the repository, make your changes, and open a pull request.


License

This project is licensed under the MIT License. Feel free to use, modify, and distribute this code for personal or commercial projects. Attribution is appreciated.


Contact

Feel free to reach out with any questions or suggestions!


Happy Coding!
