Movie Genre Classifier
Project Description: Movie Poster Genre Classification
Overview
This project classifies movie genres from poster images using a combination of web scraping, database management, and machine learning. The pipeline scrapes metadata and posters from public movie databases, converts each poster into a feature vector, and predicts genres from those vectors, combining data engineering with deep learning.
Key Steps
Web Scraping:
Objective: Scrape movie metadata, including titles, genres, and poster image URLs, from IMDb or TMDb.
Implementation:
Used BeautifulSoup and requests to extract data such as titles and genres.
Downloaded poster images for vectorization and analysis.
Outcome: Collected a comprehensive dataset of movie metadata and poster images for further processing.
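The extraction step can be sketched as follows. This is a minimal illustration, not the project's actual scraper: the HTML structure, the class names (movie, title, genre, poster), and the example URL are all hypothetical, since real IMDb or TMDb pages use different markup. In practice the HTML string would come from something like requests.get(url, timeout=10).text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a scraped listing page.
SAMPLE_HTML = """
<div class="movie">
  <h2 class="title">Example Movie</h2>
  <span class="genre">Drama</span>
  <span class="genre">Thriller</span>
  <img class="poster" src="https://example.com/posters/1.jpg">
</div>
<div class="movie">
  <h2 class="title">No Poster Movie</h2>
  <span class="genre">Comedy</span>
</div>
"""


def parse_movies(html):
    """Return one dict per movie card that has both a title and a poster URL."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for card in soup.find_all("div", class_="movie"):
        title = card.find("h2", class_="title")
        poster = card.find("img", class_="poster")
        if title is None or poster is None:
            continue  # skip incomplete entries; they cannot be vectorized
        genres = [g.get_text(strip=True) for g in card.find_all("span", class_="genre")]
        movies.append({
            "title": title.get_text(strip=True),
            "genres": genres,
            "poster_url": poster.get("src"),
        })
    return movies


movies = parse_movies(SAMPLE_HTML)
```

Entries without a poster are dropped here because every later stage depends on the image; whether the original scraper did the same is not stated.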
Data Storage:
Objective: Organize metadata and poster embeddings into a structured SQL database.
Implementation:
Designed a schema with three main tables:
Movies: Includes movie titles and poster images.
Genres: Stores unique genre names.
Movie-Genres Relationship: Links movies to their respective genres.
Stored poster embeddings (256-dimensional feature vectors) in a separate vectors table for efficient retrieval.
Outcome: A scalable database for storing and querying metadata and embeddings.
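One possible layout for the schema described above, sketched with SQLite. The database engine, table names, column names, and the helper functions add_movie and genres_of are assumptions made for illustration; the project only specifies the three main tables plus a vectors table.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS movies (
    id     INTEGER PRIMARY KEY,
    title  TEXT NOT NULL,
    poster BLOB                      -- raw image bytes
);
CREATE TABLE IF NOT EXISTS genres (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS movie_genres (
    movie_id INTEGER NOT NULL REFERENCES movies(id),
    genre_id INTEGER NOT NULL REFERENCES genres(id),
    PRIMARY KEY (movie_id, genre_id)
);
CREATE TABLE IF NOT EXISTS vectors (
    movie_id  INTEGER PRIMARY KEY REFERENCES movies(id),
    embedding BLOB NOT NULL          -- serialized 256-dimensional vector
);
"""


def create_db(path=":memory:"):
    """Open a connection and create all tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_movie(conn, title, genres, poster=None):
    """Insert a movie, create any missing genres, and link them together."""
    movie_id = conn.execute(
        "INSERT INTO movies (title, poster) VALUES (?, ?)", (title, poster)
    ).lastrowid
    for name in genres:
        conn.execute("INSERT OR IGNORE INTO genres (name) VALUES (?)", (name,))
        genre_id = conn.execute(
            "SELECT id FROM genres WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO movie_genres VALUES (?, ?)", (movie_id, genre_id)
        )
    conn.commit()
    return movie_id


def genres_of(conn, movie_id):
    """Return the genre names linked to a movie, sorted alphabetically."""
    rows = conn.execute(
        "SELECT g.name FROM genres g "
        "JOIN movie_genres mg ON mg.genre_id = g.id "
        "WHERE mg.movie_id = ? ORDER BY g.name",
        (movie_id,),
    ).fetchall()
    return [r[0] for r in rows]
```

The junction table keeps the many-to-many relationship between movies and genres normalized, so a genre name is stored once no matter how many movies share it.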
Image Vectorization:
Objective: Convert poster images into embeddings that capture visual features.
Implementation:
Fine-tuned a pre-trained ResNet34 model to generate 256-dimensional embeddings.
Applied image preprocessing (resizing, normalization) so that posters match the input format the pre-trained network expects.
Outcome: Generated embeddings for all poster images, representing them as high-dimensional feature vectors.
Genre Classification:
Objective: Train a neural network to predict movie genres based on poster embeddings.
Implementation:
Used a multi-label classification setup, as movies often belong to multiple genres.
Handled class imbalance by computing and applying appropriate weights during training.
Evaluated performance using metrics such as F1-score, precision, and recall.
Outcome: Developed a multi-label model that predicts one or more genres per poster, assessed with the metrics listed above.
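To make the class weighting and the evaluation metrics concrete, here is a small dependency-free sketch. The weighting formula (negatives divided by positives per genre) is one common choice and an assumption here, since the description does not say which scheme was used; values computed this way are what one would typically pass as pos_weight to torch.nn.BCEWithLogitsLoss. The metrics are micro-averaged, which is also an assumption.

```python
def positive_class_weights(labels):
    """Per-genre weight = (# negative examples) / (# positive examples).

    labels is a list of multi-hot rows, one row per movie.
    Rare genres get weights above 1, common genres below 1.
    """
    n_samples = len(labels)
    n_genres = len(labels[0])
    weights = []
    for j in range(n_genres):
        positives = sum(row[j] for row in labels)
        # Fall back to 1.0 when a genre never occurs, to avoid dividing by zero.
        weights.append((n_samples - positives) / positives if positives else 1.0)
    return weights


def precision_recall_f1(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over all (movie, genre) pairs."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: columns are two hypothetical genres, rows are movies.
train_labels = [[1, 0], [1, 0], [1, 1], [0, 0]]
weights = positive_class_weights(train_labels)

y_true = [[1, 0], [1, 1]]
y_pred = [[1, 1], [0, 1]]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Plain accuracy is a poor fit for this setup because most (movie, genre) pairs are negative, which is why precision, recall, and F1 are the more informative numbers.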
Vector Search:
Objective: Perform similarity searches to classify or recommend movies based on poster vectors.
Implementation:
Stored embeddings in the SQL database for efficient querying.
Calculated similarities using Euclidean distance between vectors.
Outcome: Enabled genre classification and recommendations for new posters by querying the vector database.
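The similarity lookup can be illustrated with a brute-force scan over embeddings stored as BLOBs. This is a sketch under assumptions: SQLite as the engine, float32 serialization through the standard array module, and the table and function names are all chosen for illustration. Short 2-dimensional vectors stand in for the 256-dimensional embeddings.

```python
import math
import sqlite3
from array import array


def to_blob(vector):
    """Serialize a list of floats as packed float32 bytes."""
    return array("f", vector).tobytes()


def from_blob(blob):
    """Inverse of to_blob: bytes back to a list of floats."""
    values = array("f")
    values.frombytes(blob)
    return list(values)


def make_vector_table(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vectors ("
        "movie_id INTEGER PRIMARY KEY, embedding BLOB NOT NULL)"
    )


def store_vector(conn, movie_id, vector):
    conn.execute(
        "INSERT OR REPLACE INTO vectors VALUES (?, ?)", (movie_id, to_blob(vector))
    )


def nearest(conn, query, k=5):
    """Return the k (movie_id, distance) pairs closest to query by Euclidean distance."""
    rows = conn.execute("SELECT movie_id, embedding FROM vectors").fetchall()
    scored = [(math.dist(query, from_blob(blob)), movie_id) for movie_id, blob in rows]
    scored.sort()
    return [(movie_id, distance) for distance, movie_id in scored[:k]]


conn = sqlite3.connect(":memory:")
make_vector_table(conn)
store_vector(conn, 1, [0.0, 0.0])
store_vector(conn, 2, [3.0, 4.0])
store_vector(conn, 3, [1.0, 0.0])
neighbours = nearest(conn, [0.0, 0.0], k=2)
```

A new poster's genres could then be inferred from the genres of its nearest neighbours. Because every query reads the whole table, the cost grows linearly with the number of stored posters, which is the motivation for the approximate-search frameworks listed under Next Steps.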
Skills Demonstrated
Web Scraping: Extracted structured data from unstructured web pages using BeautifulSoup and requests.
Database Management: Designed and implemented a relational database schema, handling binary data storage for images.
Deep Learning:
Fine-tuned a CNN for feature extraction and genre classification.
Implemented custom layers to generate embeddings for vector search.
Data Engineering:
Managed large datasets of images and metadata.
Performed data augmentation and preprocessing to ensure model robustness.
Machine Learning: Used metrics like F1-score and recall to assess multi-label classification performance.
Optimization: Leveraged GPU acceleration for model training and batch processing for efficiency.
Key Outcomes
Created a scalable pipeline for scraping, storing, and analyzing movie posters and metadata.
Built a multi-label classification model for genre prediction, evaluated with F1-score, precision, and recall.
Enabled vector-based similarity search for recommendations or classification of new posters.
Next Steps
Expand Dataset: Include more diverse movies and genres to improve model generalization.
Enhance Search: Integrate more advanced vector search frameworks like FAISS or Milvus for scalability.
Deploy Solution: Build an interactive web interface for users to upload posters and get genre recommendations.
This project illustrates a practical application of machine learning in the entertainment domain, showcasing the potential of integrating data engineering with AI solutions.