Art Collection Web Scraping
Project: Web Scraping and Data Analysis of a University Art Collection
Project Overview
This project, conducted as part of my Data Science coursework under the guidance of Professor Daniel Turek, demonstrates my proficiency in web scraping, data cleaning, and exploratory data analysis using R. The goal was to extract information from the University of Edinburgh's art collection website, clean and process the data, and analyze trends to uncover insights.
Objectives
Data Extraction: Scrape titles, artists, and links for all records in the collection.
Data Cleaning: Address inconsistencies such as missing artist and year information.
Analysis and Visualization: Uncover trends, identify prolific artists, and investigate temporal distributions.
Key Steps and Skills
1. Web Scraping with rvest:
Developed a custom scraping function in R to parse HTML pages and extract information.
Used CSS selectors to target specific elements for titles, artist names, and artwork links.
Concatenated and cleaned artist names using string manipulation functions.
Example:
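A minimal sketch of such a scraping function, assuming hypothetical CSS selectors (".artwork-title", ".artist-name") rather than the collection site's actual markup:

```r
library(rvest)
library(stringr)

# Sketch of a per-page scraper; the selectors below are illustrative
# placeholders, not the site's real classes.
scrape_page <- function(url) {
  page <- read_html(url)
  data.frame(
    title  = page |> html_elements(".artwork-title") |> html_text2(),
    artist = page |> html_elements(".artist-name") |> html_text2() |> str_squish(),
    link   = page |> html_elements(".artwork-title a") |> html_attr("href")
  )
}
```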
2. Regular Expressions for Data Cleaning:
Extracted year information from titles by splitting strings at parentheses.
Used regular expressions to identify formatting mismatches in titles and correct extraction errors.
Example:
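A sketch of the year-extraction logic; matching a four-digit group directly sidesteps the naive-splitting pitfall described in the findings below:

```r
library(stringr)

# Extract the year from the last "(dddd)" group in a title.
# Splitting on "(" alone would return "2" for "Death Mask (2) (1964)";
# requiring four digits returns "1964".
extract_year <- function(title) {
  matches <- str_match_all(title, "\\((\\d{4})\\)")[[1]]
  if (nrow(matches) == 0) NA_integer_ else as.integer(tail(matches[, 2], 1))
}

extract_year("Death Mask (2) (1964)")  # 1964
```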
3. Data Manipulation:
Used dplyr functions to group, filter, and summarize data.
Created derived columns, such as extracted year values and missing-data flags, for enhanced analysis.
Addressed missing data and highlighted inconsistencies for further review.
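Example (a sketch assuming a scraped data frame `artworks` with `title`, `artist`, and `year` columns):

```r
library(dplyr)

# Flag missing years and summarise works per artist.
artist_summary <- artworks |>
  mutate(missing_year = is.na(year)) |>
  group_by(artist) |>
  summarise(
    n_works        = n(),
    n_missing_year = sum(missing_year)
  ) |>
  arrange(desc(n_works))
```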
4. Visualization:
Produced histograms, scatter plots, bar charts, and heatmaps to explore data trends and missing values.
Used ggplot2 for clear and aesthetically pleasing visualizations.
Example of a histogram:
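A sketch, assuming the `artworks` data frame from the previous steps:

```r
library(ggplot2)

# Distribution of creation years, excluding records with missing years.
ggplot(subset(artworks, !is.na(year)), aes(x = year)) +
  geom_histogram(binwidth = 10, fill = "steelblue", colour = "white") +
  labs(
    title = "Artworks in the Collection by Year of Creation",
    x = "Year",
    y = "Number of artworks"
  )
```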
Findings from the University Art Collection
The analysis of the university art museum catalog revealed several key insights and issues, emphasizing the need for data cleaning and exploration:
Data Completeness:
Out of 1,970 records, 68 artworks lack artist information, while 943 records (nearly half the dataset) have missing year values. This highlights a significant gap in cataloging consistency.
Parsing Errors:
Parsing inconsistencies were evident, such as in the entry "Death Mask (2) (1964)", where the year was incorrectly extracted as 2 instead of 1964. This underscores the challenges of relying solely on automated string splitting for data extraction.
Historical Trends:
Visualizing the distribution of artwork creation years (excluding missing data) showed a concentration of entries in the mid-to-late 20th century, with a notable peak around the 1960s. This pattern may reflect a surge in artistic production or in collection acquisitions during this period.
Prolific Artists:
Among the 873 unique artists, Emma Gillies emerged as the second most prolific contributor with 111 artworks. Identifying such artists helps surface major influences on the collection's diversity and themes.
Common Themes:
Keyword analysis found that only 4 artworks contain the word "child" in their titles, suggesting limited representation of childhood-related themes (a one-line sketch of the check follows this list).
Data Visualization:
A heatmap of missing data clearly illustrated the concentration of missing values in the "Year" column, providing a visual representation of the dataset's completeness issues.
A scatter plot of the number of artworks over time highlighted sparse representation in the early 19th century compared to more recent periods.
Artist Contributions:
A bar plot of the top 10 most prolific artists revealed that a few individuals dominate the collection, suggesting possible focal points in acquisition efforts or donor preferences.
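The keyword check referenced under Common Themes amounts to a one-line, case-insensitive match on titles (a sketch, assuming the same `artworks` data frame):

```r
# Count titles containing the word "child", ignoring case.
sum(grepl("child", artworks$title, ignore.case = TRUE))
```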
Summary of Findings
The project effectively highlighted the value and challenges of analyzing real-world datasets. Missing and inconsistent data hinder the ability to draw comprehensive conclusions but also emphasize the importance of robust preprocessing methods. The insights into historical trends, artist contributions, and data gaps underscore the potential for enhancing the catalog's utility with targeted data cleaning and enrichment efforts. These findings pave the way for a deeper exploration of thematic representation and acquisition patterns in the collection.
Skills Demonstrated
Web Scraping: Proficient in automating data extraction from websites using the rvest package and HTML parsing techniques.
Data Cleaning: Effective handling of inconsistencies and missing values using regular expressions and R's string processing tools.
Data Manipulation: Strong command over tidyverse libraries for transforming and summarizing datasets.
Visualization: Adept at creating detailed and insightful plots using ggplot2 to communicate findings effectively.
Programming: Advanced R skills, particularly in the context of text processing, data cleaning, and exploratory analysis.
Areas for Improvement
Error Handling in Web Scraping:
The scraping process could be enhanced with robust error handling and retry logic for failed requests (see the sketch after this list).
Parsing Flexibility:
Implement more sophisticated methods for title parsing to reduce errors in extracting year information.
Data Quality Assessment:
Introduce automated tools for assessing the completeness and consistency of scraped data.
Scalability:
Optimize the scraping function to handle larger datasets or additional fields for future use.
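A sketch of the retry mechanism suggested under Error Handling above; the function name, attempt count, and back-off are illustrative choices:

```r
# Retry wrapper around rvest::read_html with a fixed back-off.
read_html_retry <- function(url, attempts = 3, wait = 2) {
  for (i in seq_len(attempts)) {
    page <- tryCatch(rvest::read_html(url), error = function(e) NULL)
    if (!is.null(page)) return(page)
    Sys.sleep(wait)
  }
  stop("Failed to fetch ", url, " after ", attempts, " attempts")
}
```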
Conclusion
This project highlights my ability to extract, clean, and analyze complex datasets, showcasing critical skills in data engineering, data science, and visual storytelling. By addressing inconsistencies in a real-world dataset and deriving actionable insights, I demonstrated a practical understanding of leveraging R for web scraping and analytics.