How we scrape, clean and enrich data with Webscraper, DataPrep by Trifacta and Open Refine.
--
We are working on a website that helps you find local courses, workshops and bootcamps. We call it cademy.io. To build the initial version, we decided to scrape the websites of local educators in our city, clean up the data and slap a searchable interface on top of it. This article outlines the process we used to source, clean and enrich the initial list of courses.
Disclaimer: we’ve never done anything like this before, and we’re not data scientists or experts of any type. There must be better ways to do this, and if you think better options please comment. This is just the solution we came up with.
Part 1 — How we scraped web pages
There are plenty of tools for scraping websites. Here’s the ones we actively used so far:
- Instant Data Scraper — this little scraper is great if you need to scrape a list on a website. We needed to navigate around a the website and collect data that was a little less structured, so it didn’t work well for our usecase.
- Import.io — this option is more powerful but the free tier limits you to 1,000 pages and the paid plan was extremely expensive (over $15k p.y.)
- Webscraper.io —this options was free and flexible, which is why we ended up using it. It does have a bit of a learning curve to it though.
A quick google search should give you dozens of other options for you to try if the ones above don’t work for your needs.
After a few busy hours we managed to scrape over 2,500 courses from the 3 biggest course providers in our city. Once we had the raw data in spreadsheets, we needed to figure out a way to make it usable.
Part 2 — How we cleaned our data
Ok, now that we had a lot of raw data we scraped. The problem was that we collected different type of data from different places — for example the course date and price might be formatted very differently depending on the website we sourced that information from. So we needed a way to clean up our data.
We Googled “how to clean data” and we got a wikipedia article on “data cleansing”. In the wikipedia article there’s a section for “tools” and we looked through the options until eventually trying out OpenRefine and Trifacta. We decided to go with Google Dataprep by Trifacta and here’s the best walkthrough we could find for it:
Just minutes after uploading our spreadsheets into Trifacta we were able to gain incredible insights into about the data we had on hands. Here’s the kind of insights we were able to get in the first few minutes of analysing the data:
- What are the biggest course categories.
- Which days are courses most commonly held on.
- How many courses are missing crucial information, such as pricing.
We derived all of this information without any formulae or setup. This information was revealed to us the moment we imported our data thanks to the interactive histograms Trifacta is generating for each of our columns. For example, just by looking at the histograms we could immediately see that the most popular time for a class is 6.30pm and the most popular days are Thursdays.
At this point, we had three spreadsheets of scraped courses that we collected. We imported the spreadsheets into dataprep, used their simple data cleaning functionality to turn our data mush into a delightfully clean and usable recordset. Here’s what that process looks like:
A few tips on using DataPrep by Trifacta:
- When importing our CSV files into Trifacta, the data in every single cell was wrapped in
"quotes"
. Although there’s a formula to remove these from the data, we didn’t want to apply that formula to every single column we had individually. To solve this, we imported our CSV into google sheets and downloaded the data as Tab Separated Values (TSV) rather than a CSV. When importing it into DataPrep as a TSV file, our records would be shown without the quotes. - At one point, DataPrep had a problem where the system would freeze anytime we tried to edit our “recipe”. We couldn’t access or edit any of our data. We were stuck. The page would load to “99%” and then freeze. After spending a couple of days trying to figure out what we were doing wrong, we emailed support@trifacta.com and they solved the problem from their side in just a few minutes. If you experience issues, try reaching out to them as they will do a great job investigating the issue in order to solve it.
Part 3 — How we enriched our data with third party APIs
Now we had a clean set of 2,500 local courses. We wanted to put them into an interface that you could search. The only problem is that a basic list of courses is pretty dull. We wanted to add some pictures to our listings to make the courses more enticing. To do that, we wanted to query the Unsplash API and get a free stock image based on the title of the course. The problem was that most courses were named something like: Foundation Apprenticeship In Information Technology: Software Development (Scp)
, which were returning no results from Unsplash. Our solution was to first use the IBM Watson keyword extraction API to pick a keyword out of the title of the course, which we’d later use to generate the thumbnail. There’s many ways to do this in code, and we also thought about writing up the script as an App Script in Google Sheets. But the quickest method we came with was to back to OpenRefine, the tool we tested when trying to figure out how to clean up our data. There’s a feature in it that allows you to query external APIs based on your existing columns, here’s how it works:
In our case, the three APIs we ended up querying data from are:
- IBM watson
- Unsplash
- Google Maps geocoding API (to get coordinates to use for location searching later on)
Here’s screenshots of how we configured OpenRefine to make those requests.
Putting it all together
Once we had all our data cleaned and enriched, we imported it into Algolia to make it searchable. We built a basic website in Webflow to house our search feature and here’s what the end result ended up looking like:
An unexpected problem is that stock photos sometimes don’t match the description of the course, but we see that as a funny quirk more than an actual problem . If you search for a German class in Edinburgh you will see quite a few pictures of German shepherds 😅. It’s not perfect , but it works for most courses, and we’re pretty happy about it 🙂.
If you have any questions, please post a comment below or reach me on Twitter. You can also email me on adrian@cademy.io and I will try to help the best I can. Thanks 👋