How we scrape, clean and enrich data with Webscraper, DataPrep by Trifacta and Open Refine.

Part 1 — How we scraped web pages

There are plenty of tools for scraping websites. Here’s the ones we actively used so far:

  • Instant Data Scraper — this little scraper is great if you need to scrape a list on a website. We needed to navigate around a the website and collect data that was a little less structured, so it didn’t work well for our usecase.
  • Import.io — this option is more powerful but the free tier limits you to 1,000 pages and the paid plan was extremely expensive (over $15k p.y.)
  • Webscraper.io —this options was free and flexible, which is why we ended up using it. It does have a bit of a learning curve to it though.

Part 2 — How we cleaned our data

Ok, now that we had a lot of raw data we scraped. The problem was that we collected different type of data from different places — for example the course date and price might be formatted very differently depending on the website we sourced that information from. So we needed a way to clean up our data.

  • What are the biggest course categories.
  • Which days are courses most commonly held on.
  • How many courses are missing crucial information, such as pricing.
  1. When importing our CSV files into Trifacta, the data in every single cell was wrapped in "quotes". Although there’s a formula to remove these from the data, we didn’t want to apply that formula to every single column we had individually. To solve this, we imported our CSV into google sheets and downloaded the data as Tab Separated Values (TSV) rather than a CSV. When importing it into DataPrep as a TSV file, our records would be shown without the quotes.
  2. At one point, DataPrep had a problem where the system would freeze anytime we tried to edit our “recipe”. We couldn’t access or edit any of our data. We were stuck. The page would load to “99%” and then freeze. After spending a couple of days trying to figure out what we were doing wrong, we emailed support@trifacta.com and they solved the problem from their side in just a few minutes. If you experience issues, try reaching out to them as they will do a great job investigating the issue in order to solve it.

Part 3 — How we enriched our data with third party APIs

Now we had a clean set of 2,500 local courses. We wanted to put them into an interface that you could search. The only problem is that a basic list of courses is pretty dull. We wanted to add some pictures to our listings to make the courses more enticing. To do that, we wanted to query the Unsplash API and get a free stock image based on the title of the course. The problem was that most courses were named something like: Foundation Apprenticeship In Information Technology: Software Development (Scp) , which were returning no results from Unsplash. Our solution was to first use the IBM Watson keyword extraction API to pick a keyword out of the title of the course, which we’d later use to generate the thumbnail. There’s many ways to do this in code, and we also thought about writing up the script as an App Script in Google Sheets. But the quickest method we came with was to back to OpenRefine, the tool we tested when trying to figure out how to clean up our data. There’s a feature in it that allows you to query external APIs based on your existing columns, here’s how it works:

  • IBM watson
  • Unsplash
  • Google Maps geocoding API (to get coordinates to use for location searching later on)

Putting it all together

Once we had all our data cleaned and enriched, we imported it into Algolia to make it searchable. We built a basic website in Webflow to house our search feature and here’s what the end result ended up looking like:

Unlisted

--

--

--

Hey 👋, I’m the Co-Founder & CEO of cademy.io — We’re building a marketplace for local courses out of Edinburgh, Scotland. 👉 I post startup stories and tips 👍

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

From drop year to getting a successful job as a data analyst

How Should SMBs Define ‘Big Data’ to Gain ROI?

The Product/Data Fit Strategy

New Feature: Global Industry Codes

How Pierce Transit Used Swiftly to Improve Prediction Accuracy By Up to 50%

Bridging the Gap between Prescriptive Analytics and Trust

Youtube Trending Videos Analysis

5 Data Science Use Cases for Every Business

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Adrian Binzaru

Adrian Binzaru

Hey 👋, I’m the Co-Founder & CEO of cademy.io — We’re building a marketplace for local courses out of Edinburgh, Scotland. 👉 I post startup stories and tips 👍