How we scrape, clean and enrich data with Webscraper.io, DataPrep by Trifacta and OpenRefine.

Part 1 — How we scraped web pages

  • Instant Data Scraper: this little scraper is great if you need to scrape a list on a website. We needed to navigate around the website and collect data that was a little less structured, so it didn’t work well for our use case.
  • Import.io: this option is more powerful, but the free tier limits you to 1,000 pages and the paid plan was extremely expensive (over $15k per year).
  • Webscraper.io: this option was free and flexible, which is why we ended up using it. It does have a bit of a learning curve, though.

Part 2 — How we cleaned our data

  • What are the biggest course categories?
  • Which days are courses most commonly held on?
  • How many courses are missing crucial information, such as pricing?
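Questions like these can also be checked outside DataPrep. The following is only a minimal Python sketch, not the workflow we used: the column names `category`, `start_date` and `price`, the ISO date format, and the sample rows are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter
from datetime import date

# Fixed English names so the result does not depend on the system locale.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def summarize_courses(rows):
    """Count courses per category, per weekday, and those without a price.

    `rows` is an iterable of dicts with the (hypothetical) keys
    "category", "start_date" (YYYY-MM-DD) and "price".
    """
    rows = list(rows)
    categories = Counter(
        r["category"].strip() for r in rows if r["category"].strip()
    )
    weekdays = Counter(
        DAYS[date.fromisoformat(r["start_date"].strip()).weekday()]
        for r in rows
        if r["start_date"].strip()
    )
    missing_price = sum(1 for r in rows if not r["price"].strip())
    return categories, weekdays, missing_price


# Made-up sample data standing in for a scraped export.
SAMPLE = """category,start_date,price
Cooking,2020-03-07,45
Cooking,2020-03-14,
Pottery,2020-03-09,60
"""

categories, weekdays, missing_price = summarize_courses(
    csv.DictReader(io.StringIO(SAMPLE))
)
print(categories.most_common(3))  # biggest categories first
print(weekdays.most_common(3))    # most common course days first
print(missing_price)              # number of rows with no price
```

With a real export you would pass `csv.DictReader(open(path, newline=""))` instead of the in-memory sample, and adjust the column names to match your scraper’s output.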
  1. When we imported our CSV files into Trifacta, the data in every single cell was wrapped in "quotes". Although there’s a formula to remove these, we didn’t want to apply it to each of our columns individually. To solve this, we imported our CSV into Google Sheets and downloaded the data as Tab Separated Values (TSV) rather than as a CSV. When we imported the TSV file into DataPrep, our records were shown without the quotes.
  2. At one point, DataPrep froze any time we tried to edit our “recipe”. The page would load to “99%” and then hang, so we couldn’t access or edit any of our data. We were stuck. After spending a couple of days trying to figure out what we were doing wrong, we emailed support@trifacta.com and they solved the problem on their side in just a few minutes. If you run into issues, try reaching out to them; in our experience they do a great job of investigating and resolving them.
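The CSV-to-TSV step from the first point can also be done without a round trip through Google Sheets. This is a minimal sketch using Python’s standard `csv` module, not what we actually did at the time; the function name `csv_to_tsv` and the sample row are made up for illustration.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Convert quoted CSV text into tab-separated text.

    Tabs and line breaks inside a cell are replaced with spaces so the
    writer has no reason to wrap that cell in quotes again. A cell that
    itself contains a double quote would still be quoted.
    """
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO()
    writer = csv.writer(
        out,
        delimiter="\t",
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    for row in reader:
        cleaned = [
            cell.replace("\t", " ").replace("\r", " ").replace("\n", " ")
            for cell in row
        ]
        writer.writerow(cleaned)
    return out.getvalue()


# Every cell is wrapped in quotes, as in the export described above.
quoted_csv = '"title","price"\n"Intro to Pottery, Part 1","45"\n'
tsv_text = csv_to_tsv(quoted_csv)
print(tsv_text)
```

Because the delimiter is now a tab, the comma inside the course title no longer forces quoting, which is why the cells come out bare.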

Part 3 — How we enriched our data with third party APIs

  • IBM Watson
  • Unsplash
  • Google Maps Geocoding API (to get coordinates to use for location-based search later on)
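To give an idea of what the geocoding step can look like, here is a minimal Python sketch against the Geocoding API’s JSON endpoint. It is an illustration under assumptions rather than our production code: the helper names are hypothetical, the sample payload is trimmed down and made up, and you need your own API key for the live call.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Build the request URL for a free-text address."""
    return GEOCODE_URL + "?" + urlencode({"address": address, "key": api_key})


def extract_coordinates(payload):
    """Return (lat, lng) of the first result, or None if nothing matched."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def geocode(address, api_key):
    """Call the API over the network and return (lat, lng) or None."""
    with urlopen(build_geocode_url(address, api_key), timeout=10) as resp:
        return extract_coordinates(json.load(resp))


# Trimmed, made-up response used to exercise the parsing offline.
sample_payload = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 55.95, "lng": -3.19}}}],
}
print(extract_coordinates(sample_payload))
```

Keeping the URL building and response parsing separate from the network call makes it easy to test the parsing on saved responses before spending API quota.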

Putting it all together

Adrian Binzaru

Hey 👋, I’m the Co-Founder & CEO of cademy.io — We’re building a marketplace for local courses out of Edinburgh, Scotland. 👉 I post startup stories and tips 👍