In one of my previous posts I showed you how to do basic web scraping with a combination of Google Sheets, the importxml() formula and XPath.
As soon as your project grows a bit and you use more and more importxml() queries in your sheet, you will eventually be capped by Google's limit on the formula. The content just won’t load anymore.
Luckily there are plenty of other options for web scraping, and you’ll be able to reuse your knowledge of XPath. Nokogiri is a so-called Ruby “gem” that you can run from the terminal.
An example of what can be done with Nokogiri & Ruby
- Scrape multiple elements of a URL as columns in a CSV
- Scrape multiple (similar) URLs at once
- Save the whole dataset to one CSV
An image tells a thousand words
The scraper in action
Pretty satisfying to watch this script do its job.
Not sharing the code
Unfortunately, scraping a website is an extremely “grey” area of the law (see this great writeup on the topic). That’s why I won’t share my script here. Think carefully about the consequences of your actions and ask the content owner for permission before you do anything.
Why write this post without an exact guide?
Scraping is a great thing to have in your toolbelt as a technical marketer, especially since you will probably find yourself in a situation where you want to scrape your own content for one reason or another. Without knowing about (and searching for) other options, you’ll never question and improve your own approach.
How do you get started?
Check out this awesome tutorial. It’s very detailed and allowed me to finish my project.