This article serves two purposes: 1) to show you that there are basically no limits to what can be done with Google Sheets, and 2) to give you a glimpse into web scraping.
The key takeaway should be that there is almost always a way of automating a task; just make sure not to fall into this trap.
What we’re doing
Basically, we’re building a way to extract structured data from seemingly unstructured data (a website).
- I extract all links from a given website.
- I filter the links down to specific ones (in the example: product pages).
- From each filtered URL I grab a specific item on the page and extract its value into the sheet (in the example: the product price of the item).
I’m utilising a few advanced functions, some of which we’ll come back to later:
IMPORTXML() in conjunction with the XPath query language – this is the secret sauce of the example. This function lets us import the contents of a remote website (or other files).
REGEXMATCH() is one of three regex-related functions Google Sheets offers, which I’ll explain in a future lesson.
FILTER() is a basic function and should definitely be in your personal tool stack.
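Putting the three functions together, the pipeline might look something like this. The URL, the `/product/` pattern, and the price XPath are placeholders — you'd swap in whatever matches the site you're scraping:

```
// A2: pull every link on the page
=IMPORTXML("https://example.com", "//a/@href")

// B2: keep only the URLs that look like product pages
=FILTER(A2:A, REGEXMATCH(A2:A, "/product/"))

// C2: grab the price element from one filtered URL
=IMPORTXML(B2, "//span[@class='price']")
```

The exact XPath depends entirely on the target site's markup; inspect the page in your browser's dev tools to find the right expression.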
With these functions in place we could, in theory, build a product feed from a website we don’t even own. And with a bit of Google Apps Script we could use these functions to build a price tracker for a website. Pretty neat, right?
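A price tracker along those lines might be sketched in Apps Script like this. Everything here is an assumption for illustration — the `extractPrice` helper, the regex, the example URL, and the sheet layout are not from a real implementation:

```javascript
// Hypothetical sketch of a price tracker. The regex and function
// names are assumptions, not a finished implementation.

// Pure helper: pull the first price-looking number out of raw HTML.
function extractPrice(html) {
  var match = html.match(/class="price"[^>]*>\s*\$?([\d.,]+)/);
  return match ? parseFloat(match[1].replace(',', '')) : null;
}

// In Apps Script you would fetch the page on a time trigger and
// append a row to the sheet, roughly like:
// function trackPrice() {
//   var html = UrlFetchApp.fetch('https://example.com/product/1').getContentText();
//   SpreadsheetApp.getActiveSheet().appendRow([new Date(), extractPrice(html)]);
// }
```

Keeping the parsing logic in a pure function like `extractPrice` makes it easy to test outside of Apps Script; only the fetch-and-append part depends on the Google services.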
Update: I built another example – a Reddit rank tracker. You’ll have to copy it and set up a time trigger in Tools -> Script Editor. Also set the spreadsheet recalculation setting to “On change and every hour” (File -> Spreadsheet settings -> Calculation).
Update 2: Looking for more advanced web scraping techniques? Check out my article on web scraping with Ruby and Nokogiri.