This tutorial shows you how to scrape web pages using Google Spreadsheets instead of dedicated web scraping tools. The suggested method is most useful for smaller sets of URLs.
What is Web Scraping?
Web scraping is the process of extracting specific information from web pages.
How to Scrape a Website?
You can use data extraction in Screaming Frog (SEO Spider Tool). However, if you don’t want to pay for a Screaming Frog license (the free version is limited to 500 URLs), or if you just don’t like that tool (that’s hard to believe), then you may be looking for something else.
There are some great alternatives such as Deepcrawl and Import.io, or you can scrape websites with a Python library. However, I’d like to show you how to use a Google Spreadsheet to extract data from a list of URLs. It’s quite simple, and you only need to understand two things to do it:
XPath
ImportXML
XPath
XPath helps you navigate through an XML document and locate specific elements and attributes. It’s a simple way to select an element on an HTML page.
Many articles have been written about XPath, and I’ll refer you to W3Schools to learn the basics if you’re not familiar with it just yet.
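To give a quick, simplified illustration (this is a made-up HTML snippet, not taken from Moz):

<article>
  <h1>My Post Title</h1>
  <a class="author">Jane Doe</a>
</article>

Here, //h1 selects the heading wherever it appears in the document, and //article/a[@class="author"] selects the author link inside the article.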
Let’s imagine I want to list all blog posts published on Moz’s blog this year, along with the author names and the number of comments.
Title: //h1
Author’s name: //*[@id="wrap"]/main[1]/article/header/div/div/div[1]/div[2]/a
The number of comments: //*[@id="comments"]/div[1]/div/div[1]/h2/span
It’s always important to understand the concept so you can use it in different situations, but in some cases you can cheat a little and let Chrome Developer Tools get the XPath for you: right-click the element, choose Inspect, then right-click the highlighted node and pick Copy > Copy XPath.
Copying XPath via Chrome Developer Tools doesn’t work every time, though. In my tests, this shortcut failed in about 25% of cases (usually because the DOM the browser renders differs from the raw HTML), so I still recommend learning and understanding XPath properly.
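If you want to test an expression before dropping it into a spreadsheet, Chrome’s DevTools console can evaluate XPath directly (note that it runs against the rendered DOM, not the raw HTML a scraper fetches):

$x("//h1")
$x("//*[@id='comments']/div[1]/div/div[1]/h2/span")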
ImportXML
The magic function that allows importing data from other documents is IMPORTXML.
Unfortunately, IMPORTXML doesn’t execute JavaScript, so you won’t be able to use this function if the content of the page is generated by JavaScript (jQuery, etc.).
How to use ImportXML
=IMPORTXML(URL,XPATH)
URL – The URL of the document from which you want to extract information.
XPATH – The XPath query for the element you want to extract.
To get the heading of Mike King’s post, you can use the formula below.
=IMPORTXML("https://moz.com/blog/the-technical-seo-renaissance","//h1")
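In practice, you’ll usually point the formula at a cell that holds the URL, so you can copy it down next to a whole column of URLs (the cell reference A2 here is just an assumption about where your URLs sit):

=IMPORTXML(A2,"//h1")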
How to get a list of URLs? Sitemap.xml!
As you can see, extracting information from a single URL is easy, but for one page it would still be quicker to do it manually (copy and paste). The real benefit of this function shows when you want to extract information from a whole list of URLs.
If the goal is to import information from multiple pages on the same domain, the sitemap.xml file is here to help.
There are a bunch of tools and guides on how to get a list of all URLs from a sitemap.xml file. I’m sure you can find an easy one, but I’d like to use IMPORTXML again because it’s simple, right?
Knowing XPath, it took just 10 seconds to write a formula to import all the URLs:
=IMPORTXML("https://moz.com/blog-sitemap.xml", "//url/loc")
The formula above would work for many sitemap files, but this time, I got a nice error message:
Error: Imported content is empty.
I was quite sure the XPath was correct, so I did some quick research on the Internet. Long story short: a namespace is declared at the beginning of the sitemap, so you can’t request //url/loc directly.
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
If you’re interested in a detailed explanation, read Ben Ellwood’s post; he wrote about this first. If not, just use the XPath below:
//*[local-name()='url']/*[local-name()='loc']
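Combined with the sitemap URL from above, the working formula looks like this:

=IMPORTXML("https://moz.com/blog-sitemap.xml", "//*[local-name()='url']/*[local-name()='loc']")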
Please note that I don’t scrape Moz’s sitemap in the spreadsheet linked at the end of this post. The spreadsheet should be considered an example, not a final solution.
Google Docs Caching
Google caches the results for about two hours, so to get updated results you either need to add a fragment identifier (#) to the URL or simply wait a while.
Original URL: https://moz.com/blog-sitemap.xml
Modified URL: https://moz.com/blog-sitemap.xml#test1
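One way to make this trick less manual (a sketch, assuming a helper cell B1 that you change whenever you want a fresh fetch) is to build the fragment from a cell:

=IMPORTXML("https://moz.com/blog-sitemap.xml#" & B1, "//*[local-name()='url']/*[local-name()='loc']")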
Conclusion
From the spreadsheet I just created, and with a little help from pivot tables, I can see which blog posts had the highest engagement (the number of comments), find out which topics the SEO community is most excited about, identify the most popular authors, etc. It’s up to you what data you want to import and why.
All the formulas in the spreadsheet must be adjusted to your needs before use, because each site has a different structure. I just wanted to show you an example of how to use IMPORTXML, because I believe every SEO magician should be familiar with this function.
IMPORTXML is a powerful function, and I’m only scratching the surface with my example. I’d love to hear about some cool ways and ideas on how to use this function.
UPDATE: If you want to know more, there is a great guide to ImportXML from Distilled.