I stumbled across a scraper and crawler framework written in Go called Colly. Colly makes it really easy to scrape content from web pages thanks to its speed and simple interface. I have always been interested in web scrapers ever since I built one for a university project, which you can read about here. Before continuing, please note that scraping websites is not always allowed, and is sometimes even illegal. In the guide below we will be parsing this blog, GoPHP.io.
To begin, let's take a look at the Colly GitHub page and scroll down to the example code listed there. We will create a new project with a new main.go file based on that example.
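At the time of writing, the basic example from the Colly README looks roughly like this (the example may have changed slightly since):

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate the default collector
	c := colly.NewCollector()

	// On every <a> element that has an href attribute, follow the link
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit the link found on the page
		e.Request.Visit(link)
	})

	// Print the URL of every request as it is made
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start scraping on the Colly homepage
	c.Visit("http://go-colly.org/")
}
```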
You may need to run go get -u github.com/gocolly/colly/... to download the framework into your Go workspace. Now let's go ahead and change the URL to the gophp.io website.
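In the example above, that means replacing the final Visit call:

```go
	// Start the crawl at this blog instead of the Colly homepage
	c.Visit("https://gophp.io/")
```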
Then we can run the script by typing go run main.go in the terminal, making sure we are in the project directory. You can press Ctrl+C to cancel it, as it may run for a long time. What do we get as our output? For me it looked like this:
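The exact lines depend on which links happen to be on the page when you run it, so treat the URLs and titles below as illustrative; the shape of the output is what matters:

```
Visiting https://gophp.io/
Link found: "Some Post Title" -> https://gophp.io/some-post/
Visiting https://gophp.io/some-post/
Link found: "VirtualBox" -> https://www.virtualbox.org/
Visiting https://www.virtualbox.org/
...
```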
What we see here is exactly what you would expect. Our program parsed all the URLs on the main gophp.io page and then proceeded to the first link. That first link is a post on gophp.io, but the first link on that page points to VirtualBox, and our program will keep following links until it stops finding new ones. That could take a long time, and unless you want to build a search engine spider it won't be very efficient. What I want is a server that I can call from a PHP script that simply fetches and formats the data I need. Luckily, Colly ships with a complete example of exactly what we need: a scraper server.
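The version below is lightly adapted from the scraper_server example in Colly's _examples directory; the code in the repository may differ in small details:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	"github.com/gocolly/colly"
)

// pageInfo holds everything we collect for a single page
type pageInfo struct {
	StatusCode int
	Links      map[string]int
}

func handler(w http.ResponseWriter, r *http.Request) {
	URL := r.URL.Query().Get("url")
	if URL == "" {
		log.Println("missing URL argument")
		return
	}
	log.Println("visiting", URL)

	c := colly.NewCollector()
	p := &pageInfo{Links: make(map[string]int)}

	// Count every outgoing link on the page
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Request.AbsoluteURL(e.Attr("href"))
		if link != "" {
			p.Links[link]++
		}
	})

	// Record the HTTP status code of the response
	c.OnResponse(func(r *colly.Response) {
		p.StatusCode = r.StatusCode
	})
	c.OnError(func(r *colly.Response, err error) {
		log.Println("error:", r.StatusCode, err)
		p.StatusCode = r.StatusCode
	})

	c.Visit(URL)

	// Serialize the collected data as JSON
	b, err := json.Marshal(p)
	if err != nil {
		log.Println("failed to serialize response:", err)
		return
	}
	w.Header().Add("Content-Type", "application/json")
	w.Write(b)
}

func main() {
	// Example usage: curl -s 'http://127.0.0.1:7171/?url=https://gophp.io/'
	addr := ":7171"
	http.HandleFunc("/", handler)
	log.Println("listening on", addr)
	log.Fatal(http.ListenAndServe(addr, nil))
}
```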
What does the above code do? It starts a web server running locally on your machine on port 7171. The server takes a url query parameter and returns all the links found on the page you pass in. Let's give it a go by visiting http://127.0.0.1:7171/?url=https://gophp.io/. Here is an example of the JSON-encoded output we get:
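The field names come from the pageInfo struct above; the URLs and counts shown here are only illustrative:

```json
{
  "StatusCode": 200,
  "Links": {
    "https://gophp.io/": 2,
    "https://gophp.io/about/": 1,
    "https://gophp.io/some-post/": 1
  }
}
```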
The above JSON output is only one level deep; notice that the server does not keep following links on the pages it finds. This is great, because it means we can use this program as a kind of microservice. A PHP application could call this microservice and receive all the links for the specified URL, which it could then process further. Now, links are good, but we might want to parse other content on the page. Let's customize our code for this purpose.
Queries For Specific Content With Colly
If we take a look at the source of gophp.io, we can see that every title has the CSS class entry-title, which we can use for our query. We will modify the handler function by adding another map, this time for headings. I am only including the section of code that I have changed below:
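This is a minimal sketch of the changes, assuming the pageInfo struct and handler from the scraper server above:

```go
// Add a Headings map alongside the existing Links map
type pageInfo struct {
	StatusCode int
	Links      map[string]int
	Headings   map[string]int
}

// ...inside handler, initialize both maps...
p := &pageInfo{
	Links:    make(map[string]int),
	Headings: make(map[string]int),
}

// Collect the text of every element carrying the entry-title class
c.OnHTML(".entry-title", func(e *colly.HTMLElement) {
	p.Headings[e.Text]++
})
```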
Now if we restart our program and request our page on port 7171 again, we will see some additional output in the JSON response.
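With the sketch above, the response gains a Headings key; the post titles shown here are hypothetical:

```json
{
  "StatusCode": 200,
  "Links": {
    "https://gophp.io/some-post/": 1
  },
  "Headings": {
    "Example Post Title": 1,
    "Another Post Title": 1
  }
}
```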
As you can see, we have now parsed all the titles on the page and added them to our JSON output. Using queries like this, we can build very general or very specific parsers for any kind of website.
I hope this guide helps someone get started with web scraping. There are several real-world examples in the Colly documentation if you would like to learn more. I would love to hear your feedback, questions, and comments below!