Google does not allow robots or scraper scripts to fetch content from its search engine. You can read the details here. Consequently, this tutorial is meant to be instructional, not for use in a production application.
A scraper is an automated script that downloads a website’s template/HTML and then parses the content to extract data in a meaningful way. This is very similar to web crawling: Google’s search engine crawls sites to index the web and make it easy for us to find relevant content online.
This is a simple Node.js program that can run a Google News search and then extract each article’s title, description, image, and URL. The same concept can be used for a news feed inside your application or website.
Building the Scraper
First of all, you want to create a folder for the project and navigate to it in your terminal window (command prompt on Windows).
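These steps can be sketched as follows; the folder name here is just an example, so use whatever you like:

```shell
# Create the project folder and move into it.
mkdir google-news-scraper
cd google-news-scraper

# Initialise the project; -y accepts the default answers,
# or omit it to fill in the prompts yourself.
npm init -y
```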
Running npm init will create a new package.json inside your project folder.
Next, install the two dependencies with npm install request --save and npm install cheerio --save.
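The --save flag records each module in package.json. After both installs, the dependencies section should look something like this (the version numbers shown are only an illustration and will differ on your machine):

```json
{
  "dependencies": {
    "cheerio": "^1.0.0-rc.3",
    "request": "^2.88.0"
  }
}
```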
The request module allows us to make an HTTP request from the server side using Node. This is the module that will fetch Google News’s template/HTML.
The cheerio module is a library that uses jQuery-like syntax to interact with that HTML using Node.
From here, all you need to do is create a file named scraper.js and paste in the following code:
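The original code block did not survive extraction, so what follows is a minimal sketch rather than the article's exact code. The CSS selectors ('div.g', 'h3', '.st') and the tbm=nws query parameter are assumptions based on Google's markup at the time of writing; Google changes its HTML often and actively blocks bots, so expect to adjust the selectors (or be refused entirely) in practice.

```javascript
// scraper.js - a minimal sketch of the scraper described above.

// The search term becomes part of the results-page URL, exactly as it
// does when you type it into the search box. `tbm=nws` restricts the
// search to Google News (an assumption worth verifying against a real
// search URL).
function buildSearchUrl(searchTerm) {
  return 'https://www.google.com/search?q=' +
    encodeURIComponent(searchTerm) + '&tbm=nws';
}

function scrape(searchTerm) {
  // Required here (rather than at the top of the file) so that the
  // pure URL helper above works even before `npm install` has run.
  const request = require('request');
  const cheerio = require('cheerio');

  request(buildSearchUrl(searchTerm), (error, response, html) => {
    if (error || response.statusCode !== 200) {
      console.error('Request failed:', error || response.statusCode);
      return;
    }
    const $ = cheerio.load(html);
    // 'div.g' is a guess at the per-result container; inspect the
    // live page and update these selectors before relying on them.
    $('div.g').each((i, el) => {
      console.log({
        title: $(el).find('h3').text(),
        description: $(el).find('.st').text(),
        image: $(el).find('img').attr('src'),
        url: $(el).find('a').attr('href'),
      });
    });
  });
}
```

To try it out, append a call such as scrape('bitcoin') at the bottom of the file (the term is only an example) before running the script.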
You will notice that the searchUrl is constructed using a searchTerm variable. When you run a Google search, whatever you type into the search box becomes part of the URL of the results page. We are applying the same concept here.
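Concretely, assuming the term is URL-encoded into a query-string parameter (the parameter names are assumptions; check a real search URL), the mapping looks like this:

```javascript
// How a typed search term maps into the results-page URL.
const searchTerm = 'latest bitcoin news';
const searchUrl = 'https://www.google.com/search?q=' +
  encodeURIComponent(searchTerm) + '&tbm=nws';

console.log(searchUrl);
// https://www.google.com/search?q=latest%20bitcoin%20news&tbm=nws
```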
Lastly, when you run node scraper.js, you can see the console print out the first page’s results in the format we created.
That’s it! This is a simple example of a Node.js scraper. You can create your own to crawl thousands of sources and fetch valuable data for your users.
You can check out my other tutorial on setting up a basic Node.js architecture with similar functionality to Google’s popular DBaaS, Firebase, here.