
Web scraping with Puppeteer - Quick Start
Puppeteer is a Node.js library from the Google Chrome team that lets us control a headless Chrome instance. With Puppeteer you can take screenshots, track page loading performance, generate PDFs from web pages, scrape web pages, and much more.
Marcin Łącki |
30 Sep 2020
# Prerequisites
Node.js is installed on your computer.
# Installing Puppeteer
Create a project folder `MyTestProject` and run the command below inside it:
```bash
npm install puppeteer
```
# Catch the Screen
In the project folder, create a file named `Screenshot.js`:
```javascript
const puppeteer = require('puppeteer');
console.log("Hello j-labs");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.j-labs.pl');
  await page.screenshot({path: 'jlabs.png'});
  await browser.close();
})();
```
## Screenshot.js file
First, we load the Puppeteer library into the script:
```javascript
const puppeteer = require('puppeteer');
```
This line of code contains a traditional "Hello World". It will be displayed in the console window:
```javascript
console.log("Hello j-labs");
```
The browser instance is created with the `launch()` method. To get a page object, call `newPage()` on the browser object:
```javascript
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
})();
```
Navigate to the page you want to start with:
```javascript
await page.goto('https://www.j-labs.pl');
```
Take a screenshot and save it in your project folder:
```javascript
await page.screenshot({path: 'jlabs.png'});
```
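By default the screenshot covers only the visible viewport. The sketch below shows one way to build the options object so that the whole page is captured and the file name follows the site's host name. `screenshotOptions` is a hypothetical helper introduced here for illustration; `path` and `fullPage` are options accepted by `page.screenshot()`.

```javascript
// Hypothetical helper (not part of Puppeteer): derives screenshot options
// from the URL of the page being captured.
function screenshotOptions(url) {
  const { hostname } = new URL(url);
  return {
    path: `${hostname}.png`, // file name based on the host, e.g. www.j-labs.pl.png
    fullPage: true           // capture the full scrollable page, not only the viewport
  };
}

// Usage inside the async function from Screenshot.js:
// await page.screenshot(screenshotOptions('https://www.j-labs.pl'));
console.log(screenshotOptions('https://www.j-labs.pl'));
```

Keeping the options in a small function makes it easy to reuse the same settings when capturing several pages.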
It's important to wait for the browser to close:
```javascript
await browser.close();
```
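If `goto()` or `screenshot()` throws, the script above never reaches `browser.close()` and the Chrome process may be left running. A minimal sketch of one way to guard against that is a `try`/`finally` wrapper. `withBrowser` is a hypothetical helper, not a Puppeteer API; it takes the launch function as a parameter so it does not depend on how the browser is started.

```javascript
// Hypothetical helper (not part of Puppeteer): runs `task` with a browser
// and always closes the browser afterwards, even when `task` throws.
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close();
  }
}

// Usage sketch (assumes `npm install puppeteer` has been run):
// const puppeteer = require('puppeteer');
// withBrowser(() => puppeteer.launch(), async (browser) => {
//   const page = await browser.newPage();
//   await page.goto('https://www.j-labs.pl');
//   await page.screenshot({ path: 'jlabs.png' });
// }).catch(console.error);
```

The error still propagates to the caller, so it can be logged or handled there; the wrapper only guarantees the cleanup.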
# Let's grab the author and post date from the j-labs blog
The script below scrapes the post date and the author from the first page of the j-labs blog. First we need to find the right selector on the page, which is as easy as opening Chrome Developer Tools and inspecting the element. In our case it's `div.NewsSummaryPostdate`.
```javascript
const puppeteer = require("puppeteer");
var fs = require("fs");
(async () => {
  try {
    // open the headless browser
    var browser = await puppeteer.launch({ headless: true });
    // open a new page
    var page = await browser.newPage();
    // navigate to the blog and wait until the target elements are rendered
    await page.goto(`https://blog.j-labs.pl/`);
    await page.waitForSelector("div.NewsSummaryPostdate");
    // this callback runs inside the page, not in Node.js
    var news = await page.evaluate(() => {
      var contentList = document.querySelectorAll(`div.NewsSummaryPostdate`);
      var dataArray = [];
      for (var i = 0; i < contentList.length; i++) {
        dataArray[i] = {
          title: contentList[i].innerText.trim()
        };
      }
      return dataArray;
    });
    // console.log(news);
    await browser.close();
    console.log("Browser Closed");
    // Writing the news inside a json file
    fs.writeFile("output.json", JSON.stringify(news), function(err) {
      if (err) throw err;
      console.log("Output Saved");
    });
  } catch (err) {
    // Catch and display errors
    console.error(err);
    // browser is undefined if launch() itself failed
    if (browser) await browser.close();
    console.error("Browser closed with error");
  }
})();
```
The `fs` module is needed to save the output to a file:
```javascript
var fs = require("fs");
```
The line below gets a list of nodes matching the `div.NewsSummaryPostdate` selector:
```javascript
var contentList = document.querySelectorAll(`div.NewsSummaryPostdate`);
```
In the `for` loop, we get `innerText` for each node:
```javascript
for (var i = 0; i < contentList.length; i++) {
  dataArray[i] = {
    title: contentList[i].innerText.trim()
  };
}
```
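The same extraction can also be written with `page.$$eval()`, which runs `Array.from(document.querySelectorAll(selector))` in the page and passes the resulting array to a callback. The sketch below is an alternative under that assumption, not the script used in this article; `toRecords` and `scrapePostDates` are hypothetical names.

```javascript
// Runs inside the page: turns the matched nodes into plain, serializable records.
// It must not reference any variables from the Node.js scope.
function toRecords(nodes) {
  return nodes.map((node) => ({ title: node.innerText.trim() }));
}

// Hypothetical wrapper: `page` is a Puppeteer Page that has already navigated
// to the blog and waited for the selector.
async function scrapePostDates(page) {
  return page.$$eval('div.NewsSummaryPostdate', toRecords);
}

// Usage inside the async function of the scraper:
// const news = await scrapePostDates(page);
```

Because `toRecords` only needs objects with an `innerText` property, it can be checked in plain Node.js with stand-in objects before running it against a real page.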
Time to save the output to a file:
```javascript
fs.writeFile("output.json", JSON.stringify(news), function(err) {
  if (err) throw err;
  console.log("Output Saved");
});
```
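Note that an error thrown inside the `writeFile` callback is not caught by a surrounding `try`/`catch`, because the callback runs later. One possible alternative, sketched below, is the synchronous `fs.writeFileSync`, which throws directly. `saveJson` is a hypothetical helper, and the sample record is made-up placeholder data written to the system temp folder rather than real scraper output.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: writes `data` as pretty-printed JSON and returns the path.
// writeFileSync throws on failure, so a surrounding try/catch can handle the error.
function saveJson(filePath, data) {
  fs.writeFileSync(filePath, JSON.stringify(data, null, 2), 'utf8');
  return filePath;
}

// Placeholder records shaped like the scraper's output (not real data).
const sample = [{ title: 'example entry' }];
const target = path.join(os.tmpdir(), 'output.json');
saveJson(target, sample);

// Read the file back to confirm it contains valid JSON.
console.log(JSON.parse(fs.readFileSync(target, 'utf8')));
```

The extra arguments to `JSON.stringify` indent the output by two spaces, which makes the file easier to read by hand.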
# Summary
Puppeteer is a tool with many possibilities. You can use it for page performance tracking, automated testing, crawling single-page apps, and more. Automated tests written with Puppeteer often run faster than comparable Selenium tests. Its two main disadvantages are that it mainly targets Chrome and Chromium and that, as a Node.js library, it supports only JavaScript.
## Useful links
[https://pptr.dev/](https://pptr.dev/)
[https://github.com/puppeteer/puppeteer](https://github.com/puppeteer/puppeteer)