Introduction to Puppeteer Cluster
Puppeteer Cluster is a library built on Puppeteer, the popular headless browser automation tool. It runs a pool of Puppeteer workers in parallel, as separate pages, browser contexts, or browser instances depending on the chosen concurrency mode. This is particularly useful for web scraping, automated testing, and crawling, where throughput and scalability are critical.
Getting Started
To install Puppeteer Cluster, you need to have Node.js installed. Puppeteer itself is a separate dependency, so install both packages:
npm install puppeteer puppeteer-cluster
Here’s a basic example to get you started:
const { Cluster } = require('puppeteer-cluster');

(async () => {
  // Run up to 5 jobs at once, each in its own incognito browser context.
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 5,
  });

  // The task runs once per queued item; the item arrives as `data`.
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const title = await page.title();
    console.log(`Title of ${url} is ${title}`);
  });

  await cluster.queue('http://www.google.com');
  await cluster.queue('http://www.github.com');

  // Wait for the queue to drain, then shut everything down.
  await cluster.idle();
  await cluster.close();
})();
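Because the task callback is an ordinary async function, it can be pulled out of the cluster setup and exercised against a stand-in page object before a real browser is involved. The sketch below is one way to do that, not the library's prescribed structure. The names `logTitle` and `main` are hypothetical, and the `taskerror` listener follows the event described in the puppeteer-cluster README, so check it against the version you have installed.

```javascript
// Task handler extracted from the cluster setup so it can be reused and tested.
// `logTitle` is a hypothetical name; page.goto and page.title are Puppeteer Page methods.
async function logTitle({ page, data: url }) {
  await page.goto(url);
  const title = await page.title();
  const message = `Title of ${url} is ${title}`;
  console.log(message);
  return message;
}

// Wiring the handler into a cluster. Requires puppeteer and puppeteer-cluster
// to be installed; the require is kept inside the function so the handler
// above can be loaded without them.
async function main(urls) {
  const { Cluster } = require('puppeteer-cluster');
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 5,
  });

  // Report jobs whose task function threw instead of losing the error.
  cluster.on('taskerror', (err, data) => {
    console.error(`Error while processing ${data}: ${err.message}`);
  });

  await cluster.task(logTitle);
  for (const url of urls) {
    await cluster.queue(url);
  }
  await cluster.idle();
  await cluster.close();
}
```

Calling `main(['http://www.google.com', 'http://www.github.com'])` would run the same two jobs as the basic example, with failures written to standard error.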
Useful APIs
- Cluster.launch
Creates and launches a new cluster with specified options.
const cluster = await Cluster.launch({
  concurrency: Cluster.CONCURRENCY_CONTEXT,
  maxConcurrency: 10,
  puppeteerOptions: {
    headless: true,
  },
});
- cluster.task
Defines the task for the cluster. The task function contains the code to be executed for each job and receives the page along with the queued data.
await cluster.task(async ({ page, data: url }) => {
  await page.goto(url);
  const bodyHandle = await page.$('body');
  const html = await page.evaluate(body => body.innerHTML, bodyHandle);
  console.log(html);
  await bodyHandle.dispose();
});
- cluster.queue
Adds a job to the cluster’s queue. The queued value is passed to the task function as data.
await cluster.queue('http://www.example.com');
await cluster.queue('http://www.wikipedia.org');
- cluster.idle
Waits until all queued tasks are finished.
await cluster.idle();
- cluster.close
Closes the cluster and all the browser instances it manages.
await cluster.close();
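The five calls above form a lifecycle: launch, define the task, queue jobs, wait for idle, close. A minimal sketch of that lifecycle as a reusable helper follows. It assumes the `Cluster` class is passed in as an argument, which is a choice made here for testability rather than something the library requires. `buildOptions` and `runJobs` are hypothetical names and the option values are illustrative. The `try`/`finally` makes sure the browsers are closed even if queuing or waiting throws.

```javascript
// Merge illustrative defaults with caller overrides.
// `Cluster` is the class exported by puppeteer-cluster, passed in by the caller.
function buildOptions(Cluster, overrides = {}) {
  return {
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 10,
    puppeteerOptions: { headless: true },
    ...overrides,
  };
}

// Launch a cluster, run `task` once for every URL, and always close the cluster.
async function runJobs(Cluster, urls, task, overrides = {}) {
  const cluster = await Cluster.launch(buildOptions(Cluster, overrides));
  try {
    await cluster.task(task);
    for (const url of urls) {
      await cluster.queue(url);
    }
    // Resolves once the queue has been worked off.
    await cluster.idle();
  } finally {
    // Runs on success and on failure, so no browser process is left behind.
    await cluster.close();
  }
}
```

With the real library this could be called as `runJobs(require('puppeteer-cluster').Cluster, urls, async ({ page, data }) => { await page.goto(data); }, { maxConcurrency: 2 })`.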
Example Application
Below is a more comprehensive example that demonstrates fetching the title and the first paragraph of multiple web pages:
const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 3,
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const title = await page.title();
    // page.$eval throws when no <p> matches, so look the element up first.
    const paragraph = await page.$('p');
    const firstParagraph = paragraph
      ? await paragraph.evaluate(el => el.innerText)
      : '(no paragraph found)';
    console.log(`Title: ${title}`);
    console.log(`First paragraph: ${firstParagraph}`);
  });

  const urls = [
    'http://www.google.com',
    'http://www.github.com',
    'http://www.wikipedia.org',
  ];

  for (const url of urls) {
    await cluster.queue(url);
  }

  // Wait for every queued job, then shut the browsers down.
  await cluster.idle();
  await cluster.close();
})();
This demo script processes the URLs concurrently, up to three at a time (maxConcurrency: 3), and logs the title and first paragraph of each page.
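Logging from inside the task is fine for a demo, but a caller often wants the extracted values back. A sketch of one approach follows, assuming `cluster.execute`, which the puppeteer-cluster README describes as resolving with the task function's return value and rejecting if the task throws. `extractSummary` and `collectSummaries` are hypothetical names introduced for illustration.

```javascript
// Task that returns its findings instead of logging them.
async function extractSummary({ page, data: url }) {
  await page.goto(url);
  const title = await page.title();
  // page.$ resolves to null when nothing matches, so a page without <p> is not an error.
  const paragraph = await page.$('p');
  const firstParagraph = paragraph
    ? await paragraph.evaluate((el) => el.innerText)
    : null;
  return { url, title, firstParagraph };
}

// Run the task for every URL and gather one result object per URL, in input order.
async function collectSummaries(cluster, urls) {
  await cluster.task(extractSummary);
  // allSettled keeps one failing URL from discarding the other results.
  const settled = await Promise.allSettled(
    urls.map((url) => cluster.execute(url))
  );
  return settled.map((result, i) =>
    result.status === 'fulfilled'
      ? result.value
      : { url: urls[i], error: String(result.reason) }
  );
}
```

Given a cluster launched as in the example above, `const summaries = await collectSummaries(cluster, urls);` would yield an array that can be written to a file or passed on, and the cluster can then be closed with `cluster.close()`.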