Master Web Scraping and Automation with Puppeteer Cluster for Efficient Performance in 2023

Introduction to Puppeteer Cluster

Puppeteer Cluster (published on npm as puppeteer-cluster) is a Node.js library that runs multiple Puppeteer workers in parallel. It manages a pool of pages, browser contexts, or browsers and distributes queued jobs across them, so scraping and automation tasks that would otherwise run one page at a time can run concurrently.
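
To make the idea of bounded concurrency concrete, here is a minimal sketch in plain Node.js with no browser involved. It is not how puppeteer-cluster is implemented; runWithLimit and its peak counter are hypothetical names used only for illustration. The point is that at most limit jobs are in flight at once, which is roughly what maxConcurrency controls in the examples that follow.

```javascript
// Conceptual sketch only: a tiny worker pool in plain Node.js.
// runWithLimit is a hypothetical helper, not part of puppeteer-cluster.
async function runWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;   // index of the next item to hand out
  let active = 0; // jobs currently running
  let peak = 0;   // highest number of jobs observed running at once

  async function runner() {
    // Each runner keeps pulling items until none are left.
    while (next < items.length) {
      const index = next++;
      active++;
      peak = Math.max(peak, active);
      try {
        results[index] = await worker(items[index]);
      } finally {
        active--;
      }
    }
  }

  // Start at most `limit` runners and wait for all of them to finish.
  const count = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: count }, () => runner()));
  return { results, peak };
}
```

Calling runWithLimit(urls, 2, fetchOne) with a hypothetical fetchOne function would process the list two items at a time while keeping the results in input order.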

Getting Started

Install the library together with Puppeteer by running npm install puppeteer puppeteer-cluster. The script that follows launches a cluster, registers a task, queues two URLs, and prints each page title.

  
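    const { Cluster } = require('puppeteer-cluster');

    (async () => {
      // Launch a cluster that runs up to two jobs at a time,
      // each in its own page of a shared browser.
      const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_PAGE,
        maxConcurrency: 2,
      });

      // Define what happens for every queued URL.
      await cluster.task(async ({ page, data: url }) => {
        await page.goto(url);
        const title = await page.title();
        console.log(`Title of ${url} is ${title}`);
      });

      // Queue two jobs; they are picked up as workers become free.
      cluster.queue('http://www.wikipedia.org/');
      cluster.queue('http://www.google.com/');

      // Wait for the queue to drain, then shut the browser down.
      await cluster.idle();
      await cluster.close();
    })();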
  

Useful APIs

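Cluster.launch

Creates and starts a new cluster. The concurrency option selects how workers are isolated (Cluster.CONCURRENCY_PAGE opens one page per job inside a shared browser), and maxConcurrency caps how many jobs run at the same time.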

  
    const cluster = await Cluster.launch({
      concurrency: Cluster.CONCURRENCY_PAGE,
      maxConcurrency: 2,
    });
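
Cluster.launch accepts more options than the two shown above. The sketch below gathers a few that puppeteer-cluster documents (timeout, retryLimit, retryDelay, puppeteerOptions) into one place. buildLaunchOptions is a hypothetical helper, and the values are illustrative assumptions rather than recommendations.

```javascript
// Hypothetical helper: returns launch options with a few overridable settings.
// `Cluster` is passed in so the helper only depends on its constants.
function buildLaunchOptions(Cluster, overrides = {}) {
  return {
    // One page per job in a shared browser. The library also offers
    // CONCURRENCY_CONTEXT (incognito contexts) and CONCURRENCY_BROWSER.
    concurrency: Cluster.CONCURRENCY_PAGE,
    maxConcurrency: 2,                    // jobs allowed to run at once
    timeout: 30000,                       // per-job timeout in milliseconds
    retryLimit: 1,                        // retry a failed job once
    retryDelay: 1000,                     // wait 1 s before retrying
    puppeteerOptions: { headless: true }, // forwarded to Puppeteer's launch
    ...overrides,                         // caller-supplied values win
  };
}

// Possible usage (requires puppeteer-cluster to be installed):
// const { Cluster } = require('puppeteer-cluster');
// const cluster = await Cluster.launch(
//   buildLaunchOptions(Cluster, { maxConcurrency: 4 })
// );
```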
  

cluster.task

Registers the function that runs for every job in the queue. The function receives an object containing the Puppeteer page and the queued data.

  
    await cluster.task(async ({ page, data: url }) => {
      await page.goto(url);
      const title = await page.title();
      console.log(`Title of ${url} is ${title}`);
    });
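
Logging inside the task is fine for a demo, but you often want the scraped value back. One option, assuming the task function returns a value, is cluster.execute, which queues a job and resolves with whatever the task returned. collectTitles below is a hypothetical helper written for this article.

```javascript
// Hypothetical helper: resolves to [{ url, title }] in the same order as `urls`.
// Assumes the function registered with cluster.task() returns the page title.
// If any job throws, the promise returned here rejects with that error.
async function collectTitles(cluster, urls) {
  const titles = await Promise.all(urls.map((url) => cluster.execute(url)));
  return urls.map((url, i) => ({ url, title: titles[i] }));
}

// Possible usage with a real cluster:
// await cluster.task(async ({ page, data: url }) => {
//   await page.goto(url);
//   return page.title(); // returned value becomes the result of execute()
// });
// const rows = await collectTitles(cluster, ['http://www.wikipedia.org/']);
```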
  

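cluster.queue

Adds a job to the queue. The argument, typically a URL, is passed to the task function as data. A function can also be supplied, in which case it runs for that job in place of the default task.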

  
    cluster.queue('http://www.wikipedia.org/');
    cluster.queue('http://www.google.com/');
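
Queued data does not have to be a string. As a sketch, the hypothetical queueJobs helper below queues plain objects and, for jobs flagged with an assumed screenshot property, passes a second argument to cluster.queue so that the job runs a different function instead of the default task.

```javascript
// Hypothetical helper: queues jobs shaped like { url, screenshot? }.
// `screenshotTask` is an assumed per-job function supplied by the caller.
function queueJobs(cluster, jobs, screenshotTask) {
  for (const job of jobs) {
    if (job.screenshot) {
      // Second argument: run this function for the job, not the default task.
      cluster.queue(job, screenshotTask);
    } else {
      // The default task receives the object as `data`.
      cluster.queue(job);
    }
  }
}

// Inside a task, read the fields from `data`:
// await cluster.task(async ({ page, data }) => {
//   await page.goto(data.url);
// });
```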
  

cluster.idle

Returns a promise that resolves once the queue is empty and every running job has finished.

  
    await cluster.idle();
  

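cluster.close

Closes the cluster together with every browser it opened and frees the associated resources. Call it after cluster.idle() so that queued jobs have a chance to finish.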

  
    await cluster.close();
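
If something throws between launch and close, the browser can be left running. Here is a hedged sketch of one way to guard against that: withCluster (a hypothetical helper) logs failed jobs through the cluster's taskerror event and always calls close in a finally block.

```javascript
// Hypothetical helper: runs `work`, waits for the queue to drain,
// and closes the cluster even when `work` throws.
async function withCluster(cluster, work, log = console.error) {
  // Emitted when a queued job throws; `data` is what was queued.
  cluster.on('taskerror', (err, data) => {
    log(`Error while processing ${JSON.stringify(data)}: ${err.message}`);
  });

  try {
    await work(cluster);  // e.g. register the task and queue URLs
    await cluster.idle(); // wait until every job has finished
  } finally {
    await cluster.close(); // always release the browser
  }
}

// Possible usage with a real cluster:
// const cluster = await Cluster.launch({ maxConcurrency: 2 });
// await withCluster(cluster, async (c) => {
//   await c.task(async ({ page, data: url }) => { await page.goto(url); });
//   c.queue('http://www.wikipedia.org/');
// });
```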
  

Complete Example App using Puppeteer Cluster

  
    const { Cluster } = require('puppeteer-cluster');

    (async () => {
      const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_PAGE,
        maxConcurrency: 2,
      });

      // Task to scrape titles from URLs
      await cluster.task(async ({ page, data: url }) => {
        await page.goto(url);
        const title = await page.title();
        console.log(`Title of ${url} is ${title}`);
      });

      // Adding URLs to the queue
      const urls = [
        'http://www.wikipedia.org/',
        'http://www.google.com/',
        'http://www.github.com/',
        'http://www.stackoverflow.com/'
      ];

      urls.forEach(url => cluster.queue(url));

      // Wait for the cluster to finish
      await cluster.idle();
      await cluster.close();
    })();
  

Puppeteer Cluster takes care of worker management, queuing, and concurrency limits, which leaves the task function as the only scraping logic you need to write and makes it practical to scale from a handful of URLs to a long queue.
