Puppeteer Cluster Unleashing the Power of Scalable Headless Browsing for SEO

Introduction to Puppeteer Cluster

Puppeteer Cluster is a library that leverages Puppeteer, the popular headless browser automation tool, to run multiple parallel instances of headless browsers. This is particularly useful for tasks such as web scraping, automated testing, or crawling web pages, where performance and scalability are critical.

Getting Started

To install Puppeteer Cluster, you need to have Node.js installed. Run the following command to install the library:

  
    npm install puppeteer-cluster
  

Here’s a basic example to get you started:

  
    const { Cluster } = require('puppeteer-cluster');

    (async () => {
      const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 5,
      });

      await cluster.task(async ({ page, data: url }) => {
        await page.goto(url);
        const title = await page.title();
        console.log(`Title of ${url} is ${title}`);
      });

      await cluster.queue('http://www.google.com');
      await cluster.queue('http://www.github.com');

      await cluster.idle();
      await cluster.close();
    })();
  

Useful APIs

  • Cluster.launch

    Creates and launches a new cluster with specified options.

          
            const cluster = await Cluster.launch({
              concurrency: Cluster.CONCURRENCY_CONTEXT,
              maxConcurrency: 10,
              puppeteerOptions: {
                headless: true,
              },
            });
          
        
  • Cluster.task

    Defines a task for the cluster. The task should contain the code to be executed for each job.

          
            await cluster.task(async ({ page, data: url }) => {
              await page.goto(url);
              const bodyHandle = await page.$('body');
              const html = await page.evaluate(body => body.innerHTML, bodyHandle);
              console.log(html);
              await bodyHandle.dispose();
            });
          
        
  • Cluster.queue

    Adds a job to the cluster’s queue.

          
            await cluster.queue('http://www.example.com');
            await cluster.queue('http://www.wikipedia.org');
          
        
  • Cluster.idle

    Waits until all queued tasks are finished.

          
            await cluster.idle();
          
        
  • Cluster.close

    Closes the cluster and all the browser instances it manages.

          
            await cluster.close();
          
        

Example Application

Below is a more comprehensive example that demonstrates fetching the title and the first paragraph of multiple web pages:

  
    const { Cluster } = require('puppeteer-cluster');

    (async () => {
      const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 3,
      });

      await cluster.task(async ({ page, data: url }) => {
        await page.goto(url);
        const title = await page.title();
        const firstParagraph = await page.$eval('p', el => el.innerText);
        console.log(`Title: ${title}`);
        console.log(`First paragraph: ${firstParagraph}`);
      });

      const urls = [
        'http://www.google.com',
        'http://www.github.com',
        'http://www.wikipedia.org',
      ];

      for (const url of urls) {
        await cluster.queue(url);
      }

      await cluster.idle();
      await cluster.close();
    })();
  

This demo script runs a cluster to process multiple URLs concurrently, efficiently fetching and logging the desired elements from each site.

Hash: 841a252390a4102790ecb57fc628af9d3fdc24375780dd6504a43af5cfcce02e

Leave a Reply

Your email address will not be published. Required fields are marked *