Introduction to Cheerio
Cheerio is a fast, flexible, and lean implementation of core jQuery designed for the server. It provides a familiar, jQuery-style API for parsing, querying, and manipulating HTML or XML documents, which makes it a popular choice for developers who need to extract data from web pages. Cheerio works only on the markup it is given: it does not execute JavaScript or render pages.
Getting Started with Cheerio
First, you need to install Cheerio using npm:
npm install cheerio
Loading HTML
Load HTML content with the cheerio.load() function:
const cheerio = require('cheerio');
const $ = cheerio.load('<html><body><h1>Hello, World!</h1></body></html>');
console.log($('h1').text()); // Output: Hello, World!
Selecting Elements
Use familiar CSS selectors to select elements:
$('title').text();
$('.myClass').html();
$('#myId').attr('href');
Manipulating Elements
Cheerio lets you modify elements, for example by setting their text, inner HTML, or attributes:
$('h1').text('New Title');
$('.myClass').html('<span>Content</span>');
$('#myId').attr('href', 'http://example.com');
Traversing the DOM
Cheerio provides several methods for traversing DOM elements:
$('li').each(function(index, element) {
  console.log($(this).text());
});
$('a').parent().addClass('newClass');
$('ul').children().removeClass('oldClass');
Working with Forms
The serializeArray() method collects a form's submittable fields as an array of objects with name and value properties:
$('form').serializeArray().forEach(function(item) {
  console.log(item.name + ': ' + item.value);
});
Complete App Example
Let’s create a simple app that scrapes a web page and extracts all the hyperlinks:
const cheerio = require('cheerio');
const axios = require('axios');

// Fetch a page and return the href of every <a> element that has one.
async function scrapeLinks(url) {
  try {
    const { data } = await axios.get(url);
    const $ = cheerio.load(data);
    const links = [];
    // Select only anchors with an href so no undefined values are collected.
    $('a[href]').each((index, element) => {
      links.push($(element).attr('href'));
    });
    return links;
  } catch (error) {
    console.error('Error scraping links:', error);
    return []; // Return an empty list so callers always receive an array.
  }
}

// Example usage
scrapeLinks('http://example.com').then((links) => {
  console.log('Extracted links:', links);
});
Conclusion
Cheerio is a valuable tool for web scraping and DOM manipulation. Its jQuery-style API makes it straightforward to extract and modify data in HTML and XML documents on the server.