Introduction to X-Ray Crawler
X-Ray Crawler is a web scraping library that helps developers extract data from websites. Its composable API lets you navigate complex sites and obtain structured data. This guide introduces the essential features of X-Ray Crawler and walks through its core APIs with accompanying code snippets.
Getting Started
First, install X-Ray Crawler (published on npm as x-ray):
npm install x-ray
Basic Usage
Here’s a simple example of using X-Ray to scrape the titles of articles from a webpage:
const Xray = require('x-ray');
const x = Xray();
x('https://example.com', '.article', [{
  title: 'h2'
}])
  .then(result => {
    console.log(result);
  });
Scraping Multiple Fields
Scrape multiple data fields from a single page:
x('https://example.com', '.article', [{
  title: 'h2',
  summary: 'p.summary'
}])
  .then(result => {
    console.log(result);
  });
Paginated Scraping
Extract data from multiple pages efficiently:
x('https://example.com', {
  articles: x('.article', [{
    title: 'h2',
    link: 'a@href'
  }]),
  nextPage: '.next@href'
})(function(err, obj) {
  if (err) throw err;
  console.log(obj.articles);
  if (obj.nextPage) {
    // Follow the pagination link here
  }
});
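Rather than following next-page links by hand, x-ray also ships with built-in pagination: chain .paginate() with the next-link selector and cap the crawl with .limit(). A minimal sketch, using the placeholder URL and selectors from this guide; the scraper is wrapped in a function that is never invoked here, so loading the file makes no network request:

```javascript
// Built-in pagination sketch. The function is defined but not called, so
// this file loads without network access; x-ray is required lazily for
// the same reason.
function scrapeAllPages() {
  const Xray = require('x-ray');
  const x = Xray();

  return x('https://example.com', '.article', [{
    title: 'h2',
    link: 'a@href'
  }])
    .paginate('.next@href') // selector for the "next page" link
    .limit(3);              // stop after three pages
}
```

Calling scrapeAllPages()(function (err, articles) { ... }) yields one flat array of articles collected across all crawled pages.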
Advanced Selection
Use advanced CSS selectors to target specific elements:
x('https://example.com', {
  articles: x('.article', [{
    title: '.title',
    date: '.date',
    author: '.author'
  }])
})
  .then(result => {
    console.log(result);
  });
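The Xray constructor also accepts a filters option: named functions you can apply to an extracted value inside a selector with a pipe, e.g. '.title | trim'. A sketch, again with a placeholder URL and hypothetical selectors, and with the scraper wrapped in an uninvoked function:

```javascript
// Pure filter functions: x-ray applies them to each extracted value in order.
const trim = value => (typeof value === 'string' ? value.trim() : value);
const uppercase = value => (typeof value === 'string' ? value.toUpperCase() : value);

// Wrapped so the sketch loads without network access; x-ray is required
// lazily inside the function for the same reason.
function makeFilteredScraper() {
  const Xray = require('x-ray');
  const x = Xray({ filters: { trim, uppercase } });

  return x('https://example.com', '.article', [{
    title: '.title | trim | uppercase',
    date: '.date | trim'
  }]);
}
```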
Scraping Lists
Scrape lists of items easily:
x('https://example.com', '.list-item', [{
  item: 'span'
}])
  .then(result => {
    console.log(result);
  });
Advanced Features
Pagination Handling
Handling complex pagination and extracting data from all pages:
function paginate(articles, url) {
  console.log(articles);
  if (!url) return;
  x(url, {
    articles: x('.article', [{
      title: 'h2',
      summary: 'p.summary'
    }]),
    nextPage: '.next@href'
  })(function(err, obj) {
    if (err) throw err;
    paginate(obj.articles, obj.nextPage);
  });
}

x('https://example.com', {
  articles: x('.article', [{
    title: 'h2',
    summary: 'p.summary'
  }]),
  nextPage: '.next@href'
})(function(err, obj) {
  if (err) throw err;
  paginate(obj.articles, obj.nextPage);
});
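For larger crawls you usually don't want to accumulate every page's results in memory and log them; the function returned by x(...) can instead stream results to a file with .write(path) (or expose a readable stream with .stream()). A sketch, with placeholder URL, selectors, and output filename, wrapped in an uninvoked function:

```javascript
// Streaming-output sketch: defined but not called, so the file loads
// without network access and writes nothing.
function saveArticles() {
  const Xray = require('x-ray');
  const x = Xray();
  return x('https://example.com', '.article', [{
    title: 'h2',
    summary: 'p.summary'
  }])
    .paginate('.next@href')
    .limit(5)
    .write('results.json'); // streams the scraped objects to a JSON file
}
```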
Crawling Nested URLs
The example below collects nested structures from a single page; to crawl into linked pages, you pass another x-ray instance rooted at a link attribute such as a@href.
x('https://example.com', {
  categories: x('.category', [{
    name: '.name',
    subcategories: x('.subcategory', [{
      name: '.name'
    }])
  }])
})
  .then(result => {
    console.log(result);
  });
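To actually follow links into sub-pages, compose x-ray instances: when a field's value is another x(...) rooted at a link attribute, x-ray fetches that URL and applies the inner selector to the fetched page. A sketch with hypothetical selectors (.category, h1, p.description) on the placeholder URL, wrapped in an uninvoked function:

```javascript
// URL-composition sketch: defined but not called, so loading this file
// performs no crawling.
function crawlCategoryPages() {
  const Xray = require('x-ray');
  const x = Xray();
  return x('https://example.com', '.category', [{
    name: '.name',
    // x('a@href', ...) follows each category's link and scrapes that page
    details: x('a@href', {
      heading: 'h1',
      description: 'p.description'
    })
  }]);
}
```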
Building an App with X-Ray Crawler
Let’s build a basic Node.js app using the APIs introduced above:
const express = require('express');
const Xray = require('x-ray');
const app = express();
const x = Xray();
app.get('/scrape', (req, res) => {
  x('https://example.com', '.article', [{
    title: 'h2',
    summary: 'p.summary'
  }])
    .then(result => {
      res.json(result);
    })
    .catch(err => {
      res.status(500).json({ error: err.message });
    });
});
app.listen(3000, () => {
console.log('Server started on port 3000');
});
This application sets up an Express server which scrapes data from example.com and returns it as a JSON response.
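In a real app you will also want to be polite to the site you scrape. The crawler instance exposes chainable settings for concurrency, throttling, delays, and timeouts; the specific values below are illustrative assumptions, and the function is never invoked here:

```javascript
// Politeness-settings sketch: defined but not called, so loading this
// file makes no requests. The values chosen are examples, not defaults.
function makePoliteScraper() {
  const Xray = require('x-ray');
  return Xray()
    .concurrency(2)      // at most 2 requests in flight at once
    .throttle(10, 1000)  // at most 10 requests per 1000 ms
    .delay(1000, 5000)   // wait 1-5 seconds between requests
    .timeout(10000);     // abandon responses slower than 10 s
}
```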
By utilizing X-Ray Crawler’s powerful features, developers can efficiently extract and process web data for various applications. Happy scraping!