Comprehensive Guide to Pyquery for Web Scraping and Manipulation

Pyquery is a Python library that provides a jQuery-like API for querying and manipulating HTML and XML documents. It is built on lxml, so you can parse a document, select elements with CSS selectors, and modify them in a few lines, which makes it a popular choice for scraping data from web pages and automating markup edits. This article covers the key Pyquery APIs with short examples, then walks through a complete scraping application.

Getting Started with Pyquery

To begin, install Pyquery using pip:

  pip install pyquery

Now, let’s dive into its features and common use cases.

Parsing and Manipulating HTML

Pyquery makes it easy to load and manipulate HTML documents. You can either parse HTML strings or load content from a URL.

1. Loading HTML from a String

  from pyquery import PyQuery as pq

  html = "<div><h1>Hello, World!</h1><p>Pyquery is amazing!</p></div>"
  doc = pq(html)
  print(doc('h1').text())  # Output: Hello, World!

2. Loading HTML from a URL

  doc = pq(url='https://example.com')
  print(doc('title').text())

Useful Pyquery APIs with Examples

3. Selecting Elements

Use CSS selectors to extract elements:

  html = "<ul><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul>"
  doc = pq(html)
  items = doc('li')
  for item in items.items():  # .items() yields PyQuery objects, which support .text()
      print(item.text())

4. Manipulating Attributes

  html = "<a href='#'>Click here!</a>"
  doc = pq(html)
  doc('a').attr('href', 'https://updated-link.com')
  print(doc)  # Output: <a href="https://updated-link.com">Click here!</a>

5. Adding and Removing Classes

  html = "<div class='test'>Sample Text</div>"
  doc = pq(html)
  doc('div').add_class('new-class')
  doc('div').remove_class('test')
  print(doc)  # Output: <div class="new-class">Sample Text</div>

6. Appending and Prepending Content

  html = "<div></div>"
  doc = pq(html)
  doc('div').append('<p>Appended paragraph.</p>')
  doc('div').prepend('<p>Prepended paragraph.</p>')
  print(doc)  # Output: <div><p>Prepended paragraph.</p><p>Appended paragraph.</p></div>

7. Removing Elements

  html = "<div><p>Remove me!</p></div>"
  doc = pq(html)
  doc('p').remove()
  print(doc)  # Output: <div/>

8. Traversing the DOM

  html = "<div><p>Parent Text</p><span>Child Text</span></div>"
  doc = pq(html)
  parent = doc('p').parent()  # the enclosing <div>; the <span> is a sibling of <p>, not a child
  print(parent)  # Output: <div><p>Parent Text</p><span>Child Text</span></div>

9. Filtering Elements

  html = "<ul><li>Apple</li><li>Orange</li><li>Banana</li></ul>"
  doc = pq(html)
  filtered = doc('li:contains("Apple")')
  print(filtered)  # Output: <li>Apple</li>

Building a Simple Web Scraping Application

Let’s build a basic application to extract article titles from a blog homepage.

  from pyquery import PyQuery as pq

  def scrape_blog_titles(url):
      # Load the web page
      doc = pq(url=url)

      # Extract titles
      titles = []
      for item in doc('h2.post-title').items():
          titles.append(item.text())

      return titles

  # Example usage
  blog_url = 'https://example-blog.com'
  titles = scrape_blog_titles(blog_url)
  print("Blog Titles:", titles)

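In this example, pq(url=url) fetches and parses the blog’s home page, and the CSS selector h2.post-title matches every h2 element that has the class post-title. The items() method yields each match as a PyQuery object, so text() can be called on it directly. The URL and selector are placeholders; adjust them to the markup of the site you are scraping.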

Conclusion

Pyquery simplifies web scraping and HTML manipulation with its jQuery-inspired syntax. With the APIs covered here (selection, attribute and class manipulation, content insertion and removal, traversal, and filtering), you can prototype data-extraction scripts quickly and grow them into larger scraping applications. Its small, familiar API makes it approachable for developers of all levels.