Mastering the Bleach Python Library: A Comprehensive Guide
Bleach is a lightweight HTML sanitization library for Python. Web applications commonly use it to neutralize malicious markup in user-generated content by allowing only an explicit allowlist of HTML tags, attributes, and URL protocols. Its small API lets you keep the formatting users are permitted to use while escaping or stripping everything else.
Features of Bleach
- Sanitizes unsafe HTML content with customizable rules
- Performs linkification (turns plain-text URLs into clickable links)
- Supports custom rules, such as callable attribute filters and linkify callbacks
- Lightweight and easy to integrate with Python projects
Installing Bleach
pip install bleach
Key APIs and Their Usage
1. Sanitize HTML
The bleach.clean function is the library's core API. It sanitizes an HTML string by escaping (or, with strip=True, removing) tags that are not on the allowlist and by dropping disallowed attributes and URL protocols, while preserving the allowed ones.
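import bleach

dirty_html = '<script>alert("XSS!")</script><b>bold text</b>'

# strip=False (the default) escapes disallowed tags instead of removing them
clean_html = bleach.clean(dirty_html, tags=['b'], attributes={}, protocols=[], strip=False)
print(clean_html)
# Output: &lt;script&gt;alert("XSS!")&lt;/script&gt;<b>bold text</b>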
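The effect of the strip flag is easiest to see by cleaning the same input both ways. The following is a minimal sketch that reuses the sample string from the example above; the variable names are illustrative only.

```python
import bleach

dirty_html = '<script>alert("XSS!")</script><b>bold text</b>'

# strip=False: the disallowed <script> tag is escaped and shown as text
escaped = bleach.clean(dirty_html, tags=['b'], attributes={}, strip=False)

# strip=True: the disallowed tag itself is removed, but its text content remains
stripped = bleach.clean(dirty_html, tags=['b'], attributes={}, strip=True)

print(escaped)
print(stripped)
```

In both cases the browser never sees a live script element. Note that strip=True removes the tag, not the text inside it, so escaping is often the clearer choice when you want users to see that their markup was rejected.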
2. Linkify URLs
The bleach.linkify function finds plain-text URLs and wraps them in HTML anchor tags. By default it also adds rel="nofollow" to each link it creates.
import bleach

text = "Check out https://github.com/ for more"
linkified_text = bleach.linkify(text)
print(linkified_text)
# Output: Check out <a href="https://github.com/" rel="nofollow">https://github.com/</a> for more
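Linkification can be adjusted through the callbacks and parse_email arguments of bleach.linkify. The sketch below assumes you want links to open in a new tab and want email addresses turned into mailto: links; the sample text and addresses are made up for illustration.

```python
import bleach
from bleach.callbacks import nofollow, target_blank

text = "Docs live at https://example.com/docs and support is at help@example.com"

# Each callback receives the link's attributes and may modify them.
# nofollow adds rel="nofollow"; target_blank adds target="_blank".
# parse_email=True additionally converts email addresses into mailto: links.
linked = bleach.linkify(
    text,
    callbacks=[nofollow, target_blank],
    parse_email=True,
)
print(linked)
```

Passing callbacks replaces the default list, so nofollow is included explicitly here to keep the default behavior.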
3. Customizing Allowed Tags and Attributes
Bleach allows you to define your own list of acceptable HTML tags and their attributes.
import bleach

allowed_tags = ['a', 'b', 'i', 'u']
allowed_attributes = {'a': ['href', 'title']}

sanitized = bleach.clean(
    '<div>not allowed</div><a href="example.com" onclick="hack()">safe link</a>',
    tags=allowed_tags,
    attributes=allowed_attributes,
)
print(sanitized)
# Output: &lt;div&gt;not allowed&lt;/div&gt;<a href="example.com">safe link</a>
# The onclick attribute is dropped; the <div> is escaped because strip defaults to False.
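Rather than writing an allowlist from scratch, you can start from the defaults that Bleach ships in bleach.sanitizer and extend them. This is a sketch under the assumption that you want to add p and span tags and permit a class attribute on every tag through the "*" wildcard key; the sample HTML is invented.

```python
import bleach
from bleach.sanitizer import ALLOWED_ATTRIBUTES, ALLOWED_TAGS

# Extend the library defaults instead of replacing them
tags = set(ALLOWED_TAGS) | {'p', 'span'}

# The '*' key applies to every allowed tag
attributes = {**ALLOWED_ATTRIBUTES, '*': ['class']}

html = '<p class="note" style="color:red">Hi <span onclick="x()">there</span></p>'
cleaned = bleach.clean(html, tags=tags, attributes=attributes)
print(cleaned)
```

The class attribute survives, while style and onclick are removed because neither appears in the attribute allowlist.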
4. Extending Bleach for Custom Filtering
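Bleach also supports custom filtering rules. The attribute allowlist can be a callable that receives the tag name, attribute name, and attribute value, and returns True to keep the attribute. Here’s an example that keeps only HTTPS links:
from bleach.sanitizer import Cleaner

def allow_https_href(tag, name, value):
    # Keep an attribute only if it is an href pointing at an https:// URL
    return name == 'href' and value.startswith('https://')

cleaner = Cleaner(tags=['a'], attributes={'a': allow_https_href})
result = cleaner.clean(
    '<a href="http://example.com">Bad Link</a>'
    '<a href="https://example.com">Good Link</a>'
)
print(result)
# Output: <a>Bad Link</a><a href="https://example.com">Good Link</a>
# The non-HTTPS href is dropped; the <a> tag and its text remain.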
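For the specific case of restricting URL schemes, the protocols argument of bleach.clean may be enough on its own. The sketch below assumes an allowlist of https and mailto; the sample links are illustrative.

```python
import bleach

html = (
    '<a href="javascript:alert(1)">Script</a>'
    '<a href="http://example.com">Plain HTTP</a>'
    '<a href="https://example.com">Secure</a>'
)

# Any href whose scheme is not in the protocols allowlist is removed
cleaned = bleach.clean(
    html,
    tags=['a'],
    attributes={'a': ['href']},
    protocols=['https', 'mailto'],
)
print(cleaned)
```

Only the attribute is removed, not the link text, so the first two anchors are left without an href.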
Building a Simple App Example with Bleach
Let’s create an example where Bleach sanitizes user input in a Flask web application.
from flask import Flask, request
import bleach

app = Flask(__name__)

@app.route('/submit', methods=['POST'])
def sanitize():
    # Read the submitted form field and allow only <b> and <i>, with no attributes
    user_input = request.form['content']
    clean_content = bleach.clean(user_input, tags=['b', 'i'], attributes={})
    return f"Sanitized Content: {clean_content}"

if __name__ == '__main__':
    # debug=True is intended for local development only
    app.run(debug=True)
With this simple /submit endpoint, users can submit HTML content, and the server sanitizes it before rendering or processing. Any markup other than <b> and <i>, including <script> tags, is escaped, so it is displayed as text rather than executed.
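One way to check this behavior without starting a server is Flask's built-in test client. The following sketch redefines the same /submit route so that it is self-contained and posts a made-up payload to it; the payload and variable names are assumptions for illustration.

```python
from flask import Flask, request
import bleach

app = Flask(__name__)

@app.route('/submit', methods=['POST'])
def sanitize():
    user_input = request.form['content']
    clean_content = bleach.clean(user_input, tags=['b', 'i'], attributes={})
    return f"Sanitized Content: {clean_content}"

# Exercise the endpoint in-process with a hypothetical malicious payload
with app.test_client() as client:
    response = client.post(
        '/submit',
        data={'content': '<script>alert(1)</script><b>hi</b>'},
    )
    body = response.get_data(as_text=True)

print(body)
```

The response keeps the allowed <b> tag and contains the script tag only in escaped form.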
Conclusion
Using the Bleach library, you can make your Python applications more resilient to security threats like XSS. Its flexible APIs provide control over how HTML content is sanitized, making it a great choice for web developers.