How to parse XML and HTML in Ruby?
In Ruby, parsing XML and HTML documents is made straightforward with the help of libraries such as Nokogiri and REXML. These libraries provide tools for navigating and extracting data from structured documents. Here’s how you can parse XML and HTML in Ruby using the Nokogiri library:
- Install Nokogiri:
Before you can use Nokogiri, you need to install it. You can do this via the RubyGems package manager:
```bash
gem install nokogiri
```
- Require Nokogiri:
In your Ruby script or program, require the Nokogiri library:
```ruby
require 'nokogiri'
```
- Parsing XML:
To parse an XML document, call the `Nokogiri::XML` method, which returns a `Nokogiri::XML::Document`. Here’s an example of how to parse an XML string:
```ruby
xml_string = '<root><element>Content</element></root>'
doc = Nokogiri::XML(xml_string)
```
You can then navigate the parsed document using CSS or XPath selectors to extract data:
```ruby
# at_css returns the first matching node, or nil if nothing matches
content = doc.at_css('element').text
```
- Parsing HTML:
Nokogiri can also handle HTML documents. To parse an HTML document, call the `Nokogiri::HTML` method:
```ruby
require 'nokogiri'

html_string = '<html><body><h1>Hello, World!</h1></body></html>'
doc = Nokogiri::HTML(html_string)
```
Just like with XML, you can use CSS or XPath selectors to access and manipulate elements within the HTML document:
```ruby
heading_text = doc.at_css('h1').text
```
- Parsing from URLs:
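Nokogiri does not fetch documents itself, but it can parse any IO-like object, so you can combine it with the standard library’s `open-uri` to parse a page fetched from a URL. Use `URI.open` rather than the bare `open`, which no longer accepts URLs as of Ruby 3.0:
```ruby
require 'nokogiri'
require 'open-uri'

url = 'https://example.com/some-page'
# URI.open returns an IO-like object that Nokogiri reads from
doc = Nokogiri::HTML(URI.open(url))
```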
- Handling Errors:
When parsing documents from external sources, it’s essential to handle potential exceptions, such as network errors or invalid HTML/XML. Use `begin`…`rescue` blocks to capture and manage any exceptions that may occur during parsing.
Nokogiri is a powerful and flexible library for parsing XML and HTML in Ruby. It provides robust support for navigating, querying, and manipulating structured documents, making it a popular choice for web scraping, data extraction, and working with XML-based APIs. Because the same CSS and XPath query interface works for both XML and HTML documents, it is particularly useful when a project has to handle both formats.