Ruby Q & A

 

How to parse XML and HTML in Ruby?

In Ruby, parsing XML and HTML documents is made straightforward with the help of libraries such as Nokogiri and REXML. These libraries provide tools for navigating and extracting data from structured documents. Here’s how you can parse XML and HTML in Ruby using the Nokogiri library:

 

  1. Install Nokogiri:

   Before you can use Nokogiri, you need to install it. You can do this via the RubyGems package manager:

```bash
gem install nokogiri
```

 

  2. Require Nokogiri:

In your Ruby script or program, require the Nokogiri library:

```ruby
require 'nokogiri'
```

 

  3. Parsing XML:

To parse an XML document, call `Nokogiri::XML(...)`, a convenience method that returns a `Nokogiri::XML::Document`. Here’s an example of how to parse an XML string:

```ruby
xml_string = '<root><element>Content</element></root>'
doc = Nokogiri::XML(xml_string)
```

You can then navigate the parsed document using CSS or XPath selectors to extract data:

```ruby
content = doc.at_css('element').text
```

 

  4. Parsing HTML:

Nokogiri can also handle HTML documents. To parse an HTML document, call `Nokogiri::HTML(...)`, which returns a `Nokogiri::HTML::Document`:

```ruby
require 'nokogiri'

html_string = '<html><body><h1>Hello, World!</h1></body></html>'
doc = Nokogiri::HTML(html_string)
```

Just like with XML, you can use CSS or XPath selectors to access and manipulate elements within the HTML document:

```ruby
heading_text = doc.at_css('h1').text
```

 

  5. Parsing from URLs:

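Nokogiri can also parse XML or HTML fetched from a URL. Ruby’s standard `open-uri` library provides `URI.open`, which returns an IO-like object that Nokogiri can read from:

```ruby
require 'nokogiri'
require 'open-uri'

url = 'https://example.com/some-page'
# Use URI.open: on Ruby 3.0+, Kernel#open no longer accepts URLs.
doc = Nokogiri::HTML(URI.open(url))
```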

 

  6. Handling Errors:

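When parsing documents from external sources, it’s essential to handle potential exceptions, such as network errors or invalid HTML/XML. Use `begin`…`rescue` blocks to capture and manage any exceptions that may occur during fetching or parsing. Note that Nokogiri recovers from malformed markup by default and records the problems in `doc.errors`; it raises `Nokogiri::XML::SyntaxError` only when strict parsing is enabled.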

Nokogiri is a powerful and flexible library for parsing XML and HTML in Ruby. It provides robust support for navigating, querying, and manipulating structured documents, making it a popular choice for web scraping, data extraction, and working with XML-based APIs. If you need to work extensively with HTML and CSS selectors, Nokogiri’s ability to handle both XML and HTML is particularly valuable.
