Building Web Scrapers with Elixir and Nokogiri
In the modern age of data-driven decision making, web scraping has become an essential technique for gathering information from websites. Whether it’s for competitive analysis, market research, or extracting valuable insights, web scraping can significantly streamline data acquisition processes. Elixir, a robust and scalable programming language built on the Erlang VM, provides an excellent foundation for building web scrapers. When combined with Nokogiri, a powerful HTML parsing library, you have a winning duo that can make web scraping tasks efficient and straightforward. In this blog post, we’ll dive into the world of web scraping with Elixir and Nokogiri and explore how to create effective web scrapers to extract data from websites.
Prerequisites
Before we begin, it’s essential to have a basic understanding of Elixir programming and HTML structure. Familiarity with HTTP requests and response handling will also be beneficial. Ensure you have Elixir and its package manager Hex installed on your system to follow along with the code samples.
Setting up the Project
Let’s start by creating a new Elixir project to house our web scraper. Open your terminal and run the following commands:
```bash
$ mix new web_scraper
$ cd web_scraper
```
Now that we have our project set up, let’s add Nokogiri to the project’s dependencies in the mix.exs file, along with HTTPoison, which we’ll use to make HTTP requests:
```elixir
defp deps do
  [
    {:httpoison, "~> 2.0"},
    {:nokogiri, "~> 0.18.0"}
  ]
end
```
Next, fetch the dependencies by running:
```bash
$ mix deps.get
```
Understanding Nokogiri
Nokogiri is a popular HTML and XML parsing library for Elixir, inspired by the Nokogiri gem in Ruby. It allows us to parse and manipulate HTML and XML documents with ease. Nokogiri’s intuitive API provides various methods to extract data based on CSS or XPath selectors.
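To get a feel for the API before wiring it into a project, here is a minimal iex sketch that parses a small HTML snippet and pulls text out with a CSS selector. It assumes the Nokogiri-style functions used throughout this post (parse/1, find/2, and text/1); check the documentation of the version you install, as names may differ slightly.

```elixir
# A minimal sketch, assuming the parse/1, find/2, and text/1 functions
# described in this post are available after `mix deps.get`.
html = """
<ul class="fruits">
  <li>Apple</li>
  <li>Banana</li>
</ul>
"""

{:ok, document} = Nokogiri.parse(html)

document
|> Nokogiri.find(".fruits li")   # select nodes with a CSS selector
|> Enum.map(&Nokogiri.text/1)    # => ["Apple", "Banana"]
```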
To leverage Nokogiri, let’s first create a module that will handle web scraping. Create a new file named scraper.ex in the lib directory and define the module as follows:
```elixir
defmodule WebScraper.Scraper do
  require Nokogiri

  def scrape(url) do
    case HTTPoison.get(url) do
      {:ok, %HTTPoison.Response{body: body}} ->
        {:ok, document} = Nokogiri.parse(body)
        # Your scraping logic goes here
        {:ok, document}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```
In this module, we first reference the Nokogiri library with the require macro (strictly only needed when calling macros, but it makes the dependency explicit at compile time). Inside the scrape/1 function, we use HTTPoison to make an HTTP GET request to the specified URL. If the request succeeds, we parse the HTML body with Nokogiri’s parse/1 function, which returns a document representation of the HTML that we can query.
Scraping Data with Nokogiri
With our basic scraper module set up, let’s dive into extracting data from a website. For this example, we’ll scrape the latest news headlines from a hypothetical news website. Assume the website has the following HTML structure:
```html
<!DOCTYPE html>
<html>
  <head>
    <title>News Website</title>
  </head>
  <body>
    <div class="news">
      <article>
        <h2>Breaking News 1</h2>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
      </article>
      <article>
        <h2>Breaking News 2</h2>
        <p>Nullam suscipit magna nec odio euismod, vel tincidunt elit feugiat.</p>
      </article>
      <!-- More articles -->
    </div>
  </body>
</html>
```
Our goal is to extract the headlines and accompanying news snippets from the website. To do this, we’ll update the WebScraper.Scraper module:
```elixir
defmodule WebScraper.Scraper do
  # ... (previous code)

  def scrape(url) do
    case HTTPoison.get(url) do
      {:ok, %HTTPoison.Response{body: body}} ->
        {:ok, document} = Nokogiri.parse(body)
        headlines = extract_headlines(document)
        {:ok, headlines}

      {:error, reason} ->
        {:error, reason}
    end
  end

  # Find every <article> inside the div with class "news"
  # and turn each one into a {title, snippet} tuple.
  defp extract_headlines(document) do
    document
    |> Nokogiri.find(".news article")
    |> Enum.map(&extract_headline/1)
  end

  defp extract_headline(article) do
    title = article |> Nokogiri.find("h2") |> hd() |> Nokogiri.text()
    snippet = article |> Nokogiri.find("p") |> hd() |> Nokogiri.text()
    {title, snippet}
  end
end
```
Let’s understand the changes we made to the WebScraper.Scraper module. We introduced two new private functions: extract_headlines/1 and extract_headline/1. The former function takes the parsed document and uses Nokogiri’s find/2 function to locate all the articles within the div with class “news.” We then use Enum.map/2 to apply the extract_headline/1 function to each article.
The extract_headline/1 function, in turn, takes an article node and uses Nokogiri’s find/2 function again to extract the headline (h2) and the news snippet (p). The extracted data is returned as a tuple of {title, snippet}.
With these changes in place, we can now call our WebScraper.Scraper.scrape/1 function and obtain the latest headlines from our hypothetical news website:
```elixir
url = "https://www.example-news.com"

case WebScraper.Scraper.scrape(url) do
  {:ok, headlines} ->
    IO.inspect(headlines)

  {:error, reason} ->
    IO.puts("Error: #{inspect(reason)}")
end
```
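For the sample page above, the output printed by IO.inspect/1 would look roughly like this:

```elixir
[
  {"Breaking News 1", "Lorem ipsum dolor sit amet, consectetur adipiscing elit."},
  {"Breaking News 2", "Nullam suscipit magna nec odio euismod, vel tincidunt elit feugiat."}
]
```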
Handling Errors and Timeouts
While web scraping can be powerful, it’s essential to handle potential errors gracefully. Websites might have intermittent downtime or impose rate limits on requests, leading to timeouts or HTTP errors. To address these scenarios, we can add timeout and retry mechanisms to our scraper. Let’s update our WebScraper.Scraper module:
```elixir
defmodule WebScraper.Scraper do
  # ... (previous code)

  def scrape(url, retries \\ 3) do
    # Options are the third argument; the second is the list of request headers.
    case HTTPoison.get(url, [], timeout: 10_000, recv_timeout: 10_000) do
      {:ok, %HTTPoison.Response{body: body}} ->
        {:ok, document} = Nokogiri.parse(body)
        headlines = extract_headlines(document)
        {:ok, headlines}

      {:error, _reason} when retries > 0 ->
        scrape(url, retries - 1)

      {:error, reason} ->
        {:error, reason}
    end
  end

  # ... (previous code)
end
```
In the updated scrape/2 function, we introduced a retries parameter with a default value of 3. If the HTTP request fails, the function recursively retries it, decrementing the counter each time, until the retries are exhausted. We also pass timeout options as the third argument to HTTPoison.get/3 (the second argument is the list of request headers): timeout caps how long we wait to establish a connection and recv_timeout caps how long we wait for the response body, both set to 10,000 milliseconds, so an unresponsive website can’t stall the scraper indefinitely.
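If you want to be a little gentler on struggling sites, you can also wait before each retry and increase the delay with every attempt. Below is one possible way to do that as a standalone helper; the module name, delay schedule, and retry count are illustrative choices, not part of the scraper above.

```elixir
defmodule WebScraper.Backoff do
  @max_retries 3

  # Hypothetical helper: retries a zero-arity request function,
  # sleeping a little longer before each new attempt.
  def with_retries(fun, retries \\ @max_retries) do
    case fun.() do
      {:ok, _} = ok ->
        ok

      {:error, _reason} when retries > 0 ->
        # 1s before the first retry, 2s before the second, and so on.
        Process.sleep((@max_retries - retries + 1) * 1_000)
        with_retries(fun, retries - 1)

      {:error, reason} ->
        {:error, reason}
    end
  end
end

# Usage:
# WebScraper.Backoff.with_retries(fn -> WebScraper.Scraper.scrape(url) end)
```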
Conclusion
Congratulations! You’ve now learned how to build web scrapers using Elixir and Nokogiri. Armed with the power of Elixir’s concurrency model and Nokogiri’s HTML parsing capabilities, you can efficiently collect data from websites and extract valuable information for various use cases. Remember to be respectful of website policies and robots.txt guidelines when web scraping to ensure ethical and legal data acquisition practices.
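As a small taste of that concurrency, here is a sketch that scrapes several pages in parallel with Task.async_stream/3, capping how many requests run at once so the target site isn’t overwhelmed. The URL list and limits are placeholders:

```elixir
# Hypothetical list of pages to scrape in parallel.
urls = [
  "https://www.example-news.com/politics",
  "https://www.example-news.com/technology"
]

results =
  urls
  |> Task.async_stream(&WebScraper.Scraper.scrape/1,
    max_concurrency: 5,   # at most 5 requests in flight at a time
    timeout: 30_000       # give each task up to 30 seconds
  )
  |> Enum.map(fn {:ok, result} -> result end)

IO.inspect(results)
```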
In this blog post, we covered the basics of setting up an Elixir project, integrating Nokogiri as the parsing library, and implementing a web scraper capable of extracting data from websites. However, the world of web scraping is vast, and there’s much more to explore, such as handling pagination, interacting with websites using forms, and handling dynamically loaded content. So, feel free to further expand your knowledge and build more sophisticated web scrapers with Elixir and Nokogiri. Happy scraping!