Master Web Scraping Using CodeIgniter: A Comprehensive Guide

Web scraping, also known as web data extraction, is the process of fetching data from websites and storing it locally or using it for further analysis. Whether you’re building a product that relies on up-to-date information, gathering insights, or collecting data for research, web scraping can save a lot of manual work.

CodeIgniter is a lightweight PHP framework with a simple toolkit for building full-featured web applications. Because it is easy to extend with third-party libraries, you can add web scraping to a project with little setup.

In this post, we’ll look into how you can combine the power of CodeIgniter with web scraping to extract data from websites.

Prerequisites:

– Basic knowledge of CodeIgniter and PHP.

– CodeIgniter installed and configured on your development machine.

– A library for fetching and parsing HTML, such as `Simple HTML DOM`.

1. Setting up the Controller

Create a new controller, say `Scraper.php`, inside the `application/controllers` directory (this guide follows the CodeIgniter 3 directory layout).

```php
<?php
defined('BASEPATH') OR exit('No direct script access allowed');

class Scraper extends CI_Controller {

    public function index() {
        $this->load->view('scraper_view');
    }

}
```

2. Install Simple HTML DOM Parser

Download the Simple HTML DOM library from its official site (it is also available via Composer). If you downloaded it manually, place `simple_html_dom.php` in the `application/libraries` directory.

Include the file at the top of your controller, before the class definition:

```php
require_once APPPATH . 'libraries/simple_html_dom.php';
```

3. Building the Scraper Function

Let’s add a method to the `Scraper` controller that scrapes titles from a sample blog page:

```php
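public function fetch_titles() {
    // The target URL
    $url = 'https://sample-blog-website.com/posts';

    // file_get_html() comes from Simple HTML DOM and returns false on failure
    $html = file_get_html($url);
    if ($html === false) {
        echo json_encode(['error' => 'Could not fetch or parse the page']);
        return;
    }

    $titles = [];

    // Loop through each article tag on the website
    foreach ($html->find('article') as $article) {
        // find() returns null when nothing matches, so skip articles without an h2
        $heading = $article->find('h2', 0);
        if ($heading !== null) {
            $titles[] = trim($heading->plaintext);
        }
    }

    // Return the titles as JSON
    echo json_encode($titles);
}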
```
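
Because `file_get_html()` needs the live site, it can help to check the extraction logic against a fixed HTML string first. The following is a minimal sketch that uses PHP's built-in `DOMDocument` and `DOMXPath` classes rather than Simple HTML DOM. The function name `extract_titles` and the sample markup are hypothetical and only serve as an illustration.

```php
<?php
// Hypothetical helper: collects the first <h2> text of every <article>
// using PHP's built-in DOM extension instead of Simple HTML DOM.
function extract_titles(string $html): array
{
    // loadHTML() rejects empty input, so return early.
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $titles = [];
    foreach ($xpath->query('//article') as $article) {
        // Query relative to the current article; skip articles without an <h2>.
        $heading = $xpath->query('.//h2', $article)->item(0);
        if ($heading !== null) {
            $titles[] = trim($heading->textContent);
        }
    }
    return $titles;
}

// Sample markup standing in for a fetched page.
$sample = '<html><body>'
    . '<article><h2> First post </h2><p>Intro</p></article>'
    . '<article><p>No heading here</p></article>'
    . '<article><h2>Second post</h2></article>'
    . '</body></html>';

echo json_encode(extract_titles($sample)), PHP_EOL;
```

Note that `DOMDocument::loadHTML()` assumes ISO-8859-1 unless the markup declares a charset, so UTF-8 pages without such a declaration may need extra handling.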

4. Examples of What You Can Do

4.1. Fetching Product Prices

If you’re building a price comparison tool, you can scrape product prices from different websites.

```php
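public function fetch_product_price() {
    $url = 'https://sample-ecommerce-site.com/product/12345';
    $html = file_get_html($url);
    if ($html === false) {
        echo json_encode(['error' => 'Could not fetch or parse the page']);
        return;
    }

    // Assuming the price is in a span with a class "price";
    // find() returns null when the element is missing
    $element = $html->find('span.price', 0);
    $price = $element !== null ? trim($element->plaintext) : null;

    echo json_encode(['price' => $price]);
}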
```
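
The scraped price arrives as a string such as `$1,299.99`, which a comparison tool would need to turn into a number before comparing. Here is a minimal sketch of that step. The helper `parse_price` is hypothetical, and it assumes `,` is a thousands separator and `.` is the decimal separator, so other locales would need different rules.

```php
<?php
// Hypothetical helper: turns a scraped price string into a float, or null
// when no number is found. Assumes "," separates thousands and "." marks
// the decimals (e.g. "$1,299.99").
function parse_price(string $raw): ?float
{
    // Decode entities in case the scraped text still contains any (e.g. &nbsp;).
    $text = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Pick the first run of digits with optional separators.
    if (!preg_match('/\d[\d,]*(?:\.\d+)?/', $text, $match)) {
        return null;
    }

    // Drop thousands separators before casting.
    return (float) str_replace(',', '', $match[0]);
}

var_dump(parse_price(' $1,299.99 '));
var_dump(parse_price('Out of stock'));
```

Returning `null` instead of `0.0` for unparseable text keeps a missing price from being mistaken for a free product.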

4.2. Extracting Article Metadata

Perhaps you want to fetch metadata like author name, published date, or tags associated with an article.

```php
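public function fetch_article_metadata() {
    $url = 'https://sample-news-site.com/article/12345';
    $html = file_get_html($url);
    if ($html === false) {
        echo json_encode(['error' => 'Could not fetch or parse the page']);
        return;
    }

    // find() returns null when nothing matches, so check before reading
    $author_el = $html->find('span.author-name', 0);
    $date_el = $html->find('meta[property="published_date"]', 0);

    $author = $author_el !== null ? trim($author_el->plaintext) : null;
    $published_date = $date_el !== null ? $date_el->getAttribute('content') : null;

    $tags = [];
    foreach ($html->find('ul.tag-list li') as $tag) {
        $tags[] = trim($tag->plaintext);
    }

    echo json_encode([
        'author' => $author,
        'published_date' => $published_date,
        'tags' => $tags
    ]);
}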
```
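
Scraped metadata often carries stray whitespace, duplicate tags, or dates in varying formats. The sketch below shows one possible way to tidy it before storing or returning it. The helper `clean_metadata` is hypothetical, it assumes PHP 7.4 or later (for arrow functions), and it assumes the date is in a format that `strtotime()` understands.

```php
<?php
// Hypothetical helper: tidies scraped metadata before it is stored or returned.
function clean_metadata(string $author, string $publishedDate, array $tags): array
{
    // Collapse runs of whitespace that often surround scraped text.
    $normalize = static function (string $value): string {
        return trim(preg_replace('/\s+/u', ' ', $value) ?? $value);
    };

    // Try to parse the date; keep null when the format is not recognized.
    $timestamp = strtotime($publishedDate);
    $isoDate = $timestamp === false ? null : gmdate('Y-m-d', $timestamp);

    // Normalize tags, then drop empty strings and duplicates.
    $cleanTags = array_values(array_unique(array_filter(
        array_map($normalize, $tags),
        static fn (string $tag): bool => $tag !== ''
    )));

    return [
        'author' => $normalize($author),
        'published_date' => $isoDate,
        'tags' => $cleanTags,
    ];
}

// Sample values standing in for scraped data.
$cleaned = clean_metadata("  Jane \n Doe ", '2024-03-05T10:30:00Z', [' php ', 'php', '', 'scraping']);
echo json_encode($cleaned), PHP_EOL;
```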

5. Challenges and Considerations

Web scraping is powerful, but it comes with its own set of challenges:

– Website Structure Changes: If the website updates its design or structure, your selectors may stop matching and your scraping code might break. Always have error handling in place.

– Legal Concerns: Some sites prohibit scraping in their `robots.txt` file or terms of service. Always ensure you have the right to scrape a website.

– Rate Limiting: Making too many requests in a short period might get your IP blocked. Respect the `Crawl-delay` directive in `robots.txt` or use proxies.

– Data Accuracy: Web scraping does not guarantee 100% data accuracy. Always validate and clean your data.
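
To illustrate the rate-limiting point, here is a minimal sketch that reads a `Crawl-delay` value from the text of a `robots.txt` file. The helper `wildcard_crawl_delay` is hypothetical and deliberately simplified: it only looks at groups that apply to the wildcard user agent `*` and ignores every other directive. `Crawl-delay` is a non-standard directive, so many sites will not declare it, and the one-second fallback below is an assumption.

```php
<?php
// Hypothetical helper: returns the Crawl-delay (in seconds) that applies to
// the wildcard user agent "*", or null if robots.txt does not declare one.
function wildcard_crawl_delay(string $robotsTxt): ?float
{
    $inWildcardGroup = false;
    $prevWasAgent = false;

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        // Strip comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }

        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // Consecutive User-agent lines belong to the same group.
            $inWildcardGroup = ($prevWasAgent && $inWildcardGroup) || $value === '*';
            $prevWasAgent = true;
            continue;
        }
        $prevWasAgent = false;

        if ($field === 'crawl-delay' && $inWildcardGroup && is_numeric($value)) {
            return (float) $value;
        }
    }
    return null;
}

// Sample robots.txt content standing in for a fetched file.
$robots = "User-agent: *\nDisallow: /admin\nCrawl-delay: 2 # seconds\n";
$delay = wildcard_crawl_delay($robots) ?? 1.0; // assumed fallback of one second
echo "Waiting {$delay}s between requests", PHP_EOL;

// Inside a scraping loop, pause before each request:
// usleep((int) round($delay * 1000000));
```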

Conclusion

CodeIgniter, combined with the power of Simple HTML DOM or similar libraries, provides a robust environment for web scraping tasks. While it’s a handy skill for a developer, always ensure you’re scraping responsibly and ethically. Happy coding!