So you want to learn how to satisfy those data cravings, and want to do it all inside of a single Laravel command? Then you’ve come to the right place!
Website scraping (in general terms) is the extraction of data from any given website. For this example we’ll be using the Funko POP! Vinyl website (including all collections such as Animation, Disney and Games) and scraping it for product information, but adapting the script to scrape any other part of the Funko website for product data will be a simple task.
Getting Set Up
To get started I’m going to assume you have a Laravel 5.4 project handy and already set up; if you don’t, you can follow the installation procedure here.
The next step is to install the Goutte package by running composer require weidner/goutte. We’ll then need to add the provider and alias to the config/app.php file.
// Provider
Weidner\Goutte\GoutteServiceProvider::class,
// Alias
'Goutte' => Weidner\Goutte\GoutteFacade::class,
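If you haven’t edited config/app.php before, these two entries slot into the existing providers and aliases arrays respectively. A rough sketch of where they sit:
// config/app.php
'providers' => [
    // ... existing framework and package providers ...
    Weidner\Goutte\GoutteServiceProvider::class,
],

'aliases' => [
    // ... existing aliases ...
    'Goutte' => Weidner\Goutte\GoutteFacade::class,
],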
First, run the command php artisan make:command ScrapeFunko and then open the generated ScrapeFunko.php file inside the app/Console/Commands directory. We’ll now write the code to begin scraping the POP! Vinyl collections for product data.
Next up we’ll update the $signature variable; its value is the command we’ll run in the terminal, php artisan scrape:funko.
protected $signature = 'scrape:funko';
protected $description = 'Funko POP! Vinyl Scraper';
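One thing to note: in Laravel 5.4, commands generated with make:command aren’t registered automatically, so the class needs adding to the $commands array in app/Console/Kernel.php. A minimal sketch:
// app/Console/Kernel.php
protected $commands = [
    \App\Console\Commands\ScrapeFunko::class,
];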
We’ll then add a property containing an array of collection slugs, taken from the URL of each collection page (https://funko.com/collections/pop-vinyl):
protected $collections = [
    'animation',
    'disney',
    'games',
    'heroes',
    'marvel',
    'monster-high',
    'movies',
    'pets',
    'rocks',
    'sports',
    'star-wars',
    'television',
    'the-vault',
    'the-vote',
    'ufc',
];
We’ll also update the handle function to the following:
public function handle()
{
    foreach ($this->collections as $collection) {
        $this->scrape($collection);
    }
}
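One small but easy-to-miss detail: because the class lives in the App\Console\Commands namespace, the Goutte alias we registered earlier needs importing at the top of the file (or referencing as \Goutte) before we call it in the next section. A sketch of the top of the file:
// app/Console/Commands/ScrapeFunko.php
namespace App\Console\Commands;

use Goutte;
use Illuminate\Console\Command;

class ScrapeFunko extends Command
{
    // properties and methods covered in this post
}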
Building the Scrape Function
Before we start on the scrape function, we’ll need to create a new environment variable by going to our .env file and adding the following line:
FUNKO_POP_URL=https://funko.com/collections/pop-vinyl
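We’ll read this value with env() inside the scrape function to keep the example simple. Just be aware that if you ever run php artisan config:cache, env() calls outside the config files return null, so a safer long-term option is to expose the URL through a config entry of your own, for example (a hypothetical entry, not part of the tutorial code):
// config/services.php
'funko' => [
    'pop_url' => env('FUNKO_POP_URL'),
],

// ...and read it anywhere with:
config('services.funko.pop_url');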
Next up we’ll write the scrape function. Below is the code we’ll end up with, followed by a description of what it does.
/**
 * Scrape product data for the specified collection.
 *
 * @param string $collection
 * @return boolean
 */
public function scrape($collection)
{
    $crawler = Goutte::request('GET', env('FUNKO_POP_URL').'/'.$collection);

    // Grab the last page number from the pagination footer, defaulting to a single page.
    $pages = ($crawler->filter('footer .pagination li')->count() > 0)
        ? (int) $crawler->filter('footer .pagination li:nth-last-child(2)')->text()
        : 1;

    for ($i = 1; $i <= $pages; $i++) {
        // The first page is already loaded; only request the later, paginated pages.
        if ($i > 1) {
            $crawler = Goutte::request('GET', env('FUNKO_POP_URL').'/'.$collection.'?page='.$i);
        }

        $crawler->filter('.product-item')->each(function ($node) {
            $sku = explode('#', $node->filter('.product-sku')->text())[1];
            $title = trim($node->filter('.title a')->text());

            print_r($sku.', '.$title);
        });
    }

    return true;
}
To start off, we send a GET request to the Funko POP! Vinyl URL followed by the current collection slug. This returns the page markup much as if you had visited the URL in your browser. The next step is to get a count of the pages (since a collection can span many pages) and loop through them; if the current page isn’t the first, we request the paginated page explicitly. We then loop through all of the .product-item containers on the page and, within each node, filter the .product-sku element to get the product SKU and take the link text as the product name.
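If the filter/each combination is new to you, here’s a self-contained sketch using the Symfony DomCrawler that Goutte is built on, run against some hypothetical markup that mirrors the selectors above (the real Funko HTML may differ):
use Symfony\Component\DomCrawler\Crawler;

// Hypothetical product tile matching the selectors used in the scrape function.
$html = '
    <div class="product-item">
        <span class="product-sku">Item #12345</span>
        <h3 class="title"><a href="/products/example">Example POP!</a></h3>
    </div>';

$crawler = new Crawler($html);

$crawler->filter('.product-item')->each(function (Crawler $node) {
    // "Item #12345" splits on "#" into ['Item ', '12345'], so index 1 is the SKU.
    $sku = explode('#', $node->filter('.product-sku')->text())[1];
    $title = trim($node->filter('.title a')->text());

    echo $sku.', '.$title.PHP_EOL; // 12345, Example POP!
});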
And that’s it! You now have a basic website scraper. Additionally, you could integrate Amazon Web Services (AWS) S3 as well as hook up a database to store the collections and product data. Check out our specialist AWS consulting services for more details.
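On the database side, for example, assuming you’ve created a products table and an Eloquent Product model (neither is part of this tutorial’s code), the print_r() call inside the each() callback could be swapped for something like:
use App\Product;

// Assumes `sku` is unique and `title` is fillable on the hypothetical Product model.
Product::updateOrCreate(
    ['sku' => $sku],
    ['title' => $title]
);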
You can get the full source code for the command here, and check out a repository with S3 and database integration here.