PHP image crawler/downloader

Hi All! Happy new year 2015! This is the first post of 2015. Yeaay!

If you are looking for a simple PHP script that scrapes all the <img> tags on a web page and downloads the images they point to, this post offers one option you can use.

First of all, if you want to scrape a web page, you need to know which part of the page, or which HTML tag, you want to extract. There are many ways to do that. You can use a basic preg_match call to pick out (in this case) the <img> tags, or you can use a third-party library. In this post I am using the phpQuery library.
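To give a feel for the regular expression route, here is a minimal sketch using preg_match_all (the variant of preg_match that returns every match). The sample markup and its URLs are made up for illustration, and a regex like this is brittle on real-world HTML, so treat it as a quick demonstration rather than a robust parser.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<div class="shop-header">'
      . '<img src="http://example.com/logo.jpg" alt="Logo">'
      . "<img alt='Banner' src='/banner.png'>"
      . '</div>';

// Capture the src attribute of every <img> tag.
// Group 1 is the opening quote (single or double), group 2 is the URL.
preg_match_all('/<img\b[^>]*?\bsrc\s*=\s*(["\'])(.*?)\1/i', $html, $matches);

// $matches[2] holds the captured URLs in document order.
$sources = $matches[2];
print_r($sources);
```

With this sample, $sources ends up holding the two src values in the order they appear. Relative paths such as /banner.png would still need to be resolved against the page URL before downloading.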

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on the jQuery JavaScript library. The library is written in PHP5 and also provides a Command Line Interface (CLI).

If you already have experience with jQuery and are not a fan of preg_match, this library will suit you. In this example, I am scraping shop logos from two big Indonesian marketplace websites, tokopedia.com and bukalapak.com. Enough talking, here is the code.
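[code]
<?php
// Assumption: phpQuery sits next to this script; adjust the path to your setup.
require 'phpQuery/phpQuery.php';

// Shop ID => shop page URL. The ID is used in the saved file name.
$urls_to_crawl = array(
    // This entry's key is missing from the listing; without one, PHP assigns key 0.
    "http://www.tokopedia.com/bbmurahshop",
    "70" => "http://www.tokopedia.com/every-thing4u",
    "75" => "http://www.tokopedia.com/tridente",
    "9385" => "https://www.tokopedia.com/aserabure",
    "9386" => "https://www.tokopedia.com/sandro",
    "9387" => "https://www.tokopedia.com/nirwana506",
    "9388" => "https://www.tokopedia.com/nirwanaelet",
    "9389" => "https://www.tokopedia.com/tokodachi",
    "9390" => "https://www.tokopedia.com/dereryyy",
    "9392" => "https://www.tokopedia.com/cuterhuye",
);

// Crawling many pages can take a while, so remove the execution time limit.
set_time_limit(0);

foreach ($urls_to_crawl as $key => $value) {
    // Load the shop page into phpQuery.
    $doc = phpQuery::newDocument(file_get_contents($value));

    // Try the gold shop logo container first.
    $image_url = $doc['.shop-gold-b_logo']->find('img')->attr('src');

    // Fall back to the regular shop header.
    if (empty($image_url)) {
        $image_url = $doc['.shop-header']->find('img')->attr('src');
    }

    // Save the logo as images_crawled/shop_<ID>.jpg (the folder must already exist).
    $img = 'images_crawled/shop_' . $key . '.jpg';
    file_put_contents($img, file_get_contents($image_url));

    echo 'Downloaded : ' . $img . PHP_EOL;
}
[/code]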
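One thing the loop above does not do is check whether a logo was actually found and fetched, so a shop page without a matching <img> or a failed request will produce warnings and an empty file. As a rough sketch of how that could be handled, here is a small helper. The name download_image() is my own invention for this example; it is not part of phpQuery.

```php
<?php
/**
 * Fetch $url and save it to $path.
 * Returns true on success and false on any failure.
 * download_image() is a hypothetical helper, not a phpQuery function.
 */
function download_image($url, $path)
{
    // Nothing to fetch: neither selector matched an <img>.
    if (empty($url)) {
        return false;
    }

    // Create the target folder on first use.
    $dir = dirname($path);
    if (!is_dir($dir) && !mkdir($dir, 0755, true)) {
        return false;
    }

    // Suppress the warning and check the return value instead.
    $data = @file_get_contents($url);
    if ($data === false) {
        return false;
    }

    // file_put_contents returns the number of bytes written, or false.
    return file_put_contents($path, $data) !== false;
}
```

Inside the loop you could then replace the file_put_contents line with a call like download_image($image_url, $img), and print a "skipped" message instead of "Downloaded" whenever it returns false.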

You can find the complete script on GitHub:

https://github.com/hadiariawan/php-image-crawler

That is all. Hope this helps.
