I’m working on another idea, which I hope to release soon, that involves scraping websites using PHP and cURL.
I don’t want to give too much away before I release the website, so I won’t go into too much detail. What I can tell you is that it required me to fetch a lot of data from external websites using variables passed through from a form on my end.
I originally started out using a piece of Python software called Scrapy, which worked very well, but the logistics of using it and then either storing the data or displaying it on a webpage became too much of a hassle, so I opted for PHP and cURL instead.
For the PHP side I’m using CodeIgniter, a very easy and very speedy framework which is perfect for what I wanted to do.
The basic flow of how everything works is:
- The form is filled out and submitted
- Data from the form is sent to the external website in a cURL request
- The webpage content is then returned
- From there the data can be formatted and displayed accordingly
Doing this with PHP and cURL is a fairly straightforward process and I’ll show you how to go about it. The only real issue you may come across is that when forms come into play you need to make sure each and every form element is included in the call, hidden fields included.
$url = 'http://www.website.com/login.php';
$postdata = array('username' => "Jamie", 'password' => "password");
$ch = curl_init();
if ($ch) {
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);
    curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt'); // set cookie file to given file
    curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt'); // set same file as cookie jar
    $content = curl_exec($ch);
    $headers = curl_getinfo($ch);
    curl_close($ch);
    // Debug option
    // print_r($headers);
    if ($headers['http_code'] == 200) {
        echo $content;
    }
}
That’s the entire call, and it will return the HTML contents of website.com/login.php. I’ll go through the above code piece by piece and give a rundown of each of the different parts.
$url = 'http://www.website.com/login.php';
$postdata = array('username' => "Jamie", 'password' => "password");
Firstly, the url variable should be self-explanatory, and postdata is just a simple array containing the form elements that are required to log in (in this case a username and password).
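Since every form element has to be included, it can help to read the fields straight out of the login page rather than typing them by hand. Here is a minimal sketch of that idea. The helper name `collect_form_fields` and the sample markup (including the `token` field) are made up for illustration, and it only looks at `<input>` tags in the first form on the page, so selects, textareas and unchecked checkboxes would need extra handling.

```php
<?php
// Hypothetical helper (not part of the original code): collect every named
// <input> in the first <form> of a page, so hidden fields such as tokens
// are not left out of the POST data.
function collect_form_fields(string $html, array $overrides = array()): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so keep parser warnings quiet.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $fields = array();
    foreach ($xpath->query('(//form)[1]//input[@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    // Values we supply ourselves replace the defaults found in the markup.
    return array_replace($fields, $overrides);
}

// Made-up login form, standing in for the HTML of the real login page.
$html = '<form action="login.php" method="post">'
      . '<input type="hidden" name="token" value="abc123">'
      . '<input type="text" name="username">'
      . '<input type="password" name="password">'
      . '</form>';

$postdata = collect_form_fields($html, array('username' => 'Jamie', 'password' => 'password'));
print_r($postdata);
```

In practice you would first fetch the login page with a plain GET request, pass its HTML to the helper, and then use the resulting array as the postdata.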
$ch = curl_init();
if ($ch) {
Create a new cURL handle and, if all is well, continue on.
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt'); // set cookie file to given file
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt'); // set same file as cookie jar
These are all the cURL options I am going to use. You can view the rest of the available options over at the PHP website. The ones we are using are basically all to do with logging in. Storing the cookies and passing through the post data are the main ones to take note of.
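One detail about the post data is worth a quick sketch. When `CURLOPT_POSTFIELDS` is given an array, cURL sends the request as `multipart/form-data`, whereas a URL-encoded string is sent as `application/x-www-form-urlencoded`, which is what a plain HTML form submits by default. If a site ignores your fields, trying the string form may help. The helper name `encode_form_fields` and the sample password are my own, purely for illustration.

```php
<?php
// Hypothetical helper: turn the form array into a URL-encoded request body.
function encode_form_fields(array $fields): string
{
    // Pass the separator explicitly so the result does not depend on php.ini.
    return http_build_query($fields, '', '&');
}

$postdata = array('username' => 'Jamie', 'password' => 'p@ss word');
$body = encode_form_fields($postdata);
echo $body, PHP_EOL;

// In the request, the string would replace the array:
// curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
```

Special characters are escaped for you, so there is no need to call urlencode() on each value yourself.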
$content = curl_exec($ch);
$headers = curl_getinfo($ch);
curl_close($ch);
Last but not least, execute the cURL request with the options we set, store the returned content in a variable, and grab the transfer info before finally closing the cURL handle.
As you will have noticed in the original code, I use the headers variable as a debug option (despite the name, it holds the transfer info returned by curl_getinfo rather than the raw response headers). This is very handy, in particular the http_code entry. If you ever find that something isn’t working, double check that you are getting a 200 code and not a 4xx or 5xx error such as 400 or 501.
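To make that check a little easier to read while debugging, here is a small sketch that turns the status code into a hint. The function name `describe_status` and its wording are my own invention. One thing I am sure of is that `http_code` comes back as 0 when no HTTP response was received at all, which is worth handling separately.

```php
<?php
// Hypothetical helper: map an HTTP status code to a short debugging hint.
function describe_status(int $code): string
{
    if ($code === 0) {
        // curl_getinfo() reports 0 when the request never got a response.
        return 'no HTTP response (connection or DNS failure)';
    }
    if ($code >= 200 && $code < 300) {
        return 'success';
    }
    if ($code >= 300 && $code < 400) {
        return 'redirect';
    }
    if ($code >= 400 && $code < 500) {
        return 'client error';
    }
    if ($code >= 500 && $code < 600) {
        return 'server error';
    }
    return 'unexpected status';
}

foreach (array(200, 302, 404, 501) as $code) {
    echo $code, ': ', describe_status($code), PHP_EOL;
}
```

It is also worth remembering that curl_exec() returns false when the transfer itself fails. In that case curl_error($ch) gives the reason, as long as you read it before calling curl_close().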
From there you can grab and scrape the content and data to your heart’s content. A great thing is that, now that you have received and stored the cookies from logging in, you have access to ‘authenticated only’ sections of the website. So you can go away and run more cURL requests against those areas of the website, as long as each request points at the same cookie file.
I was about to end it there, but one other important thing I have come across is that some forms will actually redirect you to a different part of the website after you submit them. It’s fairly easy to identify because you will get an http_code of 302, and the great thing is that the array returned by curl_getinfo also contains a redirect_url. All you need to do is make another cURL request using the redirect URL you received.
// original curl request up here
if ($headers['http_code'] == 302) {
    // Request the redirect target with a plain GET, reusing the same cookies
    $ch = curl_init();
    if ($ch) {
        curl_setopt($ch, CURLOPT_URL, $headers['redirect_url']);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt'); // set cookie file to given file
        curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt'); // set same file as cookie jar
        $content = curl_exec($ch);
        curl_close($ch);
    }
}
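As an alternative to the manual second request, cURL can follow redirects itself through the `CURLOPT_FOLLOWLOCATION` option. This is only a sketch of that swapped-in approach, not what I use above, and the function names `is_redirect_status` and `fetch_following_redirects` are made up for illustration. By default cURL switches a POST to a GET when it follows a 301, 302 or 303, which matches what the manual version does.

```php
<?php
// Hypothetical helper: true for the status codes that normally carry a
// Location header pointing at the next URL.
function is_redirect_status(int $code): bool
{
    return in_array($code, array(301, 302, 303, 307, 308), true);
}

// Sketch: let cURL chase redirects on its own, up to $maxHops of them.
// Returns the final page body, or false if the transfer failed.
function fetch_following_redirects(string $url, string $cookieFile, int $maxHops = 5)
{
    $ch = curl_init($url);
    if ($ch === false) {
        return false;
    }
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 15,
        CURLOPT_FOLLOWLOCATION => true,      // follow Location headers
        CURLOPT_MAXREDIRS      => $maxHops,  // guard against redirect loops
        CURLOPT_COOKIEFILE     => $cookieFile,
        CURLOPT_COOKIEJAR      => $cookieFile,
    ));
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}

var_dump(is_redirect_status(302)); // bool(true)
var_dump(is_redirect_status(200)); // bool(false)
```

The manual approach still has its place, since it lets you inspect each hop before deciding whether to follow it.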
Once you have the content you require, you then need to get at the specific data or text you’re after. I’m going to show you how to do this using XPath in another post, so keep your eyes out for that.
Like always, if you have any comments or questions feel free to post and I’ll do my best to answer ’em.
Follow me on Twitter @JAGracie