Saturday, November 7, 2009

Scrape data using curl in PHP

Curl Introduction



curl is a client URL library, and PHP supports it through libcurl. To enable libcurl support when building PHP from source, add --with-curl=[location of curl libraries] to the configure command before compiling. The curl package must be installed before PHP is built. curl covers most of what you need when connecting to remote web servers, including GET and POST form submission, SSL support, HTTP authentication, and session and cookie handling.
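
Before going further it can help to confirm that your PHP build actually has curl support. The snippet below is a minimal sketch of one way to check; the function name curlAvailable() is made up for this example, while extension_loaded(), function_exists() and curl_version() are standard PHP functions.

```php
<?php
// Hypothetical helper (not part of PHP): true when the curl extension is usable.
function curlAvailable(): bool
{
    return extension_loaded('curl') && function_exists('curl_init');
}

if (curlAvailable()) {
    // curl_version() returns an array describing the linked libcurl.
    $info = curl_version();
    echo 'libcurl version: ', $info['version'], "\n";
    echo 'SSL library: ', $info['ssl_version'], "\n";
} else {
    echo "The curl extension is not loaded.\n";
}
```

If the check fails, PHP was probably built without --with-curl or the extension is not enabled in php.ini.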



To start a curl session, use the curl_init() function. Options for the session are set with the curl_setopt() function. Once the options are set, execute the request with the curl_exec() function.


Example:


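<?php

// Start a new curl session and keep its handle in $ch.
$ch = curl_init();

// The URL to request.
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com');

// Include the server's response headers in the output.
curl_setopt($ch, CURLOPT_HEADER, 1);

// Return the response as a string instead of printing it.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// Run the request; curl_exec() needs the session handle.
$data = curl_exec($ch);

curl_close($ch);

?>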



In the previous example we first store the handle of the curl session in $ch. Then with curl_setopt() we set the URL of the request to http://www.example.com. The CURLOPT_HEADER option controls whether the server's response headers are included in the output. By default curl prints the response straight to the browser as the script is executed. To counter this we enable the CURLOPT_RETURNTRANSFER option. Now when we call curl_exec($ch), the data returned from the remote server is stored in the $data variable instead of being passed to the browser.
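
Because CURLOPT_HEADER is enabled, $data holds the headers and the page body in one string. The sketch below shows one possible way to separate them and to catch a failed request. The function names splitResponse() and fetchWithHeaders() are invented for this example; curl_error(), curl_getinfo() and CURLINFO_HEADER_SIZE are real parts of the PHP curl extension.

```php
<?php
// Hypothetical helper: split a raw response (fetched with CURLOPT_HEADER
// enabled) into headers and body, given the header size in bytes.
function splitResponse(string $raw, int $headerSize): array
{
    return [
        'headers' => substr($raw, 0, $headerSize),
        'body'    => substr($raw, $headerSize),
    ];
}

// Hypothetical wrapper: fetch a URL and return its headers and body.
function fetchWithHeaders(string $url): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

    $raw = curl_exec($ch);
    if ($raw === false) {
        // curl_exec() returns false on failure; curl_error() explains why.
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException('curl request failed: ' . $error);
    }

    // CURLINFO_HEADER_SIZE is the total size of the received headers.
    $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    curl_close($ch);

    return splitResponse($raw, $headerSize);
}
```

A call such as fetchWithHeaders('http://www.example.com') would then give you $result['headers'] and $result['body'] separately.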



Curl and form data



Okay, now that we can pull static pages from remote servers, let's move on to posting information into web forms automatically. By default curl sends requests with the GET method. In the following example we'll send a text message to our cell phone via a web form. Our example web form requires its data via POST and contains the fields pNUMBER, MESSAGE, and SUBMIT.



<?php

$phoneNumber = '4045551111';
$message = 'This message was generated by curl and php';

// Build the URL-encoded POST body by hand.
$curlPost = 'pNUMBER=' . urlencode($phoneNumber) . '&MESSAGE=' . urlencode($message) . '&SUBMIT=Send';

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/sendSMS.php');
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// Send the request with POST and attach the form data.
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $curlPost);

// Run the request; curl_exec() needs the session handle.
$data = curl_exec($ch);

curl_close($ch);

?>



Here we put the phone number and message for the form into variables. The $curlPost variable stores the POST data curl will send. When building $curlPost, be sure to urlencode each value before passing the string to curl_setopt(). CURLOPT_POST tells curl to submit the form with the POST method. CURLOPT_POSTFIELDS holds the POST data itself.
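
Building the POST string by hand gets error prone as the number of fields grows. As a sketch of an alternative, PHP's built-in http_build_query() can do the encoding and joining for you. The wrapper name buildPostBody() is made up for this example; the field names are the ones from the form above.

```php
<?php
// Hypothetical helper: build a URL-encoded POST body from an array of fields.
function buildPostBody(array $fields): string
{
    // http_build_query() URL-encodes each key and value and joins the
    // pairs with the separator given as the third argument.
    return http_build_query($fields, '', '&');
}

$curlPost = buildPostBody([
    'pNUMBER' => '4045551111',
    'MESSAGE' => 'This message was generated by curl and php',
    'SUBMIT'  => 'Send',
]);

echo $curlPost, "\n";
// pNUMBER=4045551111&MESSAGE=This+message+was+generated+by+curl+and+php&SUBMIT=Send
```

The resulting string can be passed to CURLOPT_POSTFIELDS exactly like the hand-built one. Note that CURLOPT_POSTFIELDS also accepts an array directly, but curl then sends the data as multipart/form-data rather than as a URL-encoded string.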


Curl and proxies



As with any full-featured browser, curl supports proxies. A proxy server sits between the requesting client and the web server. Proxies are used for a variety of reasons, from companies restricting web access to people wanting to appear anonymous.



There are a few curl options to set when using a proxy. Set the proxy address with the option CURLOPT_PROXY; this alone is enough to route requests through the proxy. The option CURLOPT_HTTPPROXYTUNNEL additionally tells curl to tunnel through an HTTP proxy instead of asking the proxy to fetch the page on its behalf. If the proxy requires authentication, use the option CURLOPT_PROXYUSERPWD, which expects a string in the format user:password. HTTP proxies are the default; to use a SOCKS proxy, set the CURLOPT_PROXYTYPE option.


Example:


<?php

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, 'http://www.example.com');
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// Tunnel through the HTTP proxy.
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);

// Proxy address and credentials (placeholders).
curl_setopt($ch, CURLOPT_PROXY, 'fakeproxy.com:1080');
curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'user:password');

// Run the request; curl_exec() needs the session handle.
$data = curl_exec($ch);

curl_close($ch);

?>
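
The example above uses an HTTP proxy. For a SOCKS proxy, a rough sketch could look like the following. The helper name socksProxyOptions() is invented for this example and the proxy address and credentials are the same placeholders as above; CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5 and curl_setopt_array() come from the PHP curl extension.

```php
<?php
// Hypothetical helper: curl options for sending requests through a SOCKS5 proxy.
function socksProxyOptions(string $proxy, ?string $userPwd = null): array
{
    $options = [
        CURLOPT_PROXY     => $proxy,
        CURLOPT_PROXYTYPE => CURLPROXY_SOCKS5,
    ];
    if ($userPwd !== null) {
        // Only needed when the proxy asks for authentication.
        $options[CURLOPT_PROXYUSERPWD] = $userPwd;
    }
    return $options;
}

$ch = curl_init('http://www.example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// curl_setopt_array() applies several options in one call.
curl_setopt_array($ch, socksProxyOptions('fakeproxy.com:1080', 'user:password'));

// $data = curl_exec($ch); // uncomment once a real proxy address is filled in
curl_close($ch);
```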


Using SSL and cookies in curl


SSL



To use SSL, simply change http:// to https:// in the CURLOPT_URL option. A common option when using SSL is CURLOPT_SSL_VERIFYHOST: when set to 1 it verifies that a common name exists in the certificate, and when set to 2 it also verifies that the common name matches the host name of the server. Newer versions of libcurl no longer support the value 1, so 2 is the value to use. CURLOPT_SSLVERSION can be used to force a particular SSL version such as 2 or 3; this is normally negotiated automatically by curl.
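
As a small sketch of an HTTPS request with host verification switched on, something like the following could be used. The helper name sslOptions() is made up for this example. CURLOPT_SSL_VERIFYPEER and CURLOPT_CAINFO are not discussed above; they are real curl options that control certificate checking and the CA bundle file, and any path you pass in is your own.

```php
<?php
// Hypothetical helper: curl options for a verified HTTPS request.
function sslOptions(?string $caBundle = null): array
{
    $options = [
        CURLOPT_SSL_VERIFYPEER => true, // check the certificate chain
        CURLOPT_SSL_VERIFYHOST => 2,    // common name must match the host name
    ];
    if ($caBundle !== null) {
        // Optional path to a CA bundle file (supplied by the caller).
        $options[CURLOPT_CAINFO] = $caBundle;
    }
    return $options;
}

$ch = curl_init('https://www.example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt_array($ch, sslOptions());

// $data = curl_exec($ch); // uncomment to actually send the request
curl_close($ch);
```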


Cookies



Curl also supports cookies, and handling them is very simple. There are three options built into curl: CURLOPT_COOKIE, CURLOPT_COOKIEJAR, and CURLOPT_COOKIEFILE. CURLOPT_COOKIE sets a cookie for the current session. CURLOPT_COOKIEJAR names the file in which received cookies are saved when the session is closed. CURLOPT_COOKIEFILE names a file to read cookies from, written in either Netscape format or raw HTTP header style.
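
The three cookie options could be put to work roughly as follows. This is only a sketch: the helper names cookieOptions() and cookieHeader() and the file name cookies.txt are invented for the example, and cookieHeader() assumes the names and values contain no characters that need escaping.

```php
<?php
// Hypothetical helper: read cookies from, and save cookies to, the same file.
function cookieOptions(string $cookieFile): array
{
    return [
        CURLOPT_COOKIEFILE => $cookieFile, // cookies to send are read from here
        CURLOPT_COOKIEJAR  => $cookieFile, // received cookies are saved here on close
    ];
}

// Hypothetical helper: format name/value pairs for CURLOPT_COOKIE,
// which expects a string such as "name1=value1; name2=value2".
function cookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

$ch = curl_init('http://www.example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt_array($ch, cookieOptions(sys_get_temp_dir() . '/cookies.txt'));
curl_setopt($ch, CURLOPT_COOKIE, cookieHeader(['session' => 'abc123', 'lang' => 'en']));

// $data = curl_exec($ch); // uncomment to actually send the request
curl_close($ch);
```

Using the same file for both CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR lets cookies from one script run carry over to the next.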


Authenticating with curl



All popular methods of HTTP authentication are supported, including HTTP basic, digest, GSS and NTLM. As with other curl settings, the curl_setopt() function is used. To set the authentication method, call curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC); this selects basic authentication. Multiple authentication types can be combined with the bitwise OR operator (|). The credentials are set with curl_setopt($ch, CURLOPT_USERPWD, '[username]:[password]').



The other authentication options are CURLAUTH_DIGEST, CURLAUTH_GSSNEGOTIATE, and CURLAUTH_NTLM, for digest, GSS negotiate, and NTLM respectively. There are also CURLAUTH_ANY, which is an alias for CURLAUTH_BASIC|CURLAUTH_DIGEST|CURLAUTH_GSSNEGOTIATE|CURLAUTH_NTLM, and CURLAUTH_ANYSAFE, which is an alias for CURLAUTH_DIGEST|CURLAUTH_GSSNEGOTIATE|CURLAUTH_NTLM.


Example 1:


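<?php

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, 'http://www.example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// Use HTTP basic authentication only.
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);

// Credentials in the form username:password (placeholders).
curl_setopt($ch, CURLOPT_USERPWD, '[username]:[password]');

// Run the request; curl_exec() needs the session handle.
$data = curl_exec($ch);

curl_close($ch);

?>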




Example 2:


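<?php

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, 'http://www.example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// Let curl pick from all supported authentication methods.
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);

// Credentials in the form username:password (placeholders).
curl_setopt($ch, CURLOPT_USERPWD, '[username]:[password]');

// Run the request; curl_exec() needs the session handle.
$data = curl_exec($ch);

curl_close($ch);

?>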



In the first example we attempt only basic authentication, while in the second example we let curl try all supported authentication methods.
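
Between those two extremes you can allow just a chosen set of methods by combining the constants with |. The following is a minimal sketch of that idea; the helper name authOptions() and the credentials are placeholders invented for this example.

```php
<?php
// Hypothetical helper: curl options for HTTP authentication with a chosen
// set of methods, passed as a bitmask of CURLAUTH_* constants.
function authOptions(string $user, string $password, int $methods = CURLAUTH_BASIC): array
{
    return [
        CURLOPT_HTTPAUTH => $methods,
        CURLOPT_USERPWD  => $user . ':' . $password,
    ];
}

// Allow basic or digest authentication, but nothing else.
$options = authOptions('username', 'password', CURLAUTH_BASIC | CURLAUTH_DIGEST);

$ch = curl_init('http://www.example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt_array($ch, $options);

// $data = curl_exec($ch); // uncomment to actually send the request
curl_close($ch);
```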







