regex - PHP Web Crawler doesn't crawl .php files

- May 15, 2010

This is a simple web crawler I am trying to build:

<?php

$to_crawl = "http://samplewebsite.com/about.php";

function get_links($url)
{
    // Download the page; @ suppresses any warning if the request fails
    $input = @file_get_contents($url);
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    preg_match_all("/$regexp/siu", $input, $matches);

    // Capture group 2 holds the href values
    $l = $matches[2];

    foreach ($l as $link) {
        echo $link."</br>";
    }
}

get_links($to_crawl);

?>

When I run the script with the $to_crawl variable set to a URL that does not end in a file name, e.g. "facebook.com/about", it works, but for some reason it echoes nothing when the link ends in a '.php' file name. Can someone please help?

To get the links and their inner texts, you can use DOMDocument like this:

$dom = new DOMDocument;
@$dom->loadHTML($input);                    // $input holds the HTML source

$xp = new DOMXPath($dom);
$links = $xp->query('//a[@href]');          // XPath: all <a> tags that have an href attribute

$result = array();
foreach ($links as $link) {
    $result[] = array($link->getAttribute("href"), $link->nodeValue);
}
print_r($result);

See the ideone demo.
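
For anyone who wants to try the DOMDocument approach without a network request, here is a minimal self-contained sketch. The helper name extract_links and the sample markup are made up for illustration and are not part of the original answer; the sketch also swaps the @ operator for libxml_use_internal_errors so that parser warnings are collected rather than silenced wholesale.

```php
<?php
// Hypothetical helper (not from the original answer): returns an array of
// [href, inner text] pairs, one for every <a> element that has an href.
function extract_links(string $html): array
{
    $dom = new DOMDocument;

    // Keep libxml warnings about sloppy markup internal instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($dom);
    $result = array();
    foreach ($xp->query('//a[@href]') as $link) {
        $result[] = array($link->getAttribute('href'), trim($link->nodeValue));
    }
    return $result;
}

// Made-up sample markup standing in for a downloaded page.
$sample = '<html><body><p>'
        . '<a href="about.php">About us</a> '
        . '<a class="nav" href=\'/contact\'>Contact</a>'
        . '</p></body></html>';

$links = extract_links($sample);
print_r($links);
```

With a real page, the string passed in would be the return value of file_get_contents($url). That function returns false when the download fails, and with the @ operator in front the failure is silent, so it is worth checking for false before parsing; a failed request is one possible reason for empty output.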
