regex - PHP Web Crawler doesn't crawl .php files
This is a simple web crawler I am trying to build:
<?php
$to_crawl = "http://samplewebsite.com/about.php";

function get_links($url) {
    // Note: @ suppresses any warning if the fetch fails
    $input = @file_get_contents($url);
    // Capture the href value (group 2) and the inner text (group 3) of each <a> tag
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    preg_match_all("/$regexp/siU", $input, $matches);
    $l = $matches[2];
    foreach ($l as $link) {
        echo $link."</br>";
    }
}

get_links($to_crawl);
?>
When I run the script with the $to_crawl variable set to a URL that does not end in a file name, e.g. "facebook.com/about", it works. But for some reason it echoes nothing when the link ends in a '.php' file name. Can someone please help?
To get all the links and their inner texts, you can use DOMDocument, like this:
$dom = new DOMDocument;
@$dom->loadHTML($input); // $input is the HTML code
$xp = new DOMXPath($dom);
$links = $xp->query('//a[@href]'); // XPath: all <a> tags that have an href attribute
$result = array();
foreach ($links as $link) {
    $result[] = array($link->getAttribute("href"), $link->nodeValue);
}
print_r($result);
See the ideone demo.
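For reference, here is a minimal self-contained sketch of the DOMDocument approach. It adds one thing that is not in the original code: a check on the result of file_get_contents. The question's script hides fetch errors with @, so a failed request would print nothing at all; whether that is what actually happens with the '.php' URL is only a guess. The helper names fetch_html and extract_links are hypothetical, chosen for this example.

```php
<?php
// Hypothetical helper: fetch a page and log the failure reason
// instead of silently continuing with an empty result.
function fetch_html(string $url): ?string
{
    $html = @file_get_contents($url);
    if ($html === false) {
        $err = error_get_last(); // still populated when @ is used
        error_log("Could not fetch $url: " . ($err['message'] ?? 'unknown error'));
        return null;
    }
    return $html;
}

// Hypothetical helper: return [href, inner text] pairs for every <a href="..."> tag.
function extract_links(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects an empty string
    }
    $dom = new DOMDocument();
    // Collect parser warnings about malformed HTML internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $result = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $result[] = [$a->getAttribute('href'), trim($a->nodeValue)];
    }
    return $result;
}

// Usage on an inline sample, so no network access is needed.
// With a real page: $html = fetch_html('http://samplewebsite.com/about.php');
$sample = '<p><a href="about.php">About us</a> <a name="top">no href</a> <a href="/contact">Contact</a></p>';
print_r(extract_links($sample));
```

If fetch_html logs an error for the '.php' URL, the problem is in the request (for example the server refusing it) rather than in the link extraction.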