Extracting Links from an HTML File (PHP Cookbook)

11.9. Extracting Links from an HTML File

11.9.1. Problem

You need to extract the URLs that are specified inside an HTML document.

11.9.2. Solution

Use the pc_link_extractor( ) function shown in Example 11-2.

Example 11-2. pc_link_extractor( )

function pc_link_extractor($s) {
  $a = array();
  if (preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i',
                     $s,$matches,PREG_SET_ORDER)) {
    foreach($matches as $match) {
      array_push($a,array($match[1],$match[2]));
    }
  }
  return $a;
}

For example:

$links = pc_link_extractor($page);

11.9.3. Discussion

The pc_link_extractor( ) function returns an array. Each element of that array is itself a two-element array. The first element is the target of the link, and the second element is the text that is linked. For example:

$links=<<<END
Click <a outsideurl=">here</a> to visit a computer book 
publisher. Click <a href="http://www.sklar.com">over here</a> to visit 
a computer book author.
END;

$a = pc_link_extractor($links);
print_r($a);
Array
(
    [0] => Array
        (
            [0] => http://www.oreilly.com
            [1] => here
        )
    [1] => Array
        (
            [0] => http://www.sklar.com
            [1] => over here
        )
)

The regular expression in pc_link_extractor( ) won't work on all links, such as those that are constructed with JavaScript or some hexadecimal escapes, but it should function on the majority of reasonably well-formed HTML.

11.9.4. See Also

Recipe 13.8 for information on capturing text inside HTML tags; documentation on preg_match_all( ) at http://www.php.net/preg-match-all.