11.9. Extracting Links from an HTML File

11.9.2. Solution

Use the pc_link_extractor( ) function shown in Example 11-2.

Example 11-2. pc_link_extractor( )

    function pc_link_extractor($s) {
        $a = array();
        if (preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i',
                           $s, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $match) {
                // $match[1] is the href target; $match[2] is the linked text
                array_push($a, array($match[1], $match[2]));
            }
        }
        return $a;
    }

For example:

    $links = pc_link_extractor($page);

11.9.3. Discussion

The pc_link_extractor( ) function returns an array. Each element of that array is itself a two-element array: the first element is the target of the link, and the second element is the text that is linked. For example:

    $links = <<<END
    Click <a href="http://www.oreilly.com">here</a> to visit a computer book publisher.
    Click <a href="http://www.sklar.com">over here</a> to visit a computer book author.
    END;

    $a = pc_link_extractor($links);
    print_r($a);

    Array
    (
        [0] => Array
            (
                [0] => http://www.oreilly.com
                [1] => here
            )

        [1] => Array
            (
                [0] => http://www.sklar.com
                [1] => over here
            )

    )

The regular expression in pc_link_extractor( ) won't match every link, such as those constructed with JavaScript or written with some hexadecimal escapes, but it should work on the majority of reasonably well-formed HTML.

11.9.4. See Also

Recipe 13.8 for information on capturing text inside HTML tags; documentation on preg_match_all( ) at http://www.php.net/preg-match-all.

Copyright © 2003 O'Reilly & Associates. All rights reserved.
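
The Discussion notes that the regular expression in pc_link_extractor( ) won't match every link. As a rough sketch of one alternative, and not part of the original recipe, the same two-element pairs can be collected with PHP's DOM extension instead of a regular expression. The function name dom_link_extractor( ) is hypothetical, and the sketch assumes the DOM extension is enabled and that the input is ASCII or otherwise in an encoding libxml reads correctly by default.

```php
<?php
// Hypothetical helper for illustration; assumes the DOM extension is available.
function dom_link_extractor(string $html): array
{
    $links = [];
    if ($html === '') {
        // DOMDocument::loadHTML() rejects an empty string, so return early.
        return $links;
    }

    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them,
    // since real-world HTML is often malformed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Skip named anchors such as <a name="top"> that have no href.
        if ($anchor->hasAttribute('href')) {
            // Same pair layout as pc_link_extractor(): target first, then text.
            $links[] = [$anchor->getAttribute('href'), $anchor->textContent];
        }
    }

    return $links;
}
```

It is called the same way, for example $links = dom_link_extractor($page);. One difference to keep in mind: the second element here is the plain text content of the link, with any nested tags removed, whereas pc_link_extractor( ) returns the raw markup found between the opening and closing tags. Like the regular expression, this approach cannot see links that JavaScript builds at runtime.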