s c h e m a t i c s : c o o k b o o k

/ WebProgramming? / Cookbook.WebExtractingAllLinksFromPage

This Web


WebHome 
WebChanges 
TOC (with recipes)
NewRecipe 
WebTopicList 
WebStatistics 

Other Webs


Chicken
Cookbook
Erlang
Know
Main
Plugins
Sandbox
Scm
TWiki  

Schematics


Schematics Home
Sourceforge Page
SchemeWiki.org
Original Cookbook
RSS

Scheme Links


Schemers.org
Scheme FAQ
R5RS
SRFIs
Scheme Cross Reference
PLT Scheme SISC
Scheme48 SCM
MIT Scheme scsh
JScheme Kawa
Chicken Guile
Bigloo Tiny
Gambit LispMe
GaucheChez

Lambda the Ultimate
TWiki.org

Fetching all links from a page

Problem

Given an url, fetch the page, parse it and extract all links.

Solution

We will use string->url and get-pure-port from url.ss to fetch the page. For parsing, html->sxml from the permissive HTML parser library HtmlPrag is used. Then sxml->links makes a recursive descent of the parse tree collecting all links it discovers.

(require (lib "htmlprag.ss" "htmlprag")
         (lib "url.ss" "net")
         (lib "match.ss"))

; url->sxml : url-string -> sxml
;  fetch the page with the url url, parse it and return the parse tree
(define (url->sxml url)
  (html->sxml (get-pure-port (string->url url))))

; sxml->links : sxml -> (list (list url-string string))
;  return a list of all links in the parse tree
(define (sxml->links sxml)
  (let loop ([sxml  sxml]
             [links '()])
    (match sxml
      [('a ('@ ((not 'href ) _) ... ('href url)  . more) text)            
       (cons (list url text) links)]
      [(item ...)                        
       (append links 
               (apply append 
                      (map sxml->links sxml)))]
      [else                              
       links])))

> (sxml->links (url->sxml "http://www.neilvandyke.org/htmlprag/"))
(("htmlprag.scm" "file htmlprag.scm")
 ("htmlprag.html" "file htmlprag.html")
 ("htmlprag.pdf" "file htmlprag.pdf")
 ("htmlprag-0-7.plt" "file htmlprag-0-7.plt"))

> (url->sxml "http://www.neilvandyke.org/htmlprag/")
(*TOP*
 (*PI* xml "version=\"1.0\" encoding=\"UTF-8\"") "\n"
 (*DECL* DOCTYPE ...
 (html ...
  (head ...
   (title "Neil W. Van Dyke: HtmlPrag")   "\n")  "\n"
  (body ...
   (p (a (@ (href "http://www.neilvandyke.org/") 
            (title "Home page of Neil W. Van Dyke") 
            (accesskey "0"))
     (b "neilvandyke")
     ".org"))
    ...

Discussion

Get HtmlPrag from http://www.neilvandyke.org/htmlprag/. Installation instructions.

This code handles the general case where href can occur anywhere in the attribute list:

   ('a                               ; a link
       ('@                           ; with a list of attributes
           ((not 'href ) _) ...      ; starting with zero or more non-hrefs
                ('href url)          ; then a href with the url
                . more)              ; then some more
       text)                         ; and finally the text of displayed

-- JensAxelSoegaard - 19 Apr 2004

-- GordonWeakliem - 20 Apr 2004

Comments

As a matter of interest, I once did something similar with SCSH, using its s-exp based regular expressions (SRE). The actual regular expression used to match the links was:

(rx (w/nocase (: (or "href" "file") (* whitespace) "=" (* whitespace) "\""
                 (submatch (* (~ ("?\"")))) "\"")))

There's a bit more about this, with a more readable version of the above, in this post.

You could do the same thing in PLT with a string-based regexp. Of course, HTMLPrag is likely to scale much better to more complex requirements.

-- AntonVanStraaten - 19 Apr 2004

One problem with this solution is that it works only for anchor tags that have href as the first attribute.

(define (sxml->links sxml)
  (let loop ([sxml  sxml]
             [links '()])
    (match sxml
      [('a ('@ ((not 'href ) _) ... ('href url) . more) text) (cons (list url text) links)]
      [(item ...) (append links 
                          (apply append 
                                 (map sxml->links sxml)))]
      [else links])))
FYI, I had to (require (lib "match.ss")) to get this example to work.

-- GordonWeakliem - 20 Apr 2004

I have changed the solution to use your match expression and added an explanation of what's going on.

-- JensAxelSoegaard - 24 Apr 2004

CookbookForm
TopicType: Recipe
ParentTopic: WebRecipes
TopicOrder: 020

 
 
Copyright © 2004 by the contributing authors. All material on the Schematics Cookbook web site is the property of the contributing authors.
The copyright for certain compilations of material taken from this website is held by the SchematicsEditorsGroup - see ContributorAgreement & LGPL.
Other than such compilations, this material can be redistributed and/or modified under the terms of the GNU Lesser General Public License (LGPL), version 2.1, as published by the Free Software Foundation.
Ideas, requests, problems regarding Schematics Cookbook? Send feedback.
/ You are Main.guest