Reading and Writing RSS Files
You want to create an RSS (Rich Site Summary) file, or read one produced by another application. Handling RSS can be a difficult problem because of multiple incompatible specs calling themselves RSS, the general looseness of the format, and issues with escaping and encoding content properly. RSS is a case study in how difficult it is to produce valid XML, partly because RSS traditionally includes fragments of HTML, which is marked-up text but not necessarily valid XML.
Let's stipulate that regardless of the RSS format, there are only a few things we're actually interested in: we want to come up with a list of items, where each item contains a date, a URL for the item, a description, and optionally, a title. We'll use David van Horn's script for scraping the word-a-day RSS Feed from wordsmith.org. David is using the xml library that ships with PLT, and it seems to work well enough, so we'll go with that.
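(require (lib "xml.ss" "xml")
         (lib "match.ss")
         (lib "url.ss" "net")
         (lib "1.ss" "srfi"))

;; get-rss: fetch the XML document at url and return it as an S-expression.
;; Whitespace-only strings are dropped from <rss>, <channel> and <item>.
(define (get-rss url)
  (xml->xexpr
   ((eliminate-whitespace '(rss channel item) (lambda (x) x))
    (document-element (call/input-url url get-pure-port read-xml)))))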
The get-rss function retrieves some XML from a URL and converts it into an S-Expression. The eliminate-whitespace function returns a function that will remove strings containing only whitespace from the elements named in the first argument. This cleans up the S-Expression so that the match expression we'll write is simpler: we don't need to account for whitespace, which isn't significant to us anyway.
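To see what eliminate-whitespace buys us, here is a minimal sketch that runs the same pipeline on an in-memory string instead of a URL. The names sample and string->rss-xexpr are made up for this illustration and are not part of David's script.

```scheme
(require (lib "xml.ss" "xml"))

;; A tiny hand-written feed (hypothetical data), indented with whitespace
;; the way real feeds usually are.
(define sample
  "<rss version=\"2.0\">\n <channel>\n  <item>\n   <title>Hello</title>\n  </item>\n </channel>\n</rss>")

;; string->rss-xexpr: same steps as get-rss, but reading from a string port.
(define (string->rss-xexpr str)
  (xml->xexpr
   ((eliminate-whitespace '(rss channel item) (lambda (x) x))
    (document-element (read-xml (open-input-string str))))))

(string->rss-xexpr sample)
;; => (rss ((version "2.0")) (channel () (item () (title () "Hello"))))
```

Without the eliminate-whitespace step, the newline-and-space strings between the elements would show up as extra string children of rss, channel, and item, and every pattern would have to allow for them.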
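(define (rss->item rss)
  ;; good-item keeps real matches (pairs) and rejects the '() placeholders
  (letrec ((good-item (lambda (p) (and (pair? p) p))))
    (filter good-item
            (match rss
              ;; RSS 0.91 / 2.0: the items are children of <channel>
              (('rss _ ('channel _ . items))
               (map
                (match-lambda
                  (('item _
                          ('title _ title)
                          ('link _ link)
                          ('description _ . desc) . _)
                   (list link title desc))
                  ;; item with a <body> child; the _ after 'item matches
                  ;; the attribute list, as in the other clauses
                  (('item _
                          ('title _ title)
                          ('link _ link)
                          (_ . _)
                          ('body _ . body))
                   (list link title body))
                  (_ '()))
                items))
              ;; RSS 1.0 (RDF): the items are siblings of <channel>
              (('rdf:RDF (_ ...) _ ('channel . _) . items)
               (map
                (match-lambda
                  (('item _
                          ('title _ title)
                          ('link _ link)
                          ('description _ . desc) . _)
                   (list link title desc))
                  (_ '()))
                items))))))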
This expanded version of David's match expression has been wrapped in a function. rss->item will handle RSS 2.0 and 0.91 in the first case and RSS 1.0 (RDF) in the second case. The matching is nested: the initial match finds the items in the <channel> element, and match-lambda is then used to filter the child items found by the first match. The output of the match is a list of the link, title, and description elements; however, the match can also return an empty list, so we use SRFI-1's filter on the output of the whole thing to keep only the non-empty lists. Our good-item function returns either #f or the match.
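As a worked illustration of the nested matching, here is a cut-down sketch that handles only the first (RSS 2.0 / 0.91) case and runs on a hand-written S-Expression. sample-xexpr and sample->items are hypothetical names, and the feed contents are invented for the example.

```scheme
(require (lib "match.ss")
         (lib "1.ss" "srfi"))

;; The shape xml->xexpr produces for a small feed (hypothetical data).
(define sample-xexpr
  '(rss ((version "2.0"))
        (channel ()
                 (title () "Example feed")
                 (item ()
                       (title () "First word")
                       (link () "http://example.org/1")
                       (description () "A short description.")))))

;; Outer match: pull out the children of <channel>.
;; Inner match-lambda: turn each <item> into (link title desc); anything
;; else (here the channel's own <title>) becomes '() and is filtered out.
(define (sample->items rss)
  (filter pair?
          (match rss
            (('rss _ ('channel _ . children))
             (map (match-lambda
                    (('item _
                            ('title _ title)
                            ('link _ link)
                            ('description _ . desc) . _)
                     (list link title desc))
                    (_ '()))
                  children)))))

(sample->items sample-xexpr)
;; => (("http://example.org/1" "First word" ("A short description.")))
```

Note that the link and title come back as plain strings, while the description comes back as a list, because its pattern uses a dotted tail to collect every child of the element.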
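The match in rss->item is a bit fragile, since RSS doesn't dictate in what order child elements can appear under <item>: an item whose <link> comes before its <title> falls through to the catch-all clause and is dropped. However, in practice it works well enough.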
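If ordering does become a problem, one possible alternative is to look children up by name rather than by position. This is only a sketch and not part of David's script: child-contents and item->fields are hypothetical helpers, and they assume every element carries an attribute list in second position, which is what xml->xexpr produces by default.

```scheme
(require (lib "1.ss" "srfi"))

;; child-contents: return the contents of the first child of element whose
;; tag is tag, or #f if there is no such child.
;; Assumes element has the shape (name attributes child ...).
(define (child-contents tag element)
  (let ((hit (find (lambda (c) (and (pair? c) (eq? (car c) tag)))
                   (cddr element))))
    (and hit (cddr hit))))

;; item->fields: collect link, title and description in a fixed order,
;; whatever order they appear in under <item>.
(define (item->fields item)
  (list (child-contents 'link item)
        (child-contents 'title item)
        (child-contents 'description item)))

(item->fields '(item ()
                     (description () "d")
                     (link () "http://example.org/")
                     (title () "t")))
;; => (("http://example.org/") ("t") ("d"))
```

Unlike rss->item, this version returns every field as a list of contents, and a missing child shows up as #f instead of causing the whole item to be skipped.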
Feed Validator
RSS 2.0 Specification
RSS 1.0 Specification
Parsing RSS At All Costs
HtmlPrag
-- HectorEGomezMorales - 05 May 2004
-- GordonWeakliem - 06 Aug 2004