Matching Words

Problem

You want to select words from a string.

Solution

Determine the defining features of a word for your specific application, then write a regular expression that models this idea.

(define words-1
  (regexp "[^ ]+"))        ; as many non-space characters as possible
(define words-2
  (regexp "[A-Za-z'-]+"))  ; as many letters, apostrophes, and hyphens as possible

> (regexp-match words-1 "'alpha-beta gamma")
("'alpha-beta")
> (regexp-match words-2 "'alpha-beta&or gamma")
("'alpha-beta")
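
The calls above return only the first match. To collect every word in a string, one option is regexp-match*. This is a minimal sketch assuming a current Racket (the successor to PLT Scheme), where regexp-match* is available in racket/base. The names all-1 and all-2 are illustrative only.

```racket
#lang racket/base

;; The same two patterns as in the solution above.
(define words-1 (regexp "[^ ]+"))
(define words-2 (regexp "[A-Za-z'-]+"))

;; regexp-match* returns a list of all non-overlapping matches,
;; scanning the string from left to right.
(define all-1 (regexp-match* words-1 "'alpha-beta&or gamma"))
;; all-1 => '("'alpha-beta&or" "gamma")

(define all-2 (regexp-match* words-2 "'alpha-beta&or gamma"))
;; all-2 => '("'alpha-beta" "or" "gamma")
```

The two results differ on "&": words-1 treats it as part of a word, while words-2 treats it as a separator.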

Discussion

Scheme has no built-in definition of a word in a string. This is inconvenient, since you must define "word" yourself, but it is also the right behavior: what counts as a word varies significantly between applications, locales, encodings, and input sources.

Even within a single application the definition needs care. Natural languages pluralize nouns, attach possessive modifiers, allow hyphenated compounds, and so forth. The regular expression you use must reflect the range of words you expect to encounter.
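
As one illustration of modeling such features, the sketch below accepts runs of letters joined by a single internal apostrophe or hyphen, so possessives and hyphenated compounds stay whole while surrounding quote marks are dropped. The pattern and the names word-rx and found are hypothetical, and the code assumes a current Racket, where pregexp and regexp-match* are built in.

```racket
#lang racket/base

;; Hypothetical word pattern: one or more letters, optionally followed by
;; groups of (apostrophe or hyphen) + letters, e.g. "dog's", "well-known".
;; (?: ... ) groups without capturing.
(define word-rx (pregexp "[A-Za-z]+(?:['-][A-Za-z]+)*"))

;; Collect every word in the sample sentence.
(define found (regexp-match* word-rx "the dog's well-known 'trick'"))
;; found => '("the" "dog's" "well-known" "trick")
```

Whether a leading or trailing apostrophe should count as part of a word (as in "'tis" or "dogs'") is exactly the kind of application-specific decision the text describes; this pattern excludes both.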

The Perl-compatible regular expression module, pregexp, supports Perl-style constructs, with one proviso: backslash escapes such as \b must be written with two backslashes inside a Scheme string, i.e., "\\b", because the string reader consumes the first one. Using the pregexp module, we can search based on word boundaries:

; Using word-boundary \b.  Each result lists the whole match, then group 1:
> (pregexp-match
    (pregexp "\\b([A-Za-z]+)\\b") "The quick brown fox")
("The" "The")
> (pregexp-match
    (pregexp "\\b([A-Za-z]+)\\b") "ended.  Then we walked")
("ended" "ended")
; Using whitespace \s.  Skips words at the ends of the string or next to
; punctuation ("The" and "ended" here), so not desirable:
> (pregexp-match
    (pregexp "\\s([A-Za-z]+)\\s") "The quick brown fox")
(" quick " "quick")
> (pregexp-match
    (pregexp "\\s([A-Za-z]+)\\s") "ended.  Then we walked")
(" Then " "Then")

The pregexp module also provides the "word" character class \w, which matches any character allowed in a Perl identifier, that is, a letter, digit, or underscore. This is generally not what you want for natural-language words: it accepts digits and underscores but rejects apostrophes and hyphens.
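
A short sketch of how \w behaves on mixed input, assuming a current Racket where pregexp and regexp-match* are built in (the names ident-rx and pieces are illustrative):

```racket
#lang racket/base

;; \w+ matches runs of letters, digits, and underscores.
(define ident-rx (pregexp "\\w+"))

(define pieces (regexp-match* ident-rx "foo_bar baz42 don't"))
;; pieces => '("foo_bar" "baz42" "don" "t")
```

The identifier-like tokens survive intact, but the contraction "don't" is split at the apostrophe.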

Note that \\b and \\B are still useful. For example, "\\Bis\\B" matches the string "is" only inside a word, not at its edges: it matches in "whistle" but not in "this", where "is" ends the word.
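
The boundary behavior can be checked directly. This sketch assumes a current Racket, where regexp-match and regexp-match-positions accept a pregexp value; all names are illustrative.

```racket
#lang racket/base

;; \B requires that the position is NOT a word boundary, so "is" must have
;; word characters on both sides.
(define inner-is (pregexp "\\Bis\\B"))

(define in-whistle (regexp-match inner-is "whistle")) ; '("is")
(define in-this    (regexp-match inner-is "this"))    ; #f, "is" ends the word
(define in-is      (regexp-match inner-is "is"))      ; #f, "is" is the whole word

;; \b is the opposite: it finds "is" only as a standalone word.
(define whole-is (pregexp "\\bis\\b"))
(define standalone
  (regexp-match-positions whole-is "this is it"))     ; '((5 . 7))
```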

Contributors

-- BrentAFulgham - 18 May 2004

Copyright © 2004 by the contributing authors. All material on the Schematics Cookbook web site is the property of the contributing authors.
The copyright for certain compilations of material taken from this website is held by the SchematicsEditorsGroup - see ContributorAgreement & LGPL.
Other than such compilations, this material can be redistributed and/or modified under the terms of the GNU Lesser General Public License (LGPL), version 2.1, as published by the Free Software Foundation.