s c h e m a t i c s : c o o k b o o k

/ Cookbook.FileReadRandomLine

This Web


WebHome 
WebChanges 
TOC (with recipes)
NewRecipe 
WebTopicList 
WebStatistics 

Other Webs


Chicken
Cookbook
Erlang
Know
Main
Plugins
Sandbox
Scm
TWiki  

Schematics


Schematics Home
Sourceforge Page
SchemeWiki.org
Original Cookbook
RSS

Scheme Links


Schemers.org
Scheme FAQ
R5RS
SRFIs
Scheme Cross Reference
PLT Scheme SISC
Scheme48 SCM
MIT Scheme scsh
JScheme Kawa
Chicken Guile
Bigloo Tiny
Gambit LispMe
GaucheChez

Lambda the Ultimate
TWiki.org

Retrieving a Line at Random from a File of Unknown Size

Problem

You have a file of unknown size, and you need to select one line at random from it. The file may be very large, so you don't want to put it all in memory at once, and want to do this in a single pass.

Solution

;; random-line : (U string path) -> string
;;
;; Returns a random line from the given file, using only a single pass through the file.
;; Each line has an equal chance of being chosen.
(define (random-line/stdin)
  (let reader ([chance 1] ; chance that a line becomes the new result
               [result ""]) ; current result if there are no more lines
    (cond [(eof-object? (peek-char)) result] ; end of file, return current result
          [(zero? (random chance)) ; this line is the new result
           (reader (add1 chance) (read-line (current-input-port) 'any))] ; continue
          [else (read-line (current-input-port) 'any) ; discard a line
           (reader (add1 chance) result)]))) ; continue
  

(define (random-line filename)
  (with-input-from-file filename random-line/stdin))

Discussion

How this works: This program stores one line, it's eventual result, at all times. It continues reading lines until it comes to the end of the file; when it reads line n there is a 1/ n chance that it replaces the result, and an ( n -1)/ n chance that the result remains the same.

As an example, consider the case that there are 3 lines in the file. During the first iteration, there is a 1/1 chance that the line read is stored as the new result. During the second iteration, there is a 1/2 chance that the line read is stored as the new result, and a 1/2 chance that the first line remains. During the third iteration, there is a 1/3 chance that the third line is stored as the result, and a 2/3 chance that the previous result remains. 2/3 * 1/2 = 1/3, thus there is a cumulative 1/3 chance that the first line will be the result and a cumulative 1/3 chance that the second line will be the result. On the fourth iteration, the result (which has an equal probability of being any of the three lines read so far) is returned.


Comments about this recipe

If the record size is guaranteed to be constant, this problem has a much simpler solution. However, I assumed that you meant for this function to get a random line from any text file. I apologize if the code style isn't Cookbook Convention; this is my first edit. Feel free to tweak it.

And of course, the program has been tested. I had it draw 3000 random liens from a 6 line file, and got each line ~500 times, with no apparent favoritism toward earlier or later lines. This function WOULD fail if the number of lines in the file is enough to make the random number generator begin giving skewed results, ie: a significant fraction of 2147483647.
-- BrianJ - 11 Dec 2006

Nice recipe BrianJ! I've added a comment for the code. You might want to move your discussion of the algorithm into the Discussion section above, and the code could use with-input-from-file, making it more robust against errors. It is also worth noting that reserviour sampling can be used for many other situations. For example, I've used it draw random sub-trees from a tree. I was first shown this trick by OlegK.

-- NoelWelsh - 11 Dec 2006

Thanks for your comments. I'm a little busy due to exams tomorrow, but I'll try to implement the rest of your suggestions soon. UPDATE: I used with-input-from-file, as requested.

Contributors

-- GordonWeakliem - 13 Aug 2004

-- BrianJ - 11 Dec 2006

CookbookForm
TopicType: Recipe
ParentTopic: FileRecipes
TopicOrder: 999

 
 
Copyright © 2004 by the contributing authors. All material on the Schematics Cookbook web site is the property of the contributing authors.
The copyright for certain compilations of material taken from this website is held by the SchematicsEditorsGroup - see ContributorAgreement & LGPL.
Other than such compilations, this material can be redistributed and/or modified under the terms of the GNU Lesser General Public License (LGPL), version 2.1, as published by the Free Software Foundation.
Ideas, requests, problems regarding Schematics Cookbook? Send feedback.
/ You are Main.guest