Processing Comma Separated Values

Problem

You want to work with comma separated value records, such as those exported from popular spreadsheet and database programs. Note that this recipe works for any similarly delimited format.

Solution

The naive implementation might be something like this, where CsvString is bound to the record to split:
1> string:tokens(CsvString, ",").
One shortcoming here is that 'empty' fields are gobbled up:
1> string:tokens("some,fields,,are,,empty,", ",").
["some","fields","are","empty"]
The following explode function will solve this problem:
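-module(mymodule).
-export([explode/2]).

%% Split Instr on every occurrence of the delimiter string Token,
%% keeping empty fields.
explode(Instr, Token) ->
    explode(Instr, Token, []).

%% Search from the right so each field can be prepended to Result
%% and the final list comes out in the original order.
explode(Instr, Token, Result) ->
    case string:rstr(Instr, Token) of
        0 ->
            [Instr | Result];
        Tpos ->
            Field = string:substr(Instr, Tpos + string:len(Token)),
            explode(string:substr(Instr, 1, Tpos - 1), Token, [Field | Result])
    end.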
Which gives us:
1> mymodule:explode("some|||fields||||||are||||||empty|||", "|||").
["some","fields",[],"are",[],"empty",[]]
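On Erlang/OTP 20 or later (an assumption about your runtime, since this recipe predates that release), string:split/3 with the all option also keeps empty fields and accepts a multi-character separator, so a hand-written explode may not be needed. A minimal sketch, where the module name split_sketch and the function fields/2 are hypothetical names chosen for illustration:

```erlang
-module(split_sketch).
-export([fields/2]).

%% Split Str on every occurrence of Sep, keeping empty fields.
%% Assumes OTP 20 or later, where string:split/3 was introduced.
fields(Str, Sep) ->
    string:split(Str, Sep, all).
```

Calling split_sketch:fields("some,fields,,are,,empty,", ",") returns ["some","fields",[],"are",[],"empty",[]]. Like explode, this does not understand quoted fields.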
Other shortcomings show up pretty quickly. For starters, CSV format often encloses fields in quotes, so that fields can contain commas. On top of that, quoted fields can themselves contain quotes, escaped with a backslash (\). Let's imagine a CSV format for books, where the format is author,title,ISBN,publisher:
2> Csv = "David Halberstam, \"War in a Time of Peace: Bush, Clinton, and the Generals\", B0000C37EA, Scribner".
"David Halberstam, \"War in a Time of Peace: Bush, Clinton, and the Generals\", B0000C37EA, Scribner"
3> string:tokens(Csv, ",").
["David Halberstam",
 " \"War in a Time of Peace: Bush",
 " Clinton",
 " and the Generals\"",
 " B0000C37EA",
 " Scribner"]
Clearly, the easy solution won't work for the general case. We need a small state machine instead. As we traverse the string, we move between the following states: in_field (the current position is inside an unquoted field), in_quote (we're inside a quoted string, where commas are ordinary characters), delim (we've just read a delimiter, which ends the current field), and escape_char (we've just read a backslash inside a quoted string, so the next character is taken literally).
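The state machine described above can be sketched as a set of mutually recursive functions, one per state. This is only an illustration under stated assumptions, not a complete CSV library: the module name csv_sketch and the function parse_line/1 are hypothetical, the delimiter is fixed to a comma, the delim state is folded into the in_field clause that matches a comma, and string:trim/1 (OTP 20 or later) is used to drop the spaces that follow each comma in the book example.

```erlang
-module(csv_sketch).
-export([parse_line/1]).

%% Parse one CSV record into a list of field strings.
%% Assumed dialect: comma delimiter, double-quoted fields,
%% backslash escapes the next character inside quotes.
parse_line(Line) ->
    in_field(Line, [], []).

%% in_field: outside quotes. Acc is the current field (reversed),
%% Fields is the list of completed fields (reversed).
in_field([], Acc, Fields) ->
    lists:reverse([finish(Acc) | Fields]);
in_field([$, | Rest], Acc, Fields) ->      % delim: close the current field
    in_field(Rest, [], [finish(Acc) | Fields]);
in_field([$" | Rest], Acc, Fields) ->      % opening quote
    in_quote(Rest, Acc, Fields);
in_field([C | Rest], Acc, Fields) ->
    in_field(Rest, [C | Acc], Fields).

%% in_quote: commas are ordinary characters here.
in_quote([], Acc, Fields) ->               % unterminated quote: keep what we have
    in_field([], Acc, Fields);
in_quote([$\\ | Rest], Acc, Fields) ->     % backslash: next character is literal
    escape_char(Rest, Acc, Fields);
in_quote([$" | Rest], Acc, Fields) ->      % closing quote
    in_field(Rest, Acc, Fields);
in_quote([C | Rest], Acc, Fields) ->
    in_quote(Rest, [C | Acc], Fields).

%% escape_char: take one character literally, then resume in_quote.
escape_char([], Acc, Fields) ->
    in_field([], Acc, Fields);
escape_char([C | Rest], Acc, Fields) ->
    in_quote(Rest, [C | Acc], Fields).

%% Restore character order and trim surrounding whitespace.
%% Limitation: this also trims whitespace at the edges of quoted fields.
finish(Acc) ->
    string:trim(lists:reverse(Acc)).
```

With Csv bound as in the shell session above, csv_sketch:parse_line(Csv) returns ["David Halberstam", "War in a Time of Peace: Bush, Clinton, and the Generals", "B0000C37EA", "Scribner"], and empty fields are preserved as [].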

Discussion

A set of Scheme utilities (written by NeilVanDyke) handles this problem quite nicely, but it cannot be used from Erlang directly. Porting it would make a good Erlang library and should be relatively easy.

Contributors

Based on work by GordonWeakliem.

-- BrentAFulgham - 23 Aug 2004

-- JosephWecker - 17 May 2007

Copyright © 2004 by the contributing authors. All material on the Erlang Cookbook web site is the property of the contributing authors.
This material can be redistributed and/or modified under the terms of the GNU Lesser General Public License (LGPL), version 2.1, as published by the Free Software Foundation.