What is best approach to parsing CSV string to tab-delimited string?
In PHP, I'm able to utilize the implode and str_getcsv functions to convert a comma-delimited string to a tab-delimited string. The following is an example.
$strtotab = implode("\t", str_getcsv($str, ","));
I am wondering if there is a similar way to do this in Cache and ensure that it takes into account fields enclosed in quotes. I am away of the CSVTOCLASS() method in the %SQL.Util.Procedures class. As I understand it, this is used to convert an entire file rather than a single string. The goal is to be compliant with RFC 4180.
To provide an example, the following comma-delimited string should be viewed as having 5 pieces, and not 7.
ABC Company,"123 Main St, Ste 102","Anytown, DC",10001,234-567-8901
Any guidance would be appreciated.
$tr(str,",",$c(9))
or
$replace(str,",",$c(9))
This doesn't take into account fields enclosed in double quotes. The goal is to remain compliant with RFC 4180.
Are the %occLibrary entry points documented anywhere as being available for use by applications?
This works beautifully.
Are you aware of where the %occLibrary documentation may be referenced?
I discourage since years the use of $ZU(...) functions as they aren't documented since 2010.
I recently had to dig back to 2009 for just a weak hint what might happen.
It is even worse with all the internal stuff around %occ* and similar.
No docs. No guaranty of the life cycle. No (external) notice of eventual changes. Mostly as a deployed code.
If it is used inside a $system.* or part of a generated code that's OK. The responsibility is not at the user side.
Verifying those "specials" with every release change can be a very heavy exercise.
(just experiencing this on a system locked down to on an older version unable to migrate)
Typically, I've dealt with XML, EDI, and tab-delimited files within Cache. All CSV files were typically done in PHP. It's definitely a great exercise.
I did want to note that the above solution did not work with the following string.
CSIN-01,01/01/2002,01/15/2003,,1,1002381,ABC CORP,02/22/2018,"NET ""15""","ANYTOWN, DC",JANE DOE,"ANYTOWN, DC",06/13/2018,$13900.00 ,$101.15 ,$5.13 ,$308.00 ,$281.70 ,$0.00 ,$0.00 ,$217.00 ,JOHN DOE
Nice work Gerd!
GREAT !
all well-documented code!
no (dirty) Harry_Potter_Coding !
I was sure you know it
This is great except in the instance where you don't know the descriptors. We utilize CSVTOCLASS for situations where we do.
This wouldn't account for double quotes within the field data.
I suggest the following strategy:
In the above algorithm you may want to cater for the case where concatenation of consecutive lines would exceed the maximum local string length.
Maybe the name of the topic to specify a CSV-string instead of string ?
After clarification, the question became more clear.
The initial post advised that it was to take into account fields enclosed in double quotes. CSV stands for comma-separated value, so I'm not sure how that would change the information.
I did add the example since it appeared someone may not understand what taking into account double quotes means.
Mack- the %occLibrary calls are not documented for use by applications.
As your very interesting questions has proven the ease with which this is solved by:
$$CSVtoList^%occLibrary(csv)
I suggest you contact the WRC to request an enhancement to have this wrapped in a supported (and documented!) API that may be more future-proof than this internal and undocumented call.
Here's the next challenge, and one that I'm encountering using the $$CSVtoList^%occLibrary... line feeds within the field data (enclosed in double quotes).
The line ends with CRLF, and within the double quote enclosed field there appears 
 followed by a Line feed. The RFC 4180 standard states this should be acceptable, but I'm reviewing how to review this now. Any thoughts?
The Unix/Linux world often uses LF := $C(10) as line terminator
while in Win (and VMS) world CRLF := $C(C13,10) is a default.
So you depend on the source system providing the data.
Suggested approach: use LF as the (common) line terminator and just drop $C(13) or 
 from your input record by
$replace($translate(record,$c(13)),"
","")
Before any other processing.
I was able to find in the documentation where it's stated that multi-line characters unsupported.
I attempted to use the following as the line terminator, which is in fact a CRLF. For some reason, it terminates even when only the LF occurs.
S fi.LineTerminator = $c(13,10)
I'd like to know this as well. It would definitely save me a lot of time by referencing this documentation.
you are right. I didn't think on empty parts and doubled double quotes
which I never met from CSV.
next level exercise
OK, this handles double quotes. But only INSIDE a quoted string
set res="",newsep=$g(newsep,$c(9)),newquote=$g(newquote,quote)
for {
if $g(string)[sep {
if $e(string)=quote {
set string=$replace(string,"""""",$c(2)) ; exclude double quotes
set part=$P(string,quote,2)
,string=$replace($p(string,part_quote_sep,2,*),$c(2),"""""")
,res=res_newquote_$replace(part,$c(2),"""""")_newquote_newsep }
else {
set part=$P(string,sep),string=$p(string,sep,2,*)
,res=res_part_newsep }
} else {
set res=res_$g(string) quit }
}
quit res
}
-
HTH
Mack, can you give me an example of the kind of CSV line that you mean here?
No.
If you beware to use %occLibrary, you may look into side %Regex.Matcher
Here is another documented solution:
Result:
Well, the documentation leaves a lot of room for improvement
and you can sure write this into a much shorter version but I wanted to make sure folks get the (somewhat abstract) idea.
When your line is $list'ed you can be sure further processing is not affected by delimiters. I also have to say that my solution cannot compete with John (Murray)'s crisp & elegant one-liner.
Result:
not being verbose in %occ* world I had this solution also allowing to change quoting
set res="",newsep=$g(newsep,$c(9)),newquote=$g(newquote,quote)
for {
if $g(string)[sep {
if $e(string)=quote {
set part=$P(string,quote,2),string=$p(string,part_quote_sep,2,*)
,res=res_newquote_part_newquote_newsep }
else {
set part=$P(string,sep),string=$p(string,sep,2,*)
,res=res_part_newsep }
} else {
set res=res_$g(string) quit }
}
quit res
}
BTW.
It's an excellent test exercise for new COS programmers
I'll add it to my collection.
Thanks
You could use CSV() like this... (sorry for the weird formatting)
isdDemoCSVToList
quit
Test() public
{
set tDescriptor = "ROW(STR VARCHAR,STR2 VARCHAR,STR3 VARCHAR,STR4 VARCHAR, STR5 VARCHAR)"
set tLine = "ABC Company,""123 Main St, Ste 102"",""Anytown, DC"",10001,234-567-8901"
do CSVToList(tLine,tDescriptor,.tOutput)
write !
zw tOutput
kill tOutput
for i=1:1:3
{
set tLineArray($i(tLineArray)) = """eins"",""deux"",line"_i
}
do CSVToList(.tLineArray,tDescriptor,.tOutput)
write !
zw tOutput
}
CSVToList(&pLines,pDescriptor,&pListArray) public
{
#dim tRS as %SQL.StatementResult
kill tListArray
if ($d(pLines)<10) { set pLines(1)=pLines,pLines=1 }
set tTempStream = ##class(%Stream.TmpCharacter).%New()
for tIdx = 1 : 1 : pLines
{
do tTempStream.WriteLine(pLines(tIdx))
}
do tTempStream.Rewind()
set tRS = ##class(%SQL.Statement).%ExecDirect(,"call %SQL_Util.CSV(,?,?)",.pDescriptor,tTempStream).%NextResult()
while (tRS.%Next(.tSC))
{
set tList = ""
for tCol = 1 : 1 : tRS.%GetMetadata().columnCount
{
set tValue = tRS.%GetData(tCol)
set tList = tList _ $lb(tValue)
}
set pListArray($i(pListArray))=tList
}
quit
}
Your output then looks like this
tOutput(1)=$lb("ABC Company","123 Main St, Ste 102","Anytown, DC","10001","234-567-8901")
tOutput=3
tOutput(1)=$lb("eins","deux","line1","","")
tOutput(2)=$lb("eins","deux","line2","","")
tOutput(3)=$lb("eins","deux","line3","","")
My proposal:
Without abbreviations:USER>w csv ABC Company,"123 Main St, Ste 102","Anytown, DC",10001,234-567-8901 USER>s i=0 USER>f s i=$f(csv,",",i) q:'i i $l($e(csv,1,i-2),"""")#2 s $e(csv,i-1)=$c(9) USER>w csv ABC Company "123 Main St, Ste 102" "Anytown, DC" 10001 234-567-8901 USER>
USER>write csv ABC Company,"123 Main St, Ste 102","Anytown, DC",10001,234-567-8901 USER>set i=0 USER>for set i=$find(csv,",",i) quit:'i if $length($extract(csv,1,i-2),"""")#2 set $extract(csv,i-1)=$char(9) USER>write csv ABC Company "123 Main St, Ste 102" "Anytown, DC" 10001 234-567-8901 USER>
Class IRISDemo.Util.FileReader Extends %RegisteredObject { /// Use this Query with %ResultSet class to read comma (,) separated files. Query CSV(pFileName As %String, pFileEncoding As %String = "UTF8") As %Query { } /// Use this Query with %ResultSet class to read semicolon (;) separated files. Query CSV2(pFileName As %String, pFileEncoding As %String = "UTF8") As %Query { } /// Use this Query with %ResultSet class to read tab separated files. Query TSV(pFileName As %String, pFileEncoding As %String = "UTF8") As %Query { } /// Index a header from a CSV file so we can later get every row of the file /// and read every column by it's name. This method supports fields quoted, double /// quoted or not quoted or a mix of the three. The only thing that is required for this to work /// is that the correct field separator is specified (Comma, Semicolon or Tab). ClassMethod IndexHeader(pHeader As %String, pSeparator As %String = ",", Output pHeaderIndex) { //How many columns? Set pHeaderIndex=$Length(pHeader,pSeparator) Set tRegexSeparator=pSeparator If tRegexSeparator=$Char(9) { Set tRegexSeparator="\x09" } //Let's build a regular expression to read all the data without the quotes... Set pHeaderIndex("Regex")="" For i=1:1:pHeaderIndex { Set $Piece(pHeaderIndex("Regex"),tRegexSeparator,i)="\""?'?(.*?)\""?'?" } Set pHeaderIndex("Regex")="^"_pHeaderIndex("Regex") //Let's use this regular expression to index the column names... Set oMatcher = ##class(%Regex.Matcher).%New(pHeaderIndex("Regex")) Do oMatcher.Match(pHeader) //Now let's index the colum names For i=1:1:oMatcher.GroupCount { Set pHeaderIndex("Columns",i)=$ZStrip(oMatcher.Group(i),"<>W") } } ClassMethod IndexRow(pRow As %String, ByRef pHeaderIndex As %String, Output pIndexedRow As %String) { Set oMatcher = ##class(%Regex.Matcher).%New(pHeaderIndex("Regex")) Do oMatcher.Match(pRow) //Now let's index the colum names For i=1:1:oMatcher.GroupCount { Set $List(pIndexedRow,i)=oMatcher.Group(i) } } ClassMethod CSVGetInfo(colinfo As %List, parminfo As %List, idinfo As %List, qHandle As %Binary, extoption As %Integer = 0, extinfo As %List) As %Status { Quit ..FileGetInfo(.colinfo, .parminfo, .idinfo, .qHandle, .extoption, .extinfo) } ClassMethod CSV2GetInfo(colinfo As %List, parminfo As %List, idinfo As %List, qHandle As %Binary, extoption As %Integer = 0, extinfo As %List) As %Status { Quit ..FileGetInfo(.colinfo, .parminfo, .idinfo, .qHandle, .extoption, .extinfo) } ClassMethod TSVGetInfo(colinfo As %List, parminfo As %List, idinfo As %List, qHandle As %Binary, extoption As %Integer = 0, extinfo As %List) As %Status { Quit ..FileGetInfo(.colinfo, .parminfo, .idinfo, .qHandle, .extoption, .extinfo) } ClassMethod FileGetInfo(colinfo As %List, parminfo As %List, idinfo As %List, qHandle As %Binary, extoption As %Integer = 0, extinfo As %List) As %Status { Merge tHeaderIndex = qHandle("HeaderIndex") Set colinfo = "" For i=1:1:tHeaderIndex { Set tColName = tHeaderIndex("Columns",i) Set colinfo=colinfo_$LB($LB(tColName)) } Set parminfo=$ListBuild("pFileName","pFileEncoding") Set extinfo="" Set idinfo="" Quit $$$OK } ClassMethod CSVExecute(ByRef qHandle As %Binary, pFileName As %String, pFileEncoding As %String = "UTF8") As %Status { Set qHandle("Separator")="," Quit ..FileExecute(.qHandle, pFileName, pFileEncoding) } ClassMethod CSV2Execute(ByRef qHandle As %Binary, pFileName As %String, pFileEncoding As %String = "UTF8") As %Status { Set qHandle("Separator")=";" Quit ..FileExecute(.qHandle, pFileName, pFileEncoding) } ClassMethod TSVExecute(ByRef qHandle As %Binary, pFileName As %String, pFileEncoding As %String = "UTF8") As %Status { Set qHandle("Separator")=$Char(9) Quit ..FileExecute(.qHandle, pFileName, pFileEncoding) } ClassMethod FileExecute(ByRef qHandle As %Binary, pFileName As %String, pFileEncoding As %String = "UTF8") As %Status { Set tSC = $System.Status.OK() Try { Set oFile = ##class(%Stream.FileCharacter).%New() If pFileEncoding'="" Set oFile.TranslateTable=pFileEncoding Set oFile.Filename=pFileName Set tHeader=oFile.ReadLine() Do ..IndexHeader(tHeader, qHandle("Separator"), .tHeaderIndex) Merge qHandle("HeaderIndex")=tHeaderIndex Set qHandle("File")=oFile } Catch (oException) { Set tSC = oException.AsStatus() } Quit tSC } ClassMethod CSVClose(ByRef qHandle As %Binary) As %Status [ PlaceAfter = FileExecute ] { Quit ..FileClose(.qHandle) } ClassMethod CSV2Close(ByRef qHandle As %Binary) As %Status [ PlaceAfter = FileExecute ] { Quit ..FileClose(.qHandle) } ClassMethod TSVClose(ByRef qHandle As %Binary) As %Status [ PlaceAfter = FileExecute ] { Quit ..FileClose(.qHandle) } ClassMethod FileClose(ByRef qHandle As %Binary) As %Status [ PlaceAfter = FileExecute ] { #Dim oFile As %Library.FileCharacterStream Set tSC = $System.Status.OK() Try { Kill qHandle("File") Kill qHandle("HeaderIndex") } Catch (oException) { Set tSC = oException.AsStatus() } Quit tSC } ClassMethod CSVFetch(ByRef qHandle As %Binary, ByRef Row As %List, ByRef AtEnd As %Integer = 0) As %Status [ PlaceAfter = FileExecute ] { Quit ..FileFetch(.qHandle, .Row, .AtEnd) } ClassMethod CSV2Fetch(ByRef qHandle As %Binary, ByRef Row As %List, ByRef AtEnd As %Integer = 0) As %Status [ PlaceAfter = FileExecute ] { Quit ..FileFetch(.qHandle, .Row, .AtEnd) } ClassMethod TSVFetch(ByRef qHandle As %Binary, ByRef Row As %List, ByRef AtEnd As %Integer = 0) As %Status [ PlaceAfter = FileExecute ] { Quit ..FileFetch(.qHandle, .Row, .AtEnd) } ClassMethod FileFetch(ByRef qHandle As %Binary, ByRef Row As %List, ByRef AtEnd As %Integer = 0) As %Status [ PlaceAfter = FileExecute ] { Set tSC = $System.Status.OK() Try { Set Row = "" Set oFile = qHandle("File") If oFile.AtEnd { Set AtEnd=1 Quit } Merge tHeaderIndex=qHandle("HeaderIndex") While 'oFile.AtEnd { Set tRow=oFile.ReadLine() Continue:tRow="" Quit } Do ..IndexRow(tRow, .tHeaderIndex, .Row) } Catch (oException) { Set tSC = oException.AsStatus() } Quit tSC } }
It's amazing how many different ways everyone's taken to read a standard format. I thought I'd add this excerpt from the RFC 4180. So far, I haven't found a way to read a CSV line that takes into account all of the requirements of the escaped field, specifically with the LF character. To see this for yourself, create a new Excel workbook, data into a field and use Alt+Enter a few times, and then save the file as a CSV file. The line feeds are within the field data.
I've created a Gist containing the source code for a CSV reader.
It will load CSV data from a file, stream or string, exposing each record via a Next() record method and Get() and GetAt() field methods.
It implements various options for changing the field and EOL delimiter, automatic encoding from UTF8 files, quote stripping and normalisation of EOL character(s) to a preferred internal EOL character(s).
https://gist.github.com/SeanConnelly/cb4ff970c2df5266d24c8802d56f48da
I've been using some ugly data and looks to be working good enough, here is output from the Display() method...
Record No : 1 address = 1 long street, somewhere, north pole poetry = "Poetry" ======== From the "Anglo-Saxon" Rune Poem (Rune version): "ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ ᛋᚳᛖᚪᛚ᛫ᚦᛖᚪᚻ᛫ᛗᚪᚾᚾᚪ᛫ᚷᛖᚻᚹᛦᛚᚳ᛫ᛗᛁᚳᛚᚢᚾ᛫ᚻᛦᛏ᛫ᛞᚫᛚᚪᚾ ᚷᛁᚠ᛫ᚻᛖ᛫ᚹᛁᛚᛖ᛫ᚠᚩᚱ᛫ᛞᚱᛁᚻᛏᚾᛖ᛫ᛞᚩᛗᛖᛋ᛫ᚻᛚᛇᛏᚪᚾ"᛬ From Laȝamon's Brut (The Chronicles of England, Middle English, West Midlands): "An preost wes on leoden, Laȝamon was ihoten He wes Leovenaðes sone -- liðe him be Drihten. He wonede at Ernleȝe at æðelen are chirechen, Uppen Sevarne staþe, sel þar him þuhte, Onfest Radestone, þer he bock radde." (The third letter in the author's name is Yogh, missing from many fonts; CLICK HERE for another Middle English sample with some explanation of letters and encoding). From the Tagelied of Wolfram von Eschenbach (Middle High German): "Sîne klâwen durh die wolken sint geslagen, er stîget ûf mit grôzer kraft, ich sih in grâwen tägelîch als er wil tagen, den tac, der im geselleschaft erwenden wil, dem werden man, den ich mit sorgen în verliez. ich bringe in hinnen, ob ich kan. sîn vil manegiu tugent michz leisten hiez." Some lines of Odysseus Elytis (Greek): Monotonic: "Τη γλώσσα μου έδωσαν ελληνική το σπίτι φτωχικό στις αμμουδιές του Ομήρου. Μονάχη έγνοια η γλώσσα μου στις αμμουδιές του Ομήρου. από το Άξιον Εστί του Οδυσσέα Ελύτη" Polytonic: "Τὴ γλῶσσα μοῦ ἔδωσαν ἑλληνικὴ τὸ σπίτι φτωχικὸ στὶς ἀμμουδιὲς τοῦ Ὁμήρου. Μονάχη ἔγνοια ἡ γλῶσσα μου στὶς ἀμμουδιὲς τοῦ Ὁμήρου. ἀπὸ τὸ Ἄξιον ἐστί τοῦ Ὀδυσσέα Ἐλύτη The first stanza of Pushkin's Bronze Horseman (Russian): На берегу пустынных волн Стоял он, дум великих полн, И вдаль глядел. Пред ним широко Река неслася; бедный чёлн По ней стремился одиноко. По мшистым, топким берегам Чернели избы здесь и там, Приют убогого чухонца; И лес, неведомый лучам В тумане спрятанного солнца, Кругом шумел." Šota Rustaveli's Veṗxis Ṭq̇aosani, ̣︡Th, The Knight in the Tiger's Skin (Georgian): "ვეპხის ტყაოსანი შოთა რუსთაველი ღმერთსი შემვედრე, ნუთუ კვლა დამხსნას სოფლისა შრომასა, ცეცხლს, წყალსა და მიწასა, ჰაერთა თანა მრომასა; მომცნეს ფრთენი და აღვფრინდე, მივჰხვდე მას ჩემსა ნდომასა, დღისით და ღამით ვჰხედვიდე მზისა ელვათა კრთომაასა." Tamil poetry of Subramaniya Bharathiyar: சுப்ரமணிய பாரதியார் (1882-1921): "யாமறிந்த மொழிகளிலே தமிழ்மொழி போல் இனிதாவது எங்கும் காணோம், பாமரராய் விலங்குகளாய், உலகனைத்தும் இகழ்ச்சிசொலப் பான்மை கெட்டு, நாமமது தமிழரெனக் கொண்டு இங்கு வாழ்ந்திடுதல் நன்றோ? சொல்லீர்! தேமதுரத் தமிழோசை உலகமெலாம் பரவும்வகை செய்தல் வேண்டும்." Kannada poetry by Kuvempu — ಬಾ ಇಲ್ಲಿ ಸಂಭವಿಸು "ಬಾ ಇಲ್ಲಿ ಸಂಭವಿಸು ಇಂದೆನ್ನ ಹೃದಯದಲಿ ನಿತ್ಯವೂ ಅವತರಿಪ ಸತ್ಯಾವತಾರ ಮಣ್ಣಾಗಿ ಮರವಾಗಿ ಮಿಗವಾಗಿ ಕಗವಾಗೀ... ಮಣ್ಣಾಗಿ ಮರವಾಗಿ ಮಿಗವಾಗಿ ಕಗವಾಗಿ ಭವ ಭವದಿ ಭತಿಸಿಹೇ ಭವತಿ ದೂರ ನಿತ್ಯವೂ ಅವತರಿಪ ಸತ್ಯಾವತಾರ || ಬಾ ಇಲ್ಲಿ ||" spaceship = _________ (=========) |=========| |====_====| |== / \ ==| |= / _ \ =| _ |=| ( ) |=| /=\ |=|,,,,,|=| /=\ |=| |=| USA |=| |=| |=| |=|,,,,,|=| |=| |=| |=| | | |=| |=| |=| |=| | | |=| |=| |=| |=| | | |=| |=| |=| |/ | | \| |=| |=|/ | | \|=| |=/NASA |_| NASA\=| |(_______________)| |=| |_|__|__|_| |=| |=| ( ) ( ) |=| /===\ /===\ ||||||| ||||||| ------- ------- (""") (""") size = Big qty = 5.7
Still another approach using your original "solution" in PHP following the idea of a Micro-Service.
Instead of a mimic of what PHP might do I use it directly for this purpose.
That way more sophisticated functionalities that can be used without recoding.
I extended your test to include doubled double quotes
ABC Company,"123 Main St, Ste 102","Anytown, DC",10001,234-567-8901,"hi ""rcc"" was here"
USER>set reply=$$^phpCSV(%dstr) write reply,!! zwrite reply
ABC Company 123 Main St, Ste 102 Anytown, DC 10001 234-567-8901 hi "rcc" was here
reply="ABC Company"_$c(9)_"123 Main St, Ste 102"_$c(9)_"Anytown, DC"_$c(9)_"10001"_$c(9)_"234-567-8901"_$c(9)_"hi ""rcc"" was here"
*
and here the code:
#define php "............\php.exe " ; add location of php.exe
#define pipe "|CPIPE|1"
set file="myTest.php"
open file:"WN" use file
write "<?php "
,!,"$str='"_str_"';"
,!,"$strtotab = implode('\t', str_getcsv($str, ','));"
,!,"print_r($strtotab);"
,!,"?>",!
close file
open $$$pipe:$$$php_file
use $$$pipe read result
close $$$pipe
use 0 ; write result
quit $replace(result,"\t",$c(9))
}
I had a similar requirement, so created a simple method to achieve this. May be someone else is also looking for something similar
ClassMethod CreateCSVArray( csvrec = "", Output csvarray) { s csvarray=0 i csvrec'=""{ s csvlen=$l(csvrec) i csvlen>1{ s token="" f i=1:1:csvlen{ s char=$we(csvrec,i) i char=""""{ s closequotepos=$find(csvrec,"""",i+1) s csvarray($i(csvarray))=$we(csvrec,i+1,closequotepos-2) s i=closequotepos s token="" }else{ i char'=","{ s token=token_char }else{ s csvarray($i(csvarray))=token s token="" } } } s:token'="" csvarray($i(csvarray))=token } } }
Which produced the output:
myarr=5
myarr(1)="ABC Company"
myarr(2)="123 Main St, Ste 102"
myarr(3)="Anytown, DC"
myarr(4)=10001
myarr(5)="234-567-8901"