Breaking a string into words
Is there an ObjectScript or other built-in function that takes a string and splits it into an array/list of words plus punctuation/spaces? I just don't want to reinvent the wheel. Say, process "It is a test, after all" into "It", space, "is", space, "a", space, "test", ", ", "after", space, "all". Or something to that effect.
Could be a practical use of embedded python.
https://www.w3schools.com/python/ref_string_split.asp
It takes a little more typing if you use the $LOCATE function.
It is based on a specific separator as opposed to all regular text separators at the same time, but it is an interesting approach to keep in mind. I was thinking about $LOCATE as well.
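For the embedded Python route, a regular expression gets around the single-separator limitation of str.split. The following is only a minimal sketch, assuming the goal is the word/separator list from the original question; the function name split_words_and_separators is made up for illustration.

```python
import re

def split_words_and_separators(text):
    """Return alternating runs of word characters and non-word characters."""
    # \w+ matches a run of letters, digits, or underscores;
    # \W+ matches a run of everything else (spaces, punctuation).
    return re.findall(r"\w+|\W+", text)

print(split_words_and_separators("It is a test, after all"))
# ['It', ' ', 'is', ' ', 'a', ' ', 'test', ', ', 'after', ' ', 'all']
```

Because every character lands in exactly one run, joining the list back together reproduces the input string.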
Depending on the fidelity you need, something like this would work:
set str = "abc def! xyz"
set punctuation = "'!""#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~"
set strNoPuncuation = $tr(str, punctuation, $j("", $l(punctuation)))
set strDedupeWhitespaces = $zstrip(strNoPuncuation,"<=>P")
set out = $lfs(strDedupeWhitespaces, " ")
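Another approach. It is simpler and likely faster, but because it deletes punctuation instead of replacing it with a space, it will merge the words on either side of a sentence end that has no whitespace after it (e.g. "def!xyz" becomes "defxyz"):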
set str = "abc def! xyz"
set strNoPuncuation = $zstrip(str,"*P",," ")
set strDedupeWhitespaces = $zstrip(strNoPuncuation,"<=>P")
set out = $lfs(strDedupeWhitespaces, " ")
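To make the difference between the two approaches concrete, here is a rough Python analogue, not the ObjectScript itself. It assumes ASCII punctuation only (Python's string.punctuation), which may not match exactly what $zstrip treats as punctuation, and both function names are hypothetical.

```python
import string

def words_punct_to_space(text):
    # Map each ASCII punctuation character to a space, then split on whitespace runs.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table).split()

def words_punct_deleted(text):
    # Delete ASCII punctuation outright, then split on whitespace runs.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).split()

print(words_punct_to_space("abc def!xyz"))  # ['abc', 'def', 'xyz']
print(words_punct_deleted("abc def!xyz"))   # ['abc', 'defxyz']
```

The second variant shows the merging caveat: with no space after the "!", "def" and "xyz" end up as one word.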
Check $translate, $zstrip.
If you want more fidelity/features check %iKnow.Stemming package.
Very interesting, thanks!
I found a tokenize method in one of the %DeepSee packages:
Good to know!
Plus one for the $LFS approach. (I wrote this before noticing it was in Eduard's more complete solution.)
If you don't need to keep the spaces, because they're the delimiter, I find that $LFS works as a quick and dirty solution. It depends on what you want the punctuation for. You can put the spaces back in when you do something with the list, if you need them.
You could also use a regular expression with the %Regex.Matcher class:
set regex = ##Class(%Regex.Matcher).%New("(\w)*")
The "\w" refers to any word character (alphabetic, numeric, and connecting characters). It is wrapped in a grouping expression "()", and finally the "*" says to match 0 or more occurrences.
You can then examine the GroupCount and Group multidimensional properties to see the results.
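The match-then-inspect loop can be sketched with Python's re module. This is a loose analogue for illustration, not the %Regex.Matcher API. It uses the pattern \w+ (one or more word characters) so that empty matches are skipped, and the name locate_words is hypothetical.

```python
import re

def locate_words(text):
    """Yield (start_offset, word) for every run of word characters."""
    # finditer walks the string match by match; group() is the matched text.
    for match in re.finditer(r"\w+", text):
        yield match.start(), match.group()

print(list(locate_words("abc def! xyz")))
# [(0, 'abc'), (4, 'def'), (9, 'xyz')]
```

Offsets here are zero-based, as usual in Python.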
That's the code I ended up with. Thanks for your help, everybody!
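 ; Trim leading and trailing spaces here if needed
 S L=$L(str),(currWord,currSep)="",cnt=0
 F i=1:1:L {
     S currChar=$E(str,i,i)
     I $MATCH(currChar,"\w") {
         ; Word character: extend the current word and store any pending separator
         S currWord=currWord_currChar
         I currSep'="" {
             S sepAr(cnt)=currSep,currSep=""
         }
     }
     ELSE {
         ; Separator character: extend the current separator and store any pending word
         S currSep=currSep_currChar
         I currWord'="" {
             S cnt=cnt+1,wordAr(cnt)=currWord,currWord=""
         }
     }
 }
 ; Flush the last word (and any trailing separator) once the loop ends
 I currWord'="" {
     S cnt=cnt+1,wordAr(cnt)=currWord
 }
 I currSep'="" {
     S sepAr(cnt)=currSep
 }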
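For anyone who wants to try the same character-by-character scan from embedded Python, here is a rough port. It is a sketch under assumptions, not the poster's code: the name tokenize is hypothetical, separators are kept in a dict keyed by the number of words seen so far (mirroring the sepAr subscripts), and the final word and any trailing separator are flushed after the loop, since the input may end in the middle of a run.

```python
import re

def tokenize(text):
    """Split text into word runs and the separator runs between them."""
    words, seps = [], {}
    curr_word, curr_sep = "", ""
    for ch in text:
        if re.fullmatch(r"\w", ch):
            curr_word += ch
            if curr_sep:
                # A separator run just ended; file it under the word count so far.
                seps[len(words)] = curr_sep
                curr_sep = ""
        else:
            curr_sep += ch
            if curr_word:
                # A word run just ended; store it.
                words.append(curr_word)
                curr_word = ""
    # Flush whatever is still pending when the input ends.
    if curr_word:
        words.append(curr_word)
    if curr_sep:
        seps[len(words)] = curr_sep
    return words, seps

words, seps = tokenize("It is a test, after all")
print(words)  # ['It', 'is', 'a', 'test', 'after', 'all']
print(seps)   # {1: ' ', 2: ' ', 3: ' ', 4: ', ', 5: ' '}
```

A separator stored under key n follows word number n (counting from 1), so key 0 would hold a leading separator.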