Breaking a string into words

Question

Anna Golitsyna · Apr 13, 2023

Is there any ObjectScript or a basic function that takes a string and separates it into words + punctuation/spaces array/list? Just not to reinvent the wheel. Say, process "It is a test, after all" into "It", space, "is", space, "a", space, "test", ", ", "after", space, "all". Or something to that effect.

Discussion (9)

Anna Golitsyna · May 10, 2023

That's the code I ended up with. Thanks for your help, everybody!

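; str is parsed into two arrays, words and separators (spaces and punctuation)
; wordAr(n) is the n-th word; sepAr(n) is the separator that follows it (sepAr(0) is any leading separator)
; Trim leading and trailing spaces here if needed
S L=$L(str),(currWord,currSep)="",cnt=0
F i=1:1:L {
    S currChar=$E(str,i,i)
    I $MATCH(currChar,"\w") {
        ; word character: extend the current word and close any pending separator
        S currWord=currWord_currChar
        I currSep'="" {
            S sepAr(cnt)=currSep,currSep=""
        }
    }
    ELSE {
        ; non-word character: extend the current separator and close any pending word
        S currSep=currSep_currChar
        I currWord'="" {
            S cnt=cnt+1,wordAr(cnt)=currWord,currWord=""
        }
    }
}
; Flush whatever is still pending at the end of the string, otherwise the last word or separator is lost
I currWord'="" S cnt=cnt+1,wordAr(cnt)=currWord
I currSep'="" S sepAr(cnt)=currSep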

score 1 · Answer 1 · 2023-04-13T18:01:39-04:00

This could be a practical use of Embedded Python.
https://www.w3schools.com/python/ref_string_split.asp

It takes a little more typing if you use the $LOCATE function instead.
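
As a rough illustration of the Embedded Python idea: str.split() from the link above works with a single separator, so the minimal sketch below uses Python's standard re module instead, so that all spaces and punctuation are handled at once. The function name tokenize is made up for this example, and the splitting rule (runs of word characters versus runs of everything else) is only one possible reading of the question.

```python
import re

def tokenize(text):
    """Split text into alternating runs of word and non-word characters."""
    # \w+ matches a run of word characters; \W+ matches a run of anything else
    # (spaces and punctuation), so no character of the input is dropped.
    return re.findall(r"\w+|\W+", text)

print(tokenize("It is a test, after all"))
# ['It', ' ', 'is', ' ', 'a', ' ', 'test', ', ', 'after', ' ', 'all']
```

Joining the returned list gives back the original string. In IRIS, a function like this could presumably live in a class method declared with [ Language = python ].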

score 0 · Answer 2 · 2023-04-14T10:07:43-04:00

It is based on a specific separator as opposed to all regular text separators at the same time, but it is an interesting approach to keep in mind. I was thinking about $LOCATE as well.

score 1 · Answer 3 · 2023-04-14T06:52:15-04:00

Depending on the fidelity you need, something like this would work:

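set str = "abc def!  xyz"
// ObjectScript has no backslash escapes; only the double quote is doubled
set punctuation = "!""#$%&'()*+,-./:;<=>?@[\]^_`{|}~"
// replace every punctuation character with a space
set strNoPunctuation = $tr(str, punctuation, $j("", $l(punctuation)))
// strip leading, trailing, and repeated whitespace
set strDedupeWhitespaces = $zstrip(strNoPunctuation,"<=>P")
set out = $lfs(strDedupeWhitespaces, " ")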

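Another approach. It is simpler and likely faster, but because it deletes punctuation instead of replacing it with spaces, it merges words wherever a sentence end is not followed by whitespace (for example, "end.Next" becomes "endNext"):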

set str = "abc def!  xyz"
// remove all punctuation characters except spaces
set strNoPunctuation = $zstrip(str,"*P",," ")
// strip leading, trailing, and repeated whitespace
set strDedupeWhitespaces = $zstrip(strNoPunctuation,"<=>P")
set out = $lfs(strDedupeWhitespaces, " ")

Check $translate, $zstrip.

If you want more fidelity/features check %iKnow.Stemming package.
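
The two ObjectScript snippets in this answer differ in whether punctuation is replaced with spaces or deleted. For anyone going the Embedded Python route, a small sketch of the same two strategies is shown here using only the standard library. The function names are invented for this example, and string.punctuation covers ASCII punctuation only, which may or may not be enough fidelity for your data.

```python
import string

def words_replace(text):
    # Map every ASCII punctuation character to a space, then split on whitespace runs
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table).split()

def words_strip(text):
    # Delete ASCII punctuation outright, then split on whitespace runs
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).split()

print(words_replace("abc def!  xyz end.Next"))  # ['abc', 'def', 'xyz', 'end', 'Next']
print(words_strip("abc def!  xyz end.Next"))    # ['abc', 'def', 'xyz', 'endNext']
```

The second result shows the merging effect mentioned above: with the punctuation deleted, "end.Next" collapses into a single word.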

score 0 · Answer 4 · 2023-04-14T10:15:55-04:00

Anna Golitsyna Apr 14, 2023 to Eduard Lebedyuk

Very interesting, thanks!

score 2 · Answer 5 · 2023-04-14T08:30:57-04:00

I found a tokenize method in one of the %DeepSee packages:

USER>w ##class(%DeepSee.extensions.utils.StringMatchUtils).tokenize("this is a string blah blah.",.tokenArray)
1
USER>w

tokenArray=6
tokenArray(1)="this"
tokenArray(2)="is"
tokenArray(3)="a"
tokenArray(4)="string"
tokenArray(5)="blah"
tokenArray(6)="blah"
USER>

score 0 · Answer 6 · 2023-04-14T10:16:53-04:00

Anna Golitsyna Apr 14, 2023 to Marc Mundt

Good to know!

score 1 · Answer 7 · 2023-04-14T10:33:08-04:00

Plus one for the $LFS approach. (I wrote this before noticing it was in Eduard's more complete solution.)
If you don't need to keep the spaces because they're the delimiter, I find that $LFS works as a quick-and-dirty solution. It depends on what you want the punctuation for. You can put the spaces back in when you do something with the list, if you need them.

score 0 · Answer 8 · 2023-05-10T10:14:54-04:00

You could also use a regular expression with the %Regex.Matcher class

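set regex = ##class(%Regex.Matcher).%New("(\w)*")

The "\w" refers to any word character (alphabetic, numeric, and connecting characters). It is wrapped in a grouping expression "()", and the final "*" says to match 0 or more occurrences. Since "*" can also match an empty string, "(\w+)" (one or more) may be the more practical pattern for pulling out whole words.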

You can then examine the GroupCount and Group multidimensional properties to see the results.