I was working on a language parser and wanted to count certain string elements (say "</i>") in a larger string. Since the string has been cleansed (str.trim), it doesn't have any trailing content. I was getting some weird behavior from strsplit: it seems to behave differently depending on whether the separator sep (called split in the manual) is at the beginning or at the end of the string.
Below is an example:
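# Three test strings: no padding, a leading space, a trailing space
str1 = "<i>hello friend</i>"
str2 = paste0(" ", str1)
str3 = paste0(str1, " ")
sep1 = "<i>"
sep2 = "</i>"
str = c(str1, str2, str3); n = length(str)
sep = c(sep1, sep2); ns = length(sep)

# Split every string on every separator with base::strsplit,
# joining the pieces with "|" so empty strings stay visible
base = matrix("", nrow = n, ncol = ns)
rownames(base) = str; colnames(base) = sep
for (i in 1:n)
{
    for (j in 1:ns)
    {
        base[i, j] = paste0(base::strsplit(str[i], sep[j], fixed = TRUE)[[1]], collapse = "|")
    }
}
base

# Same splits with stringi::stri_split_fixed (requires the stringi package)
stringi = matrix("", nrow = n, ncol = ns)
rownames(stringi) = str; colnames(stringi) = sep
for (i in 1:n)
{
    for (j in 1:ns)
    {
        stringi[i, j] = paste0(stringi::stri_split_fixed(str[i], sep[j])[[1]], collapse = "|")
    }
}
stringi

# Fails: the two matrices differ where the separator ends the string
stopifnot(identical(base, stringi))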
The output for base:
> base;
                     <i>                   </i>
<i>hello friend</i>  "|hello friend</i>"   "<i>hello friend"
 <i>hello friend</i> " |hello friend</i>"  " <i>hello friend"
<i>hello friend</i>  "|hello friend</i> "  "<i>hello friend| "
The output for stringi:
> stringi;
                     <i>                   </i>
<i>hello friend</i>  "|hello friend</i>"   "<i>hello friend|"
 <i>hello friend</i> " |hello friend</i>"  " <i>hello friend|"
<i>hello friend</i>  "|hello friend</i> "  "<i>hello friend| "
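The core difference is in column 2 ("</i>"), rows 1 and 2: when the separator sits at the very end of the string, base drops the trailing empty string, while stringi keeps it (shown by the trailing "|").
Question: What is E[strsplit], that is, the expected behavior of strsplit?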
Is base a FEATURE and stringi a BUG? Or vice versa?
Should not EOS (end of string) splits behave the same as BOS (beginning of string) splits?
> R.version
               _
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
crt            ucrt
system         x86_64, mingw32
status
major          4
minor          2.1
year           2022
month          06
day            23
svn rev        82513
language       R
version.string R version 4.2.1 (2022-06-23 ucrt)
nickname       Funny-Looking Kid
and
> packageVersion("stringi")
[1] ‘1.7.8’
>
Answer:
Well, I would say that the stringi behavior is the one that I, at least, would expect (and there you have the option to discard empty strings by setting omit_empty = TRUE).
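As a small sketch of that option (assuming the stringi package is installed; the variable names are only for illustration), the following shows the default result, the effect of omit_empty = TRUE at either end of the string, and a count that avoids splitting altogether:

```r
# Assumes the stringi package is installed (the question used version 1.7.8).
x <- "<i>hello friend</i>"

# Default: a separator at the end of the string yields a trailing empty string.
kept <- stringi::stri_split_fixed(x, "</i>")[[1]]

# omit_empty = TRUE discards empty strings, at the start as well as at the end.
dropped_end   <- stringi::stri_split_fixed(x, "</i>", omit_empty = TRUE)[[1]]
dropped_start <- stringi::stri_split_fixed(x, "<i>",  omit_empty = TRUE)[[1]]

# If the goal is only to count occurrences, stri_count_fixed does that directly.
n_close <- stringi::stri_count_fixed(x, "</i>")
```

With the default, kept has two elements (the second one empty), which matches the "<i>hello friend|" cell in the stringi output of the question.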
However, since base::strsplit clearly documents this behavior, it is also a "feature". From ?strsplit:
Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is ‘""’, but if there is a match at the end of the string, the output is the same as [the input] with the match removed.
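The documented asymmetry can be reproduced with base R alone. The sketch below also includes count_fixed(), a hypothetical helper (not part of base R) that counts occurrences with gregexpr, so the count does not depend on where in the string the match falls:

```r
x <- "<i>hello friend</i>"

# Match at the beginning: the first element of the result is "".
at_start <- strsplit(x, "<i>", fixed = TRUE)[[1]]

# Match at the end: no trailing "" is produced.
at_end <- strsplit(x, "</i>", fixed = TRUE)[[1]]

# count_fixed() is a hypothetical helper for a single string s:
# gregexpr returns -1 when there is no match, otherwise one start
# position per match.
count_fixed <- function(s, pattern) {
  m <- gregexpr(pattern, s, fixed = TRUE)[[1]]
  if (m[1] == -1L) 0L else length(m)
}

n_close <- count_fixed(x, "</i>")
```

Counting via length(strsplit(...)[[1]]) - 1 would undercount whenever the last match ends the string, which is why the helper looks at match positions instead.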
stringi provides a much more configurable interface at the expense of another dependency.