Using R, how does strsplit work on fixed elements with the splitter at the end of the string to spli

Time:09-16

I was working on a language parser and wanted to count occurrences of certain string elements (say "</i>") in a larger string. Because the string has already been trimmed (str.trim), nothing follows the final element. I was getting some odd behavior from strsplit: it seems to behave differently depending on whether the separator sep (called split in the manual) is at the beginning or the end of the string.

Below is an example:

str1 = "<i>hello friend</i>"; 
str2 = paste0(" ",str1);
str3 = paste0(str1, " ");

sep1="<i>";
sep2="</i>";

str = c(str1, str2, str3);  n = length(str);
sep = c(sep1, sep2);        ns = length(sep);

base = matrix("", nrow=n, ncol=ns);
rownames(base) = str; colnames(base) = sep;
for(i in 1:n)
    {
    for(j in 1:ns)
        {
        base[i, j] = paste0(base::strsplit(str[i], sep[j], fixed=TRUE)[[1]], collapse="|");
        }   
    }
base;
    
stringi = matrix("", nrow=n, ncol=ns);
rownames(stringi) = str; colnames(stringi) = sep;
for(i in 1:n)
    {
    for(j in 1:ns)
        {
        stringi[i, j] = paste0(stringi::stri_split_fixed(str[i], sep[j])[[1]], collapse="|");
        }   
    }
stringi;

stopifnot(identical(base,stringi));

The output for base:

> base;
                     <i>                  </i>               
<i>hello friend</i>  "|hello friend</i>"  "<i>hello friend"  
 <i>hello friend</i> " |hello friend</i>" " <i>hello friend" 
<i>hello friend</i>  "|hello friend</i> " "<i>hello friend| "

The output for stringi:

> stringi;
                     <i>                  </i>               
<i>hello friend</i>  "|hello friend</i>"  "<i>hello friend|" 
 <i>hello friend</i> " |hello friend</i>" " <i>hello friend|"
<i>hello friend</i>  "|hello friend</i> " "<i>hello friend| "

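The difference is in COL=2 (the "</i>" separator) for ROW=1 and ROW=2, where the separator is the very last thing in the string: stringi returns a trailing empty string there and base does not.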

Question: What is the expected behavior of strsplit?

Is base a FEATURE and stringi a BUG? Or vice versa?

Shouldn't splits at the end of the string (EOS) behave the same as splits at the beginning of the string (BOS)?

> R.version
               _                                
platform       x86_64-w64-mingw32               
arch           x86_64                           
os             mingw32                          
crt            ucrt                             
system         x86_64, mingw32                  
status                                          
major          4                                
minor          2.1                              
year           2022                             
month          06                               
day            23                               
svn rev        82513                            
language       R                                
version.string R version 4.2.1 (2022-06-23 ucrt)
nickname       Funny-Looking Kid            

and

> packageVersion("stringi")
[1] ‘1.7.8’
> 

CodePudding user response:

Well, I would say that the stringi behavior is the one I would expect (and there you have the option to discard empty strings by setting omit_empty = TRUE).

However, since base::strsplit clearly documents the behavior it is also a "feature". From ?strsplit:

Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is ‘""’, but if there is a match at the end of the string, the output is the same as [the input] with the match removed.
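The documented asymmetry can be seen with base R alone. This is a minimal sketch that reuses the example string from the question; the variable names lead and trail are only for illustration.

```r
# Base R only: a match at the start yields a leading "",
# a match at the end yields no trailing "".
x <- "<i>hello friend</i>"

lead  <- strsplit(x, "<i>",  fixed = TRUE)[[1]]
trail <- strsplit(x, "</i>", fixed = TRUE)[[1]]

print(lead)   # two pieces; the first one is the empty string
print(trail)  # a single piece; nothing is appended for the match at the end
```

So counting pieces and subtracting one undercounts whenever the separator ends the string.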

stringi provides a much more configurable interface at the expense of another dependency.
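If the goal is only to count occurrences, as in the question, one way to avoid both the extra dependency and the end-of-string asymmetry is to count matches directly instead of counting split pieces. The sketch below assumes non-overlapping, literal matches are what is wanted; count_fixed is a hypothetical helper name, not part of base R or stringi.

```r
# Hypothetical helper: count literal, non-overlapping occurrences of
# `pattern` in a single string using base::gregexpr.
count_fixed <- function(text, pattern) {
  hits <- gregexpr(pattern, text, fixed = TRUE)[[1]]
  # gregexpr returns -1 when there is no match at all
  if (hits[1] == -1L) 0L else length(hits)
}

n_end  <- count_fixed("<i>hello friend</i>", "</i>")
n_two  <- count_fixed("a</i>b</i>", "</i>")
n_none <- count_fixed("hello friend", "</i>")
print(c(n_end, n_two, n_none))
```

Because this counts the matches themselves, it does not matter whether a match sits at the beginning, in the middle, or at the end of the string.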
