How can you split a string in R without removing the characters you want to split after?

I have a character vector made up of various strings, most of which need to be split out into their own index within the vector. I've searched for ways to split after a certain pattern (4 consecutive digits in this case), without losing the 4 digits. I've found answers related to C#, Java, and in R related to a data frame, but nothing helpful for a character vector.

Here is what my output looks like:

> text
 [1] "LA 2028 ITALY 2008 AUSTRALIA 1988 B MEXICO 1968 GREECE 1948"                                                                                           
 [3] "LONDON 2024 GREECE 2004 LOS ANGELES 1984 SINGAPORE 1964 A BERLIN 1936"                                                                                  
 [4] "CHINA 2020 NEW ZEALAND 2000 MOSCOW 1980 S MEXICO 1960 CHILE 1932"            
 [5] " MELBOURNE"                                                                                                                                          
 [6] "LA 2016 LONDON 1996 MOSCOW 1976 MELBOURNE 1956 ROME 1928"
 [7] " ANTWERP 2012 SPAIN 1992 LONDON 1972 US 1952 NEW ZEALAND 1924"           
 [8] " MEXICO 1920 CHINA 1896"                                                  
 [9] " JAPAN 1912"                                                                                                                                    
[10] " MANILA 1908"                                                                                                                                        
[11] "ST. LOUIS 1904"                                                                                                                                      
[12] " SWEDEN 1900" 

I need to split all the strings so that they look like this (just showing the first few lines as an example):

 [1] "LA 2028" 
 [2] "ITALY 2008" 
 [3] "AUSTRALIA 1988" 
 [4] "B MEXICO 1968" 
 [5] "GREECE 1948"                                                                                           

I have tried using `strsplit`:

text <- strsplit(text, "[[:digit:]]{4}")

but that removes the digits. I have also tried multiple ways of adding a comma after any 4 consecutive digits, to give me something to split on without needing to keep the character I split at.

Any ideas?

CodePudding user response：

We can use a lookbehind assertion in a Perl-compatible regex to achieve what you want:

a <- "LA 2028 ITALY 2008 AUSTRALIA 1988 B MEXICO 1968 GREECE 1948"
strsplit(a, "(?<=[0-9]{4})\\s", perl = TRUE)

[[1]]
[1] "LA 2028"        "ITALY 2008"     "AUSTRALIA 1988" "B MEXICO 1968"  "GREECE 1948"

The lookbehind `(?<=[0-9]{4})` asserts that the whitespace is preceded by four digits without consuming them, so the split removes only the whitespace and the digits stay in the output.
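The question's data is a character vector rather than a single string. Here is a minimal sketch of applying the same split to a whole vector, assuming a shortened three-element sample of `text`. The calls to `unlist()` and `trimws()` are additions for flattening and tidying the result, not part of the answer above:

```r
# Shortened sample of the question's vector (three of its elements)
text <- c(
  "LA 2028 ITALY 2008 AUSTRALIA 1988 B MEXICO 1968 GREECE 1948",
  " MELBOURNE",
  " MEXICO 1920 CHINA 1896"
)

# strsplit() is vectorised and returns a list with one element per input string;
# unlist() flattens that list back into a single character vector
parts <- unlist(strsplit(text, "(?<=[0-9]{4})\\s", perl = TRUE))

# Remove the leading spaces that some of the original entries carry
parts <- trimws(parts)
print(parts)
```

Entries without a year, such as " MELBOURNE", pass through unsplit, so nothing is lost from the vector.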

CodePudding user response：

For a more robust/safer approach than string splitting, you could phrase your problem as a regex find-all on the pattern `\w+(?: \w+)*? \d{4}`:

x <- "LA 2028 ITALY 2008 AUSTRALIA 1988 B MEXICO 1968 GREECE 1948"
m <- regmatches(x, gregexpr('\\w+(?: \\w+)*? \\d{4}', x, perl = TRUE))
m[[1]]

[1] "LA 2028"        "ITALY 2008"     "AUSTRALIA 1988" "B MEXICO 1968"
[5] "GREECE 1948"
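One caveat is worth checking against the question's data. A find-all only returns what the pattern matches, so entries without a year, or with non-word characters such as the period in "ST. LOUIS", may be dropped or truncated. A small sketch, assuming the pattern `\w+(?: \w+)*? \d{4}` with `perl = TRUE` and two entries taken from the question's vector:

```r
# Two entries taken from the question's vector
x <- c(" MELBOURNE", "ST. LOUIS 1904")

# Find every "word(s) followed by a 4-digit year" match in each element
m <- regmatches(x, gregexpr("\\w+(?: \\w+)*? \\d{4}", x, perl = TRUE))

# " MELBOURNE" has no year, so it yields no match at all;
# "ST." ends in a period, which \w does not match, so only "LOUIS 1904" is found
found <- unlist(m)
print(found)
```

If such entries matter, the pattern would need to be widened, or a split-based approach used instead.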

CodePudding user response：

An easy way would be to first add an extra space after each group of four digits and then split on every double space:

mystring <- "LA 2028 ITALY 2008 AUSTRALIA 1988 B MEXICO 1968 GREECE 1948"
strsplit(gsub("( \\d{4} )", "\\1 ", mystring), "  ")
[[1]]
[1] "LA 2028"        "ITALY 2008"     "AUSTRALIA 1988" "B MEXICO 1968"  "GREECE 1948"

CodePudding user response：

You mention adding a comma after the digits to use as a delimiter. While this is probably not the preferred way (unlike the other solutions proposed), it would work if you use a backreference, like so (with `x` holding the example string):

strsplit(gsub("(\\d{4})", "\\1,", x), ", ?")[[1]]
[1] "LA 2028"        "ITALY 2008"     "AUSTRALIA 1988" "B MEXICO 1968"  "GREECE 1948"