I have a character vector made up of various strings, most of which need to be split out into their own index within the vector. I've searched for ways to split after a certain pattern (4 consecutive digits in this case), without losing the 4 digits. I've found answers related to C#, Java, and in R related to a data frame, but nothing helpful for a character vector.
Here is what my output looks like:
> text
[1] "LA 2028 ITALY 2008 AUSTRALIA 1988 B MEXICO 1968 GREECE 1948"
[3] "LONDON 2024 GREECE 2004 LOS ANGELES 1984 SINGAPORE 1964 A BERLIN 1936"
[4] "CHINA 2020 NEW ZEALAND 2000 MOSCOW 1980 S MEXICO 1960 CHILE 1932"
[5] " MELBOURNE"
[6] "LA 2016 LONDON 1996 MOSCOW 1976 MELBOURNE 1956 ROME 1928"
[7] " ANTWERP 2012 SPAIN 1992 LONDON 1972 US 1952 NEW ZEALAND 1924"
[8] " MEXICO 1920 CHINA 1896"
[9] " JAPAN 1912"
[10] " MANILA 1908"
[11] "ST. LOUIS 1904"
[12] " SWEDEN 1900"
I need to split all the strings so that they look like this (showing only the first few lines as an example):
[1] "LA 2028"
[2] "ITALY 2008"
[3] "AUSTRALIA 1988"
[4] "B MEXICO 1968"
[5] "GREECE 1948"
I have tried using `strsplit`:
text <- strsplit(text, "[[:digit:]]{4}")
but this drops the digits, because the matched pattern is consumed as the separator. I have also tried several ways of inserting a comma after any 4 consecutive digits, to give me something to split on without losing the characters I need to keep.
Any ideas?
CodePudding user response:
We can use a lookbehind assertion with Perl-compatible regex to achieve what you want:
a <- "LA 2028 ITALY 2008 AUSTRALIA 1988 B MEXICO 1968 GREECE 1948"
strsplit(a, "(?<=[0-9]{4})\\s", perl = TRUE)
[[1]]
[1] "LA 2028" "ITALY 2008" "AUSTRALIA 1988" "B MEXICO 1968" "GREECE 1948"
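The lookbehind assertion `(?<=[0-9]{4})` matches a whitespace character only when it is immediately preceded by 4 digits, and the string is split at that whitespace. The digits are kept because the lookbehind itself consumes no characters. This gives your desired output.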
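Since the question is about a whole character vector rather than a single string, here is a minimal sketch of how the same split could be applied to every element at once. The vector `text` below holds only a few elements copied from the question, and the `trimws()` call and the `\\s+` quantifier are my own additions (assumptions about the desired result) to cope with the leading spaces in some elements.

```r
# A few elements copied from the question's vector (not the full data)
text <- c(
  "LA 2028 ITALY 2008 AUSTRALIA 1988 B MEXICO 1968 GREECE 1948",
  " MELBOURNE",
  " MEXICO 1920 CHINA 1896",
  "ST. LOUIS 1904"
)

# trimws() drops leading/trailing whitespace (assumed to be unwanted);
# strsplit() is vectorised and returns one list element per input string
pieces <- strsplit(trimws(text), "(?<=[0-9]{4})\\s+", perl = TRUE)

# unlist() flattens the list back into a single character vector
result <- unlist(pieces)
result
```

Elements without a year, such as " MELBOURNE", contain no split point and simply pass through as a single piece.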
CodePudding user response:
For a more robust/safer approach than string splitting, you could phrase your problem as a regex find-all on the pattern `\w+(?: \w+)*? \d{4}`:
x <- "LA 2028 ITALY 2008 AUSTRALIA 1988 B MEXICO 1968 GREECE 1948"
# one or more words, matched lazily, followed by a space and a 4-digit year
m <- regmatches(x, gregexpr('\\w+(?: \\w+)*? \\d{4}', x, perl = TRUE))
m[[1]]
[1] "LA 2028" "ITALY 2008" "AUSTRALIA 1988" "B MEXICO 1968"
[5] "GREECE 1948"
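One caveat worth checking against the full data: a find-all built on `\\w+` only accepts word characters, so a name containing punctuation (like "ST. LOUIS" in the question) may be cut short, and elements with no year at all produce no match. The sketch below compares the `\\w+` pattern with a looser `\\S+` variant. The `\\S+` pattern is my own hypothetical alternative, not something taken from the question, and the small vector `x` reuses three elements from the question's output.

```r
x <- c("ST. LOUIS 1904", " MELBOURNE", "LA 2016 LONDON 1996")

# \w does not match the period in "ST.", so the match can only start at "LOUIS"
strict <- unlist(regmatches(x, gregexpr("\\w+(?: \\w+)*? \\d{4}", x, perl = TRUE)))

# Hypothetical looser variant: \S+ accepts any run of non-whitespace characters
loose <- unlist(regmatches(x, gregexpr("\\S+(?: \\S+)*? \\d{4}", x, perl = TRUE)))

strict
loose
```

With either pattern, " MELBOURNE" is dropped from the result because it contains no 4-digit year, which differs from the splitting approaches, where such elements are kept.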
CodePudding user response:
An easy way would be to first add an extra space after each group of four digits and then split at every run of two spaces:
mystring <- "LA 2028 ITALY 2008 AUSTRALIA 1988 B MEXICO 1968 GREECE 1948"
# " {2}" is a regex for exactly two consecutive spaces
strsplit(gsub("( \\d{4} )", "\\1 ", mystring), " {2}")
[[1]]
[1] "LA 2028" "ITALY 2008" "AUSTRALIA 1988" "B MEXICO 1968" "GREECE 1948"
CodePudding user response:
You mention adding a comma after the digits to have something to split on. While this is probably not the preferred way (unlike the other solutions proposed), it works if you use a backreference, like so (with `x` being the same example string used in the other answers):
strsplit(gsub("(\\d{4})", "\\1,", x), ", ?")[[1]]
[1] "LA 2028" "ITALY 2008" "AUSTRALIA 1988" "B MEXICO 1968" "GREECE 1948"