How to Split Columns in R Based on First Space

I have this code that splits the column on the second space, but I don't know how to modify it to split on the first space only. I'm not that familiar with regex.

library(tidyr)

df <- data.frame(Location = c("San Jose CA", "Fremont CA", "Santa Clara CA"))
separate(df, Location, into = c("city", "state"), sep = " (?=[^ ]+$)")

#          city state
# 1    San Jose    CA
# 2     Fremont    CA
# 3 Santa Clara    CA

CodePudding user response：

If you want to stick with separate, then try:

separate(df, Location, into=c("city", "state"), sep=" (?=[A-Z]{2}$)")

We can also try using sub here for a base R option:

df$city <- sub("\\s+[A-Z]{2}$", "", df$Location)
df$state <- sub("^.*\\s+", "", df$Location)
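
For a split at the first space, which is what the question asks about, a base R sketch along the same lines might look like the following. The column names first and rest are illustrative and do not come from the original post.

```r
df <- data.frame(Location = c("San Jose CA", "Fremont CA", "Santa Clara CA"))

# Drop everything from the first whitespace character onward
df$first <- sub("\\s.*$", "", df$Location)

# Drop the leading run of non-whitespace plus the whitespace that follows it
df$rest <- sub("^\\S+\\s+", "", df$Location)

df
```

Each call edits a copy of the Location column, so the original column stays in the data frame.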

CodePudding user response：

You can use

library(tidyr)
df <- data.frame(Location = c("San Jose CA", "Fremont CA", "Santa Clara CA"))
df_new <- separate(df, Location, into = c("city", "state"), sep = "^\\S*\\K\\s+")

Output:

> df_new
     city      state
1     San    Jose CA
2 Fremont         CA
3   Santa   Clara CA

The ^\S*\K\s+ regex matches

^ - start of string
\S* - zero or more non-whitespace chars
\K - match reset operator that discards the text matched so far from the overall match memory buffer
\s+ - one or more whitespace chars.
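
To see what the pattern actually consumes, one option is to inspect the match with base R's regexpr and perl = TRUE. This is only an illustrative check, not part of the tidyr solution; the variable names x, start, len, first and rest are made up for the example.

```r
x <- c("San Jose CA", "Fremont CA", "Santa Clara CA")

# perl = TRUE is needed because \K is a PCRE feature
m <- regexpr("^\\S*\\K\\s+", x, perl = TRUE)

start <- as.integer(m)          # where the reported match begins (after \K)
len <- attr(m, "match.length")  # how many whitespace chars were matched

# Cut each string around the matched whitespace
first <- substr(x, 1, start - 1)
rest <- substr(x, start + len, nchar(x))
```

Because \K resets the match start, only the whitespace run is reported as the match, so the first word is kept rather than consumed by the separator.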

NOTE: If your strings can have leading whitespace, and you want to ignore this leading whitespace, you can add \\s* right after ^ and use

sep = "^\\s*\\S+\\K\\s+"

Here, \S+ requires at least one non-whitespace char before the whitespace that the string is split on.
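
Note that the leading whitespace is skipped by the pattern but not removed, so it stays attached to the first piece. A minimal base R sketch of the same idea, with trimws applied to clean that piece up, could look like this. The helper split_first_space is hypothetical, and returning NA for strings without any whitespace is an assumption of this sketch.

```r
# Hypothetical helper: split each string at the first whitespace run that
# follows the first word, ignoring any leading whitespace
split_first_space <- function(x) {
  m <- regexpr("^\\s*\\S+\\K\\s+", x, perl = TRUE)
  pos <- as.integer(m)            # -1 where the pattern did not match
  len <- attr(m, "match.length")
  hit <- pos > 0

  first <- ifelse(hit, substr(x, 1, pos - 1), x)
  rest <- ifelse(hit, substr(x, pos + len, nchar(x)), NA_character_)

  # trimws() removes the leading whitespace left on the first piece
  data.frame(first = trimws(first), rest = rest, stringsAsFactors = FALSE)
}

res <- split_first_space(c("  San Jose CA", "Fremont CA", "Oakland"))
res
```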