I would like to match the last occurrence of "," in a string to be able to separate
a column properly.
Sample data
tibble(
name = c("John", "James"),
company_num = c("Apple, Inc, 1000",
"Microsoft, 1200")
)
The desired output:
# A tibble: 2 x 3
name company num
<chr> <chr> <dbl>
1 John Apple, Inc 1000
2 James Microsoft 1200
What I have tried
df %>%
separate(col = company_num,
into = c("company", "num"),
sep = ", ")
CodePudding user response:
With the stringr
package and regex, in particular positive lookahead or lookbehind, with (?=lookahead)
and (?<=lookbehind)
.
Assuming your dataframe or tibble is df
:
df %>% mutate(
company = stringr::str_extract(company_num, ".*(?=,.*$)"),
# ↑ capture the text BEFORE a comma that is followed by one or several character ".*" at the end of the string "$"
num = stringr::str_extract(company_num, "(?<=, )[0-9]*$")) %>%
# ↑ capture one or several digits at the end of the string ($), preceded by ','
select(-company_num)
# ↑ drop the "company_num" col
CodePudding user response:
With the correct regex, you can use separate
in a single line.
This put the comma in a capture group (, )
, where this comma cannot have another comma behind it (?!.*,)
.
library(tidyr)
library(dplyr)
df %>% separate(company_num, into = c("company", "num"), sep = "(, )(?!.*,)")
# A tibble: 2 × 3
name company num
<chr> <chr> <chr>
1 John Apple, Inc 1000
2 James Microsoft 1200
CodePudding user response:
I'm sure there's a one-line tidyverse solution here but here's a base R method:
# Get location of last comma
widths <- regexpr(",[^,]*$", df$company_num)
# Treat company_num as fixed width format file
df[c("Company", "Number")] <- read.fwf(
textConnection(df$company_num),
widths,
col.names = c("Company", "Number")
)
# Remove trailing comma/space
df$Company <- sub(",\\s?$", "", df$Company)
df
# A tibble: 2 x 4
# name company_num Company Number
# <chr> <chr> <chr> <int>
# 1 John Apple, Inc, 1000 Apple, Inc 1000
# 2 James Microsoft, 1200 Microsoft 1200
Update - tidyverse solution
If it must be tidyverse, then borrowing heavily from the answer to this question, you could do:
df |>
separate(
col = company_num,
into = c("company", "num"),
sep = "(?=[[:digit:]])", extra="merge"
) |>
mutate(
company = sub(",\\s?$", "", company),
num = as.numeric(num)
)
# # A tibble: 2 x 3
# name company num
# <chr> <chr> <dbl>
# 1 John Apple, Inc 1000
# 2 James Microsoft 1200
CodePudding user response:
You can use sub
like.
data.frame(x[1],
company = sub(",[^,]*$", "", x$company_num),
num = sub(".*, ", "", x$company_num))
# name company num
#1 John Apple, Inc 1000
#2 James Microsoft 1200
Data
x <- data.frame(name = c("John", "James"),
company_num = c("Apple, Inc, 1000", "Microsoft, 1200") )