Home > Enterprise >  Regex match the last occurrence of a character in a string
Regex match the last occurrence of a character in a string

Time:10-12

I would like to match the last occurrence of "," in a string to be able to separate a column properly.

Sample data

tibble(
  name = c("John", "James"), 
  company_num = c("Apple, Inc, 1000",
                  "Microsoft, 1200")
)

The desired output:

# A tibble: 2 x 3
  name  company      num
  <chr> <chr>      <dbl>
1 John  Apple, Inc  1000
2 James Microsoft   1200

What I have tried

df %>% 
  separate(col = company_num, 
           into = c("company", "num"), 
           sep = ", ")

CodePudding user response:

With the stringr package and regex, in particular positive lookahead or lookbehind, with (?=lookahead) and (?<=lookbehind).

Assuming your dataframe or tibble is df:

df %>% mutate(

company = stringr::str_extract(company_num, ".*(?=,.*$)"),
#  ↑ capture the text BEFORE a comma that is followed by one or several character ".*" at the end of the string "$"

num = stringr::str_extract(company_num, "(?<=, )[0-9]*$")) %>%
# ↑ capture one or several digits at the end of the string ($), preceded by ',' 

select(-company_num)
# ↑ drop the "company_num" col

CodePudding user response:

With the correct regex, you can use separate in a single line.

This put the comma in a capture group (, ), where this comma cannot have another comma behind it (?!.*,).

library(tidyr)
library(dplyr)

df %>% separate(company_num, into = c("company", "num"), sep = "(, )(?!.*,)")

# A tibble: 2 × 3
  name  company    num  
  <chr> <chr>      <chr>
1 John  Apple, Inc 1000 
2 James Microsoft  1200 

CodePudding user response:

I'm sure there's a one-line tidyverse solution here but here's a base R method:

# Get location of last comma
widths  <- regexpr(",[^,]*$", df$company_num)

# Treat company_num as fixed width format file
df[c("Company", "Number")]  <- read.fwf(
    textConnection(df$company_num), 
    widths,
    col.names = c("Company", "Number")
)

# Remove trailing comma/space
df$Company  <- sub(",\\s?$", "", df$Company)

df

# A tibble: 2 x 4
#   name  company_num      Company    Number
#   <chr> <chr>            <chr>       <int>
# 1 John  Apple, Inc, 1000 Apple, Inc   1000
# 2 James Microsoft, 1200  Microsoft    1200

Update - tidyverse solution

If it must be tidyverse, then borrowing heavily from the answer to this question, you could do:

df |>
    separate(
        col = company_num, 
        into = c("company", "num"), 
        sep = "(?=[[:digit:]])", extra="merge"
    ) |>
    mutate(
        company = sub(",\\s?$", "", company),
        num = as.numeric(num)
    )

# # A tibble: 2 x 3
#   name  company      num
#   <chr> <chr>      <dbl>
# 1 John  Apple, Inc  1000
# 2 James Microsoft   1200

CodePudding user response:

You can use sub like.

data.frame(x[1],
           company = sub(",[^,]*$", "", x$company_num),
           num = sub(".*, ", "", x$company_num))
#   name    company  num
#1  John Apple, Inc 1000
#2 James  Microsoft 1200

Data

x <- data.frame(name = c("John", "James"), 
                company_num = c("Apple, Inc, 1000", "Microsoft, 1200") )
  • Related