R: Parse a string by 3 consecutive letters-CodePudding

I have item codes and descriptions in the same column. The item codes can have letters, numbers, spaces and special characters. The description always starts with at least 4 letters. So I wanted to split the column at the space where the 4 letters begin. I converted the text field to lowercase

x <- c('1234 (a)-b free vacation to aruba',
       '1234:43-1b free set of dishes')

Ideally this would yield:

itemCode         itemDesctiption
1234 (a)-b       free vacation to aruba
1234:43-1b       free set of dishes

I tried splitting by spaces

[c('a', 'b', 'c', 'd', 'e', 'f')] <- str_split_fixed(x, ' ', 6)

but of course since spaces are sometimes embedded, it doesn't give me what I want.

I reviewed similar question, which are close, but not exactly what I am looking for.

CodePudding user response：

You could achieve this in base R using strsplit with a lookahead to get the item code, then to get the description remove the item code from the original string using sub:

x <- c('1234 (a)-b free vacation to aruba',
       '1234:43-1b free set of dishes')

a <- sapply(strsplit(x, '(?=[a-z]{4})', perl = TRUE), function(x) x[1])
b <- unlist(Map(function(a, b) sub(a, "", b, fixed = TRUE), a, x))

data.frame(itemCode = a, itemDescription = b, row.names = NULL)
#>      itemCode        itemDescription
#> 1 1234 (a)-b  free vacation to aruba
#> 2 1234:43-1b      free set of dishes

There's a slight caveat that [a-z]{4} will only work if the first 4 letters don't contain letters outside the standard 26 symbols in this set (for example, accented letters).

^{Created on 2023-01-21 with reprex v2.0.2}

CodePudding user response：

Please check the below code with str_extract

data.frame(x=c('1234 (a)-b free vacation to aruba',
       '1234:43-1b free set of dishes')) %>% 
mutate(itemCode=str_extract_all(trimws(x), '\\d .*[\\-|\\d]\\w\\s(?=\\w{4})'),
itemDesctiption=str_extract_all(trimws(x), '\\s\\w{4}\\s.*'))

^{Created on 2023-01-21 with reprex v2.0.2}

                                  x    itemCode         itemDesctiption
1 1234 (a)-b free vacation to aruba  1234 (a)-b   free vacation to aruba
2     1234:43-1b free set of dishes  1234:43-1b       free set of dishes

CodePudding user response：

This solution is based on tidyr's function extract:

library(tidyr)
library(dplyr)
data.frame(x) %>%
  extract(x,
          into = c("itemCode", "itemDescription"),
          regex = "([()0-9a-z-] )[\\s-] ([a-z]{4,}\\s.*)"
  )
            itemCode                                                   itemDescription
1           04(4)(a)                                          vacation - 2-3 weeks obo
2         230(11)(a)                                          cars - - 18 plus winners
3                073                                                boxes of choclates
4         130(11)(a)                   wont be offering -- too expensive - see details
5      23-3057(a)(5)                     grand prize / cruise for >= 18 year old (min)
6             33-314                                      choice of prizes & $500 cash
7  656-2-316(a)(iii) free books < 100 / price < 13 & 27 dollars. / or choice of prizes
8           231-5510                                           airfare (more than 200)
9           5er20c1a                                               prizes (under $500)
10             520g2                                                   prizes over 500
11         35-42-4-9                                                   prizes 250 plus
12        130(11)(b)                                                  retired category

How the regex works: essentially, the strings in x are divided into two capture groups containing what is to be extracted into the two new columns:

(.*[()0-9abc-]): the 1st capture group; here we assert that the group ends on any of parenthesis, digits, dash, or a, b, or c (please adapt as you see fit!)
\\s: one whitespace (not captured)
([a-z]{4,}\\s.*): 2nd capture group. here we assert that there must be at least 4 lower-case letters followed by one whitespace and more characters

EDIT 1:

check out:

regex = "([()0-9a-z-] )[\\s-] (.*)"

seems to work too!

EDIT 2:

based on the observation that itemCode is never interrupted by whitespace, this works too:

regex = "(\\S )[\\s-] (.*)"