I have item codes and descriptions in the same column. The item codes can have letters, numbers, spaces and special characters. The description always starts with at least 4 letters. So I wanted to split the column at the space where the 4 letters begin. I converted the text field to lowercase
x <- c('1234 (a)-b free vacation to aruba',
'1234:43-1b free set of dishes')
Ideally this would yield:
itemCode itemDesctiption
1234 (a)-b free vacation to aruba
1234:43-1b free set of dishes
I tried splitting by spaces
[c('a', 'b', 'c', 'd', 'e', 'f')] <- str_split_fixed(x, ' ', 6)
but of course since spaces are sometimes embedded, it doesn't give me what I want.
I reviewed similar question, which are close, but not exactly what I am looking for.
CodePudding user response:
You could achieve this in base R using strsplit
with a lookahead to get the item code, then to get the description remove the item code from the original string using sub
:
x <- c('1234 (a)-b free vacation to aruba',
'1234:43-1b free set of dishes')
a <- sapply(strsplit(x, '(?=[a-z]{4})', perl = TRUE), function(x) x[1])
b <- unlist(Map(function(a, b) sub(a, "", b, fixed = TRUE), a, x))
data.frame(itemCode = a, itemDescription = b, row.names = NULL)
#> itemCode itemDescription
#> 1 1234 (a)-b free vacation to aruba
#> 2 1234:43-1b free set of dishes
There's a slight caveat that [a-z]{4}
will only work if the first 4 letters don't contain letters outside the standard 26 symbols in this set (for example, accented letters).
Created on 2023-01-21 with reprex v2.0.2
CodePudding user response:
Please check the below code with str_extract
data.frame(x=c('1234 (a)-b free vacation to aruba',
'1234:43-1b free set of dishes')) %>%
mutate(itemCode=str_extract_all(trimws(x), '\\d .*[\\-|\\d]\\w\\s(?=\\w{4})'),
itemDesctiption=str_extract_all(trimws(x), '\\s\\w{4}\\s.*'))
Created on 2023-01-21 with reprex v2.0.2
x itemCode itemDesctiption
1 1234 (a)-b free vacation to aruba 1234 (a)-b free vacation to aruba
2 1234:43-1b free set of dishes 1234:43-1b free set of dishes
CodePudding user response:
This solution is based on tidyr
's function extract
:
library(tidyr)
library(dplyr)
data.frame(x) %>%
extract(x,
into = c("itemCode", "itemDescription"),
regex = "([()0-9a-z-] )[\\s-] ([a-z]{4,}\\s.*)"
)
itemCode itemDescription
1 04(4)(a) vacation - 2-3 weeks obo
2 230(11)(a) cars - - 18 plus winners
3 073 boxes of choclates
4 130(11)(a) wont be offering -- too expensive - see details
5 23-3057(a)(5) grand prize / cruise for >= 18 year old (min)
6 33-314 choice of prizes & $500 cash
7 656-2-316(a)(iii) free books < 100 / price < 13 & 27 dollars. / or choice of prizes
8 231-5510 airfare (more than 200)
9 5er20c1a prizes (under $500)
10 520g2 prizes over 500
11 35-42-4-9 prizes 250 plus
12 130(11)(b) retired category
How the regex
works: essentially, the strings in x
are divided into two capture groups containing what is to be extracted into the two new columns:
(.*[()0-9abc-])
: the 1st capture group; here we assert that the group ends on any of parenthesis, digits, dash, ora
,b
, orc
(please adapt as you see fit!)\\s
: one whitespace (not captured)([a-z]{4,}\\s.*)
: 2nd capture group. here we assert that there must be at least 4 lower-case letters followed by one whitespace and more characters
EDIT 1:
check out:
regex = "([()0-9a-z-] )[\\s-] (.*)"
seems to work too!
EDIT 2:
based on the observation that itemCode
is never interrupted by whitespace, this works too:
regex = "(\\S )[\\s-] (.*)"