I have a huge dataframe of sales data with customer data and order information. Unfortunately for each customer entry, the items sold per customer are all smashed into together in one long string i.e. Eggs x 2 @ $6Bread x 2 @ $5Chicken x 3 @ $15. I've been messing around with dplyr and regex commands but it just creates the new columns.
newdf <- df %>%
separate(V3,
into = c("Item 1", "Item 2", "Item 3", "Item 4", "Item 5", "Item 6", "Item 7"),
sep = "(?=[0-9])(?<=[A-Za-z])")
Input
V1 | V2 | V3 |
---|---|---|
Customer | Items Purchased | Items |
Jane | 3 | Eggs x 2 @ $6Bread x 2 @ $5Chicken x 3 @ $15 |
Bob | 1 | Table x 1 @ $75 |
Mark | 7 | Apples x 2 @ $7Bread x 1 @ $7Chicken x 3 @ $15Eggs x 2 @ $6Bread x 2 @ $5Chicken x 3 @ $15Seeds x 3 @ $2 |
Desired Output
V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 |
---|---|---|---|---|---|---|---|---|
Customer | Items Purchased | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 | Item 7 |
Jane | 3 | Eggs x 2 @ $6 | Bread x 2 @ $5 | Chicken x 3 @ $15 | ||||
Bob | 1 | Table x 1 @ $75 | ||||||
Mark | 7 | Apples x 2 @ $7 | Bread x 1 @ $7 | Chicken x 3 @ $15 | Eggs x 2 @ $6 | Bread x 2 @ $5 | Chicken x 3 @ $15 | Seeds x 3 @ $2 |
CodePudding user response:
You can create a regex pattern 'word x number @ $number'
which can be written as '\\w x \\d @ \\$\\d '
and use it to extract the relevant text from Items
column.
tmp <- stringr::str_extract_all(df$Items, '\\w x \\d @ \\$\\d ')
tmp
#[[1]]
#[1] "Eggs x 2 @ $6" "Bread x 2 @ $5" "Chicken x 3 @ $15"
#[[2]]
#[1] "Table x 1 @ $75"
#[[3]]
#[1] "Apples x 2 @ $7" "Bread x 1 @ $7" "Chicken x 3 @ $15" "Eggs x 2 @ $6"
#[5] "Bread x 2 @ $5" "Chicken x 3 @ $15" "Seeds x 3 @ $2"
Add them as a new columns in the dataframe appending NA
s for shorter length data.
n <- 1:max(lengths(tmp))
result <- cbind(df[1:2], t(sapply(tmp, `[`, n)))
result
# Customer Items.Purchased 1 2 3
#1 Jane 3 Eggs x 2 @ $6 Bread x 2 @ $5 Chicken x 3 @ $15
#2 Bob 1 Table x 1 @ $75 <NA> <NA>
#3 Mark 7 Apples x 2 @ $7 Bread x 1 @ $7 Chicken x 3 @ $15
# 4 5 6 7
#1 <NA> <NA> <NA> <NA>
#2 <NA> <NA> <NA> <NA>
#3 Eggs x 2 @ $6 Bread x 2 @ $5 Chicken x 3 @ $15 Seeds x 3 @ $2
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(Customer = c("Jane", "Bob", "Mark"), Items.Purchased = c(3L,
1L, 7L), Items = c("Eggs x 2 @ $6Bread x 2 @ $5Chicken x 3 @ $15",
"Table x 1 @ $75", "Apples x 2 @ $7Bread x 1 @ $7Chicken x 3 @ $15Eggs x 2 @ $6Bread x 2 @ $5Chicken x 3 @ $15Seeds x 3 @ $2"
)), row.names = c(NA, -3L), class = "data.frame")
CodePudding user response:
I'd do this in two steps: Mark where the splits should be then do the separate:
library(tidyverse)
fun_split <- function(x) {
str_replace_all(x, "([0-9])([A-Z])", "\\1 - \\2")
}
df <- as_tibble(df)
df$Items <- fun_split(df$Items)
df_new <- df %>%
separate(Items,
into = c("Item 1", "Item 2", "Item 3", "Item 4", "Item 5", "Item 6", "Item 7"),
sep = " - ")
CodePudding user response:
Here is another way you could use. It should be noted that I used a regex pattern to match your desired sub strings but since it could also match the who string I used ?
operator to make it lazy so that after it matches the shortest possible string it again looks for another one:
library(dplyr)
library(tidyr)
df %>%
mutate(output = regmatches(Items, gregexpr("[A-Z].*?\\$\\d ", Items, perl = TRUE))) %>%
as_tibble() %>%
unnest_wider(output) %>%
setNames(gsub("(\\.){3}(\\d)", "Item\\2", names(.)))
# A tibble: 3 x 10
Customer Items.Purchased Items Item1 Item2 Item3 Item4 Item5 Item6 Item7
<chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Jane 3 Eggs x 2 @ $6Brea~ Eggs x~ Bread ~ Chick~ NA NA NA NA
2 Bob 1 Table x 1 @ $75 Table ~ NA NA NA NA NA NA
3 Mark 7 Apples x 2 @ $7Br~ Apples~ Bread ~ Chick~ Eggs ~ Bread~ Chick~ Seeds~