Dataframe Manipulation in R: Separating Multiple Items in a Single Order between a Number and Letter-CodePudding

I have a huge dataframe of sales data with customer data and order information. Unfortunately for each customer entry, the items sold per customer are all smashed into together in one long string i.e. Eggs x 2 @ $6Bread x 2 @ $5Chicken x 3 @ $15. I've been messing around with dplyr and regex commands but it just creates the new columns.

newdf <- df %>%
         separate(V3, 
              into = c("Item 1", "Item 2", "Item 3", "Item 4", "Item 5", "Item 6", "Item 7"), 
          sep = "(?=[0-9])(?<=[A-Za-z])")

Input

V1	V2	V3
Customer	Items Purchased	Items
Jane	3	Eggs x 2 @ $6Bread x 2 @ $5Chicken x 3 @ $15
Bob	1	Table x 1 @ $75
Mark	7	Apples x 2 @ $7Bread x 1 @ $7Chicken x 3 @ $15Eggs x 2 @ $6Bread x 2 @ $5Chicken x 3 @ $15Seeds x 3 @ $2

Desired Output

V1	V2	V3	V4	V5	V6	V7	V8	V9
Customer	Items Purchased	Item 1	Item 2	Item 3	Item 4	Item 5	Item 6	Item 7
Jane	3	Eggs x 2 @ $6	Bread x 2 @ $5	Chicken x 3 @ $15
Bob	1	Table x 1 @ $75
Mark	7	Apples x 2 @ $7	Bread x 1 @ $7	Chicken x 3 @ $15	Eggs x 2 @ $6	Bread x 2 @ $5	Chicken x 3 @ $15	Seeds x 3 @ $2

CodePudding user response：

You can create a regex pattern 'word x number @ $number' which can be written as '\\w x \\d @ \\$\\d ' and use it to extract the relevant text from Items column.

tmp <- stringr::str_extract_all(df$Items, '\\w  x \\d  @ \\$\\d ')
tmp

#[[1]]
#[1] "Eggs x 2 @ $6"     "Bread x 2 @ $5"    "Chicken x 3 @ $15"

#[[2]]
#[1] "Table x 1 @ $75"

#[[3]]
#[1] "Apples x 2 @ $7"   "Bread x 1 @ $7"    "Chicken x 3 @ $15" "Eggs x 2 @ $6"
#[5] "Bread x 2 @ $5"    "Chicken x 3 @ $15" "Seeds x 3 @ $2"

Add them as a new columns in the dataframe appending NAs for shorter length data.

n <- 1:max(lengths(tmp))
result <- cbind(df[1:2], t(sapply(tmp, `[`, n)))
result

#  Customer Items.Purchased               1              2                 3
#1     Jane               3   Eggs x 2 @ $6 Bread x 2 @ $5 Chicken x 3 @ $15
#2      Bob               1 Table x 1 @ $75           <NA>              <NA>
#3     Mark               7 Apples x 2 @ $7 Bread x 1 @ $7 Chicken x 3 @ $15

#              4              5                 6              7
#1          <NA>           <NA>              <NA>           <NA>
#2          <NA>           <NA>              <NA>           <NA>
#3 Eggs x 2 @ $6 Bread x 2 @ $5 Chicken x 3 @ $15 Seeds x 3 @ $2

data

It is easier to help if you provide data in a reproducible format

df <- structure(list(Customer = c("Jane", "Bob", "Mark"), Items.Purchased = c(3L, 
1L, 7L), Items = c("Eggs x 2 @ $6Bread x 2 @ $5Chicken x 3 @ $15", 
"Table x 1 @ $75", "Apples x 2 @ $7Bread x 1 @ $7Chicken x 3 @ $15Eggs x 2 @ $6Bread x 2 @ $5Chicken x 3 @ $15Seeds x 3 @ $2"
)), row.names = c(NA, -3L), class = "data.frame")

CodePudding user response：

I'd do this in two steps: Mark where the splits should be then do the separate:

library(tidyverse)
     
fun_split <- function(x) {
       str_replace_all(x, "([0-9])([A-Z])", "\\1 - \\2")
     }

df <- as_tibble(df)

df$Items <- fun_split(df$Items)
    
    
df_new <-  df %>% 
      separate(Items, 
               into = c("Item 1", "Item 2", "Item 3", "Item 4", "Item 5", "Item 6", "Item 7"), 
               sep = " - ")

CodePudding user response：

Here is another way you could use. It should be noted that I used a regex pattern to match your desired sub strings but since it could also match the who string I used ? operator to make it lazy so that after it matches the shortest possible string it again looks for another one:

library(dplyr)
library(tidyr)

df %>%
  mutate(output = regmatches(Items, gregexpr("[A-Z].*?\\$\\d ", Items, perl = TRUE))) %>%
  as_tibble() %>%
  unnest_wider(output) %>% 
  setNames(gsub("(\\.){3}(\\d)", "Item\\2", names(.)))

# A tibble: 3 x 10
  Customer Items.Purchased Items              Item1   Item2   Item3  Item4  Item5  Item6  Item7 
  <chr>              <int> <chr>              <chr>   <chr>   <chr>  <chr>  <chr>  <chr>  <chr> 
1 Jane                   3 Eggs x 2 @ $6Brea~ Eggs x~ Bread ~ Chick~ NA     NA     NA     NA    
2 Bob                    1 Table x 1 @ $75    Table ~ NA      NA     NA     NA     NA     NA    
3 Mark                   7 Apples x 2 @ $7Br~ Apples~ Bread ~ Chick~ Eggs ~ Bread~ Chick~ Seeds~