Home > front end >  Dataframe Manipulation in R: Separating Multiple Items in a Single Order between a Number and Letter
Dataframe Manipulation in R: Separating Multiple Items in a Single Order between a Number and Letter

Time:09-27

I have a huge dataframe of sales data with customer data and order information. Unfortunately for each customer entry, the items sold per customer are all smashed into together in one long string i.e. Eggs x 2 @ $6Bread x 2 @ $5Chicken x 3 @ $15. I've been messing around with dplyr and regex commands but it just creates the new columns.

newdf <- df %>%
         separate(V3, 
              into = c("Item 1", "Item 2", "Item 3", "Item 4", "Item 5", "Item 6", "Item 7"), 
          sep = "(?=[0-9])(?<=[A-Za-z])")

Input

V1 V2 V3
Customer Items Purchased Items
Jane 3 Eggs x 2 @ $6Bread x 2 @ $5Chicken x 3 @ $15
Bob 1 Table x 1 @ $75
Mark 7 Apples x 2 @ $7Bread x 1 @ $7Chicken x 3 @ $15Eggs x 2 @ $6Bread x 2 @ $5Chicken x 3 @ $15Seeds x 3 @ $2

Desired Output

V1 V2 V3 V4 V5 V6 V7 V8 V9
Customer Items Purchased Item 1 Item 2 Item 3 Item 4 Item 5 Item 6 Item 7
Jane 3 Eggs x 2 @ $6 Bread x 2 @ $5 Chicken x 3 @ $15
Bob 1 Table x 1 @ $75
Mark 7 Apples x 2 @ $7 Bread x 1 @ $7 Chicken x 3 @ $15 Eggs x 2 @ $6 Bread x 2 @ $5 Chicken x 3 @ $15 Seeds x 3 @ $2

CodePudding user response:

You can create a regex pattern 'word x number @ $number' which can be written as '\\w x \\d @ \\$\\d ' and use it to extract the relevant text from Items column.

tmp <- stringr::str_extract_all(df$Items, '\\w  x \\d  @ \\$\\d ')
tmp

#[[1]]
#[1] "Eggs x 2 @ $6"     "Bread x 2 @ $5"    "Chicken x 3 @ $15"

#[[2]]
#[1] "Table x 1 @ $75"

#[[3]]
#[1] "Apples x 2 @ $7"   "Bread x 1 @ $7"    "Chicken x 3 @ $15" "Eggs x 2 @ $6"
#[5] "Bread x 2 @ $5"    "Chicken x 3 @ $15" "Seeds x 3 @ $2"   

Add them as a new columns in the dataframe appending NAs for shorter length data.

n <- 1:max(lengths(tmp))
result <- cbind(df[1:2], t(sapply(tmp, `[`, n)))
result

#  Customer Items.Purchased               1              2                 3
#1     Jane               3   Eggs x 2 @ $6 Bread x 2 @ $5 Chicken x 3 @ $15
#2      Bob               1 Table x 1 @ $75           <NA>              <NA>
#3     Mark               7 Apples x 2 @ $7 Bread x 1 @ $7 Chicken x 3 @ $15

#              4              5                 6              7
#1          <NA>           <NA>              <NA>           <NA>
#2          <NA>           <NA>              <NA>           <NA>
#3 Eggs x 2 @ $6 Bread x 2 @ $5 Chicken x 3 @ $15 Seeds x 3 @ $2

data

It is easier to help if you provide data in a reproducible format

df <- structure(list(Customer = c("Jane", "Bob", "Mark"), Items.Purchased = c(3L, 
1L, 7L), Items = c("Eggs x 2 @ $6Bread x 2 @ $5Chicken x 3 @ $15", 
"Table x 1 @ $75", "Apples x 2 @ $7Bread x 1 @ $7Chicken x 3 @ $15Eggs x 2 @ $6Bread x 2 @ $5Chicken x 3 @ $15Seeds x 3 @ $2"
)), row.names = c(NA, -3L), class = "data.frame")

CodePudding user response:

I'd do this in two steps: Mark where the splits should be then do the separate:

library(tidyverse)
     
fun_split <- function(x) {
       str_replace_all(x, "([0-9])([A-Z])", "\\1 - \\2")
     }

df <- as_tibble(df)

df$Items <- fun_split(df$Items)
    
    
df_new <-  df %>% 
      separate(Items, 
               into = c("Item 1", "Item 2", "Item 3", "Item 4", "Item 5", "Item 6", "Item 7"), 
               sep = " - ")

CodePudding user response:

Here is another way you could use. It should be noted that I used a regex pattern to match your desired sub strings but since it could also match the who string I used ? operator to make it lazy so that after it matches the shortest possible string it again looks for another one:

library(dplyr)
library(tidyr)

df %>%
  mutate(output = regmatches(Items, gregexpr("[A-Z].*?\\$\\d ", Items, perl = TRUE))) %>%
  as_tibble() %>%
  unnest_wider(output) %>% 
  setNames(gsub("(\\.){3}(\\d)", "Item\\2", names(.)))

# A tibble: 3 x 10
  Customer Items.Purchased Items              Item1   Item2   Item3  Item4  Item5  Item6  Item7 
  <chr>              <int> <chr>              <chr>   <chr>   <chr>  <chr>  <chr>  <chr>  <chr> 
1 Jane                   3 Eggs x 2 @ $6Brea~ Eggs x~ Bread ~ Chick~ NA     NA     NA     NA    
2 Bob                    1 Table x 1 @ $75    Table ~ NA      NA     NA     NA     NA     NA    
3 Mark                   7 Apples x 2 @ $7Br~ Apples~ Bread ~ Chick~ Eggs ~ Bread~ Chick~ Seeds~
  • Related