I'm working with two data frames, A and B. In column 1 of df A I have client IDs, and in column 2 every purchase made by that client (an 800k-by-2 df). In df B, column 1 has client IDs and column 2 a single purchase made by the client (a 300k-by-2 df). Example: A =
ID | Purchases |
---|---|
362 | shoes;shirt;... |
363 | pants;pants;... |
B=
ID | Purchase |
---|---|
362 | shoes |
362 | shirt |
363 | pants |
363 | pants |
If I want to concatenate each client's purchases into column 2 of df A, I have to use two nested for loops: for every ID in df A, search every purchase in df B.
The sizes are 300k and 800k. I've converted the data frames to lists to speed up my code, but it is still very slow: 4459.75 elapsed, 3900.95 user and 541.37 system. Are there better data structures to use? To concatenate I'm using the paste function.
Here's the code :
for (i in 1:800000) {
  for (j in 1:300000) {
    if (A$ID[i] == B$ID[j]) A$purchases[i] <- paste(A$purchases[i], B$purchase[j], sep = ";")
  }
}
CodePudding user response:
I think there are a few things to mention here.

Your operation is, on the face of it, a two-step process (no loops required): take your `B`, summarize it into one row per user (collapsing the strings into one), then join and append onto `A`'s data. Both of those are simple steps on their own, so we'll walk through the basics.

If you want to ensure uniqueness in `A`, we need to re-process every line. While one might go with regexes (on the assumption that a semicolon will never appear in real data), I think that can get a little expensive in the long run. (Granted, nowhere near as expensive as a double `for` loop; nothing I suggest will be within an order of magnitude of that slow.) Another option is to convert `Purchases` into a list-column, for which we can do simpler (less slow) operations for set membership, etc.
On the face of it, try this:
library(dplyr)
B %>%
group_by(ID) %>%
summarize(NewPurchases = paste(Purchase, collapse = ";"))
# # A tibble: 1 x 2
# ID NewPurchases
# <int> <chr>
# 1 362 shoes;shirt
This we can join back onto `A` and concatenate:
B %>%
  group_by(ID) %>%
  summarize(NewPurchases = paste(Purchase, collapse = ";")) %>%
  full_join(A, ., by = "ID") %>%
  # case_when keeps this row-wise: an ID present in only one of the two
  # frames keeps the non-NA side, otherwise we paste the two together
  mutate(Purchases = case_when(
    is.na(NewPurchases) ~ Purchases,
    is.na(Purchases)    ~ NewPurchases,
    TRUE                ~ paste(Purchases, NewPurchases, sep = ";")
  )) %>%
  select(-NewPurchases)
#    ID                   Purchases
# 1 362 shoes;shirt;...;shoes;shirt
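If you'd rather stay in base R, the same two-step (collapse `B`, then merge onto `A`) can be sketched like this; column names follow the example data above:

```r
# Collapse B to one row per ID (base-R sketch of the pipeline above).
collapsed <- aggregate(Purchase ~ ID, data = B, FUN = paste, collapse = ";")
# Left-join onto A, keeping every A row.
merged <- merge(A, collapsed, by = "ID", all.x = TRUE)
# Append only where B actually had purchases for that ID.
merged$Purchases <- ifelse(is.na(merged$Purchase), merged$Purchases,
                           paste(merged$Purchases, merged$Purchase, sep = ";"))
merged$Purchase <- NULL
```

Same result, no package dependency; `aggregate` is slower than `dplyr::summarize` on very large groups, but nothing like the nested loop.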
I think that list-columns are still a better way to go, for two reasons:

- No need to worry about escaping `;` (if ever present): the contents of that column on any single row is a `character` vector of 0 or more elements, so each "purchase" is its own element within the vector.
- We can use set membership, uniqueness, or similar when combining. This also includes tabulating (summarizing) by purchase item.
Let's first convert `A$Purchases` to a list-column:
A <- mutate(A, Purchases = strsplit(Purchases, ";"))
A
# ID Purchases
# 1 362 shoes, shirt, ...
str(A)
# 'data.frame': 1 obs. of 2 variables:
# $ ID : int 362
# $ Purchases:List of 1
# ..$ : chr "shoes" "shirt" "..."
From here, the revised workflow to join in the `B` data:
newA <- B %>%
group_by(ID) %>%
summarize(NewPurchases = list(Purchase)) %>%
full_join(A, ., by = "ID") %>%
mutate(Purchases = Map(c, Purchases, NewPurchases)) %>%
### or, if you want to guarantee uniqueness
# mutate(Purchases = Map(function(a, b) unique(c(a, b)), Purchases, NewPurchases)) %>%
select(-NewPurchases)
newA
# ID Purchases
# 1 362 shoes, shirt, ..., shoes, shirt, pants
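As a quick illustration of the set-membership point above (using `newA` as built here; the column name is from the example), each row's `Purchases` is now a plain character vector, so `%in%` works directly:

```r
# Which clients bought shoes? No string splitting or regex needed:
# each element of the list-column is already a character vector.
bought_shoes <- sapply(newA$Purchases, function(p) "shoes" %in% p)
newA$ID[bought_shoes]
```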
From here, tabulating purchases can be as easy as:
table(unlist(newA$Purchases))
# ... pants shirt shoes
# 1 1 2 2
or per-`ID` with
xtabs(~ ID + Purchases, data = tidyr::unnest(newA, Purchases))
# Purchases
# ID ... pants shirt shoes
# 362 1 1 2 2
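Finally, on the original question about data structures: the nested loop is slow because it scans all of `B` for every row of `A`. A named list built with `split()` acts like a hash keyed by ID, so the whole job becomes one pass over each frame. A base-R sketch, with column names as in the example:

```r
# Group B's purchases by ID: a named list, keyed by ID (hash-like lookup).
by_id <- split(B$Purchase, B$ID)
# Collapse each group once.
collapsed <- vapply(by_id, paste, character(1), collapse = ";")
# match() finds each A ID's position among the names in one vectorized pass.
idx <- match(as.character(A$ID), names(by_id))
A$Purchases <- ifelse(is.na(idx), A$Purchases,
                      paste(A$Purchases, collapsed[idx], sep = ";"))
```

On 300k/800k rows this should finish in seconds rather than the 4459.75 elapsed you measured, since it avoids the 800k x 300k comparison entirely.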