I'm working with two data frames, A and B. In column 1 of df A I have client IDs, and in column 2 every purchase made by that client (an 800k-by-2 df). In df B, column 1 has client IDs and column 2 a single purchase made by the client (a 300k-by-2 df). Example: A =
ID | Purchases |
---|---|
362 | shoes;shirt;... |
363 | pants;pants;... |
B=
ID | Purchase |
---|---|
362 | shoes |
362 | shirt |
363 | pants |
363 | pants |
If I want to concatenate each client's purchases into column 2 of df A, I have to use two nested for loops: for every ID in df A, search every purchase in df B.
The sizes are 300k and 800k. I've converted the data frames to lists to speed up my code, but it is still very slow: 4459.75 elapsed, 3900.95 user and 541.37 system. Are there better data structures to use? To concatenate I'm using the paste function.
Here's the code :
for (i in 1:800000) {
  for (j in 1:300000) {
    if (A$ID[i] == B$ID[j]) A$purchases[i] <- paste(A$purchases[i], B$purchase[j], sep = ";")
  }
}
CodePudding user response:
I think there are a few things to mention here.

Your operation is, on the face of it, a two-step process (no loops required): take your `B`, summarize it into one row per user (collapsing the strings into one), then join and append onto `A`'s data. Both of those are simple steps on their own, so we'll walk through the basics.

If you want to ensure uniqueness in `A`, we need to re-process every line. While one might go with regexes (on the assumption that a semicolon will never appear in real data), I think that can get a little expensive in the long run. (Granted, nowhere near as expensive as a double `for` loop; nothing I suggest will be within an order of magnitude of that slow.) Another option is to convert `Purchases` into a list-column, for which we can do simpler (less slow) operations for set membership, etc.
On the face of it, try this:
library(dplyr)
B %>%
group_by(ID) %>%
summarize(NewPurchases = paste(Purchase, collapse = ";"))
# # A tibble: 1 x 2
# ID NewPurchases
# <int> <chr>
# 1 362 shoes;shirt
This we can join back onto `A` and concatenate:
B %>%
  group_by(ID) %>%
  summarize(NewPurchases = paste(Purchase, collapse = ";")) %>%
  full_join(A, ., by = "ID") %>%
  # case_when keeps this row-wise: an ID present in only one of the two
  # frames keeps the non-NA side, otherwise we paste the two together
  mutate(Purchases = case_when(
    is.na(NewPurchases) ~ Purchases,
    is.na(Purchases)    ~ NewPurchases,
    TRUE                ~ paste(Purchases, NewPurchases, sep = ";")
  )) %>%
  select(-NewPurchases)
#    ID                   Purchases
# 1 362 shoes;shirt;...;shoes;shirt
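If you'd rather stay in base R, the same two-step (collapse `B`, then merge onto `A`) can be sketched like this; column names follow the example data above:

```r
# Collapse B to one row per ID (base-R sketch of the pipeline above).
collapsed <- aggregate(Purchase ~ ID, data = B, FUN = paste, collapse = ";")
# Left-join onto A, keeping every A row.
merged <- merge(A, collapsed, by = "ID", all.x = TRUE)
# Append only where B actually had purchases for that ID.
merged$Purchases <- ifelse(is.na(merged$Purchase), merged$Purchases,
                           paste(merged$Purchases, merged$Purchase, sep = ";"))
merged$Purchase <- NULL
```

Same result, no package dependency; `aggregate` is slower than `dplyr::summarize` on very large groups, but nothing like the nested loop.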
I think that list-columns are still a better way to go, for two reasons:

- No need to worry about escaping `;` (if ever present): the contents of that column on any single row is a `character` vector of 0 or more elements, so each "purchase" is its own element within the vector.
- We can use set membership, uniqueness, or similar when combining. This also includes tabulating (summarizing) by purchase item.
Let's first convert `A$Purchases` to a list-column:
A <- mutate(A, Purchases = strsplit(Purchases, ";"))
A
# ID Purchases
# 1 362 shoes, shirt, ...
str(A)
# 'data.frame': 1 obs. of 2 variables:
# $ ID : int 362
# $ Purchases:List of 1
# ..$ : chr "shoes" "shirt" "..."
From here, the revised workflow to join in the `B` data:
newA <- B %>%
group_by(ID) %>%
summarize(NewPurchases = list(Purchase)) %>%
full_join(A, ., by = "ID") %>%
mutate(Purchases = Map(c, Purchases, NewPurchases)) %>%
### or, if you want to guarantee uniqueness
# mutate(Purchases = Map(function(a, b) unique(c(a, b)), Purchases, NewPurchases)) %>%
select(-NewPurchases)
newA
# ID Purchases
# 1 362 shoes, shirt, ..., shoes, shirt, pants
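As a quick illustration of the set-membership point above (using `newA` as built here; the column name is from the example), each row's `Purchases` is now a plain character vector, so `%in%` works directly:

```r
# Which clients bought shoes? No string splitting or regex needed:
# each element of the list-column is already a character vector.
bought_shoes <- sapply(newA$Purchases, function(p) "shoes" %in% p)
newA$ID[bought_shoes]
```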
From here, tabulating purchases can be as easy as:
table(unlist(newA$Purchases))
# ... pants shirt shoes
# 1 1 2 2
or per-`ID` with
xtabs(~ ID + Purchases, data = tidyr::unnest(newA, Purchases))
# Purchases
# ID ... pants shirt shoes
# 362 1 1 2 2
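Finally, on the original question about data structures: the nested loop is slow because it scans all of `B` for every row of `A`. A named list built with `split()` acts like a hash keyed by ID, so the whole job becomes one pass over each frame. A base-R sketch, with column names as in the example:

```r
# Group B's purchases by ID: a named list, keyed by ID (hash-like lookup).
by_id <- split(B$Purchase, B$ID)
# Collapse each group once.
collapsed <- vapply(by_id, paste, character(1), collapse = ";")
# match() finds each A ID's position among the names in one vectorized pass.
idx <- match(as.character(A$ID), names(by_id))
A$Purchases <- ifelse(is.na(idx), A$Purchases,
                      paste(A$Purchases, collapsed[idx], sep = ";"))
```

On 300k/800k rows this should finish in seconds rather than the 4459.75 elapsed you measured, since it avoids the 800k x 300k comparison entirely.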