I am currently working on a programming puzzle that sounds straightforward, but apparently it is pretty difficult if I want to do this efficiently in R without having to use for
loop to go through a column with 100k rows within a data-frame. I am trying to apply dplyr
(particularly group_by
and mutate
) or data.table
, and -apply
family, but it's quite tough. Could anyone give some help?
The problem is as follows: given a data-frame df
with columns key
("string" data type), x
, y
, z
(all are "numeric" data type). Some elements within column key
might get repeated. The rule set is as follows: for every rows with the same value in key
column, we determine whether their values in column x
are smaller than the sum of elements in column y
(for example, with key = aa_bb_1
, there are 6 rows with this key, and all of these rows always have the same value in column x
. Please see the Sample Output to see how the rule works). If it is, then keep that value in column x
, while distributing the element in column x
to elements in column y
in a decreasing order based on corresponding values in column z
. How do we effectively do this given that we need to go through all distinct elements in column key
?
Sample Input
df <- data.frame(key = c('aa_bb_1', 'aa_bb_0', 'ab_ca_0', 'abc_bbb_1', 'abbbc_aa_1', 'aaa_ccc_1',
'aa_bb_1', 'aa_bb_1', 'ab_ca_0', 'abc_bbb_1', 'abbbc_aa_1', 'aaa_ccc_1',
'aa_bb_0', 'aa_bb_1', 'ab_ca_0', 'abc_bbb_0', 'abbbc_aa_0', 'aaa_ccc_1',
'aa_bb_0', 'aa_bb_1', 'ab_ca_1', 'abc_bbb_1', 'abbbc_aa_1', 'aaa_ccc_1',
'aa_bb_1', 'aa_bb_0', 'ab_ca_0', 'abc_bbb_1', 'abbbc_aa_1', 'aaa_ccc_1'),
x = c(20, 19, 30, 25, 37, 13, 20, 20, 30, 25, 37, 13, 19, 20, 30, 43,
71, 13, 19, 20, 10, 25, 37, 13, 20, 19, 30, 25, 37,13),
y = c(3, 10, 18, 15, 32, 4, 12, 29, 71, 92, 11, 7, 21, 19, 13,
26,28,11,8, 8, 5, 23, 3, 12, 19, 7, 9, 11, 7, 12),
z = c(8,13,15,16,10,10,25,21,32,15,45,8,10,50,12,10,35,
23,10,12,2,40,45,57,66,49,100,5,11,30))
key x y z
1 aa_bb_1 20 3 8
2 aa_bb_0 19 10 13
3 ab_ca_0 30 18 15
4 abc_bbb_1 25 15 16
5 abbbc_aa_1 37 32 10
6 aaa_ccc_1 13 4 10
7 aa_bb_1 20 12 25
8 aa_bb_1 20 29 21
9 ab_ca_0 30 71 32
10 abc_bbb_1 25 92 15
11 abbbc_aa_1 37 11 45
12 aaa_ccc_1 13 7 8
13 aa_bb_0 19 21 10
14 aa_bb_1 20 19 50
15 ab_ca_0 30 13 12
16 abc_bbb_0 43 26 10
17 abbbc_aa_0 71 28 35
18 aaa_ccc_1 13 11 23
19 aa_bb_0 19 8 10
20 aa_bb_1 20 8 12
21 ab_ca_1 10 5 2
22 abc_bbb_1 25 23 40
23 abbbc_aa_1 37 3 45
24 aaa_ccc_1 13 12 57
25 aa_bb_1 20 19 66
26 aa_bb_0 19 7 49
27 ab_ca_0 30 9 100
28 abc_bbb_1 25 11 5
29 abbbc_aa_1 37 7 11
30 aaa_ccc_1 13 12 30
Sample Output for aa_bb_1 and aa_bb_0
key x y z
1 aa_bb_1 20 0 8
2 aa_bb_0 19 10 13 -- Second largest value of z among rows with same key aa_bb_0. Get second distribution equal to min(10,19-7)=min(10,12)=10.
7 aa_bb_1 20 0 25
8 aa_bb_1 20 0 21 -- Nothing left to be distributed => 0 in column y.
13 aa_bb_0 19 0 10 --- Nothing left so distribute 0
14 aa_bb_1 20 1 50 --- Second largest value of z among rows with same key aa_bb_1. So distribute min(19,20-19)=1 to column y.
19 aa_bb_0 19 2 10 --- Tie as third largest value of z among rows with same key aa_bb_0. Pick *randomly* for now (in reality, I would have another column to decide on which row would get distributed first). Since min(8,19-7-10)=min(8,2)=2, only 2 is distributed.
20 aa_bb_1 20 0 12
25 aa_bb_1 20 19 66 --- Largest value of z among rows with same key aa_bb_1. Get first distribution = min(20, 19)=19.
26 aa_bb_0 19 7 49 --- Largest value of z among rows with same key aa_bb_0. Get first distribution equal to min(7,19)=7.
Caveat. Only perform the above operations if the sum of all the elements in column z
with the same key
is greater than the value in column x
of that key.
Example includes aa_bb_1
where x
= 20 < 3 19 8 19
CodePudding user response:
Pretty much anything you can do with a for loop.
Here I apply a function to the data.frame split by key, that function being a for loop. Then I assign the output to the ordered df, because the split data frame loop output is ordered by key.
df <- dplyr::arrange(df, key, desc(z))
df$y <- lapply(split(df, df$key), \(x) {
ndf <- x
base <- min(ndf$x)
#out values for y
yout = list()
for (i in seq(nrow(x))) {
##get the max
maxz <- which.max(ndf$z)
##get the minimum
minv <- min(base, ndf$y[maxz])
#add to yout
yout[[i]] <- minv
#new base
base <- base - minv
##update dataframe
ndf <- ndf[-maxz, ]
}
return(yout)
}) |> unlist()
key x y z
1 aa_bb_0 19 7 49
2 aa_bb_0 19 10 13
3 aa_bb_0 19 2 10
4 aa_bb_0 19 0 10
5 aa_bb_1 20 19 66
6 aa_bb_1 20 1 50
7 aa_bb_1 20 0 25
8 aa_bb_1 20 0 21
9 aa_bb_1 20 0 12
10 aa_bb_1 20 0 8
11 aaa_ccc_1 13 12 57
12 aaa_ccc_1 13 1 30
13 aaa_ccc_1 13 0 23
14 aaa_ccc_1 13 0 10