Tidy way of comparing "tiles" of users


Let's say df presents an aggregated metric from an A/B test with groups A and B. x is, for example, the number of page visits, and n is the number of users with that many visits. (In reality there are far more users and the differences are small.) Note that the two groups have different numbers of users.

library(tidyverse)

df <- bind_rows(
    tibble(group = "A", x = rpois(100, 1)),
    tibble(group = "B", x = rpois(200, 2))
) %>% 
    count(group, x)

I want to compare tiles of users. By tile, I mean the users in group A that share the same x value. For example, if 34.17% of users in group A have the value 0, I want to compare them to the mean x of the lowest 34.17% of users in group B. Next, users with 1 visit in group A sit between the 34.17% and 74.8% quantiles, so I want to compare them with the users in the same quantile range of group B (but it should be more precise). And so on. A small sketch of the boundary arithmetic follows.
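To make the tile boundaries concrete, here is a minimal sketch using the percentages quoted above (the third tile is just the remainder, and the object names are hypothetical):

# share of group A users per x value: 34.17% have x = 0, 40.63% have x = 1, ...
props <- c(`0` = 0.3417, `1` = 0.4063, `2` = 0.2520)
p_max <- cumsum(props)                    # upper quantile of each tile: 0.3417 0.7480 1.0000
p_min <- dplyr::lag(p_max, default = 0)   # lower quantile of each tile: 0.0000 0.3417 0.7480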

Here's my try:

n_fake <- 1000

df_agg_per_imp <- df %>% 
    group_by(group) %>% 
    mutate(
        # tile boundaries, scaled to n_fake "fake" users per group
        p_max = n_fake * cumsum(n) / sum(n),
        p_min = lag(p_max, default = 0),
        # one integer per fake user falling inside the tile
        p = map2(p_min + 1, p_max, seq)
    ) %>%
    ungroup()

df_agg_per_imp %>% 
    unnest(p) %>%   # expands to one row per fake user
    pivot_wider(id_cols = p, names_from = group, values_from = x) %>% 
    group_by(A) %>% 
    summarise(
        p_min = min(p) / n_fake,
        p_max = max(p) / n_fake,
        rel_uplift = mean(B) / mean(A)
    )

#> # A tibble: 6 × 4
#>       A p_min p_max rel_uplift
#>   <int> <dbl> <dbl>      <dbl>
#> 1     0 0.001  0.34     Inf   
#> 2     1 0.341  0.74       1.92
#> 3     2 0.741  0.91       1.57
#> 4     3 0.911  0.96       1.33
#> 5     4 0.961  0.99       1.21
#> 6     5 0.991  1          1.2

What I don't like is that I have to create a row for each user (and that could be millions) to get the results I want. Is there a simpler/better way to do it?

CodePudding user response:

You may be able to do something like this:

  1. Extend the creation of your initial frame to get the proportion in each of A and B, and pivot wider:
set.seed(123)
df <- bind_rows(
  tibble(group = "A", x = rpois(100, 1)),
  tibble(group = "B", x = rpois(200, 2))
) %>% 
  count(group, x) %>% 
  group_by(group) %>%
  mutate(prop = n / sum(n)) %>%
  pivot_wider(id_cols = x, names_from = group, values_from = prop)

With the seed above, this gives you a frame like this:

# A tibble: 7 x 3
      x     A     B
  <int> <dbl> <dbl>
1     0  0.35 0.095
2     1  0.38 0.33 
3     2  0.21 0.285
4     3  0.04 0.14 
5     4  0.02 0.085
6     5 NA    0.055
7     6 NA    0.01 
  2. Create a function that estimates the rel_uplift, while also returning an updated set of group B proportions and group B values (i.e. x values):
f <- function(a, aval, bvec, bvals) {
  # indices of the B tiles needed to cover a proportion `a` of group B
  cindex <- which(cumsum(bvec) >= a)
  if (length(cindex) == 0) bindex <- seq_along(bvec)
  else bindex <- 1:min(cindex)
  # mass of the last covered B tile left over after taking exactly `a`
  rem <- sum(bvec[bindex]) - a
  # weighted sum of B's x values over exactly `a` worth of users
  bmean <- sum(bvals[bindex] * (bvec[bindex] - c(rep(0, length(bindex) - 1), rem)))
  if (length(bindex) > 1) {
    # drop the fully consumed tiles, keeping the partially consumed last one
    if (rem != 0) bindex <- bindex[1:(length(bindex) - 1)]
    bvec <- bvec[-bindex]
    bvals <- bvals[-bindex]
  }
  bvec[1] <- rem
  list("rel_uplift" = bmean / (a * aval), "bvec" = bvec, "bvals" = bvals)
}
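To see what one call does, here is a hand-trace of the first iteration, with values taken from the seeded frame shown in step 1 (the x = 0 tile of A, so a = 0.35 and aval = 0):

fres0 <- f(0.35, 0, df$B, df$x)
# cumsum(df$B) = 0.095 0.425 ..., so the lowest 35% of B spans all of its
# x = 0 tile (9.5%) plus 25.5% of its x = 1 tile: bmean = 0*0.095 + 1*0.255
# rel_uplift is Inf because this A tile's mean x is zero (a * aval = 0)
fres0$bvec   # 0.075 0.285 0.140 0.085 0.055 0.010 (0.075 left of the x = 1 tile)
fres0$bvals  # 1 2 3 4 5 6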
  3. Initialize an empty data frame for the results, and a list called fres containing the initial bvec and bvals:
result <- data.frame()
fres <- list("bvec" = df$B, "bvals" = df$x)
  4. Use a for loop over the values of df$A, each time getting the rel_uplift and preparing the updated bvec and bvals for the next iteration:
for (a in df %>% filter(!is.na(A)) %>% pull(A)) {
  x <- df %>% filter(A == a) %>% pull(x)
  # consume the matching share of group B and record the uplift for this tile
  fres <- f(a, x, fres[["bvec"]], fres[["bvals"]])
  result <- rbind(result, data.frame(x = x, A = a, rel_uplift = fres[["rel_uplift"]]))
}

result

  x    A rel_uplift
1 0 0.35        Inf
2 1 0.38   1.855263
3 2 0.21   1.726190
4 3 0.04   1.666667
5 4 0.02   1.375000

CodePudding user response:

If I understand right, you want to compare counts by two parameters simultaneously, i.e. by $group and by $x. From the example in the initial post I see that not all values of $x may be present in each group.

Summarizing by two co-variables can be done with base R. Here is a simple function (assuming you're always looking at $group and $x):

countnByGroup <- function(xx, asPercent=FALSE) {
  lev <- unique(xx$x)
  grp <- unique(xx$group)
  # one column per group, one row per level of x; NA where a level is absent
  out <- sapply(grp, function(x) {
    z <- rep(NA, length(lev))
    names(z) <- lev
    w <- which(xx$group==x)
    if(length(w) > 0) z[match(xx$x[w], lev)] <- xx$n[w]
    z })
  if(asPercent) out <- 100*apply(out, 2, function(x) x/sum(x, na.rm=TRUE))
  out }

Note that in the function above the main argument was called 'xx' to avoid confusion with $x.

df       # produced using the code from your example
## A tibble: 13 x 3
#   group     x     n
#   <chr> <int> <int>
# 1 A         0    36
# 2 A         1    38
# 3 A         2    19
# 4 A         3     6
# 5 A         4     1
# 6 B         0    27
# 7 B         1    44
# 8 B         2    55
# 9 B         3    44
#10 B         4    21
#11 B         5     6
#12 B         6     2
#13 B         8     1

One gets:

countnByGroup(df)
#   A  B
#0 36 27
#1 38 44
#2 19 55
#3  6 44
#4  1 21
#5 NA  6
#6 NA  2
#8 NA  1

## and
countnByGroup(df, asPercent=T)
#   A    B
#0 36 13.5
#1 38 22.0
#2 19 27.5
#3  6 22.0
#4  1 10.5
#5 NA  3.0
#6 NA  1.0
#8 NA  0.5

As long as you don't apply any rounding, the results are as precise as the input allows. Since group A happens to contain exactly 100 users, its counts equal its percentages, which is why the percent values for A are all integers here.

Another interesting option may be to consider two-way tables in R using table(). In that case, however, you need the entries as separate lines (one row per user) rather than already aggregated into counts as in your example above.
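A minimal sketch of that idea, assuming you still have (or regenerate) the raw one-row-per-user data from the question:

set.seed(123)
raw <- bind_rows(                      # one row per user, no pre-counting
  tibble(group = "A", x = rpois(100, 1)),
  tibble(group = "B", x = rpois(200, 2))
)

table(raw$x, raw$group)                        # two-way counts of x by group
100 * prop.table(table(raw$x, raw$group), 2)   # column percentages, as above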
