I have the following tibble of data:
df1 <- tibble(
period = c("2010END", "2011END",
"2010Q1","2010Q2","2011END"),
date = c('31-12-2010','31-12-2011', '30-04-2010','31-07-2010','30-09-2010'),
website = c(
"google",
"google",
"facebook",
"facebook",
"youtube"
),
method = c("website",
"phone",
"website",
"laptop",
"phone"),
values = c(1, NA, 1, 2, 3))
And a tibble of qaurtile ranks:
df2 <-
structure(
list(
method = c(
"phone",
"phone",
"phone",
"website",
"website",
"website",
"laptop",
"laptop",
"laptop"
),
rank = c(3,
2, 1, 3, 2, 1, 3, 2, 1),
tile_condition = c(
"lowest 25%",
"25 to 50%",
"more than 50%",
"highest 25%",
"25 to 75%",
"less than 25%",
"lowest 25%",
"25 to 50%",
"more than 50%"
)
),
class = c("tbl_df",
"tbl", "data.frame"),
row.names = c(NA, -9L)
)
I'm trying to find a way of finding of creating a dataframe that has the ranks like in df2 based on the values in df1
However I would like to calculate the quartiles with this grouping:
df1 %>% group_by(period, method)
So that the quartiles are created against the grouping of period and method e.g. for the period 2010END and the methods - find the quartiles for those values. Then left join the ranking scores from df2 there so we can tell where those values lie in the quartiles
Is there a way to do this in tidyverse?
CodePudding user response:
Your sample df1
is too small to demonstrate the solution so this creates an artificial dataset structured like yours.
library(data.table)
##
# this creates an artificial dataset...
# with enough data to demonstrate the solution
#
df1 <- data.table(period=c('2010END', '2011END'))
df1 <- df1[, .(method=c('phone', 'website', 'laptop')), by=.(period)]
df1 <- df1[, .(values=rpois(sample(50:100, 1), 10)), by=.(period, method)]
##
# you start here...
#
setDT(df1) # with your df1 will need this to convert to data.table
labs <- c('q1', 'q2', 'q3', 'q4')
df1[, tile_condition:=cut(values, breaks= quantile(values, probs = c(0, .25, 0.50, 0.75, 1.0)),
labels=labs, include.lowest = TRUE)
, by=.(period, method)]
CodePudding user response:
A tidyverse solution :
## 1.We group the data and create bins with cut function (3 breaks as requested)
## 2.We transform the intervals (factor data) to numbers with dense_rank function
## 3.We recover the description of each rank from the 2nd dataframe
df1 %>%
group_by(period,method) %>%
mutate(q=cut(values,
breaks = unique(quantile(values, probs = c(0,.25, 0.50,1))),
include.lowest = TRUE)) %>%
arrange(q) %>%
mutate(rank=dense_rank(q)) %>%
left_join(y = df2,by = c("method","rank"))