I have the following dataframe and want to group by the grp
column to see how many of each column-value appears in each group.
> data.frame(grp = unlist(strsplit("aabbccca", "")), col1=unlist(strsplit("ABAABBAB", "")), col2=unlist(strsplit("BBCCCCDD", "")))
grp col1 col2
1 a A B
2 a B B
3 b A C
4 b A C
5 c B C
6 c B C
7 c A D
8 a B D
Desired result:
grp col1A col1B col2B col2C col2D
1 a 1 2 2 0 1
2 b 2 0 0 2 0
3 c 1 2 0 2 1
If I only look at the grp
and col1
columns, it is easy to solve this using table()
and when there are only 2 columns, I could merge table(df[c('grp', 'col1')])
with table(df[c('grp', 'col2')])
. However, this gets extremely cumbersome as the number of factor columns grows, and is problematic if there are shared values between col1
and col2
.
Note that dplyr's count doesn't work, as it looks for unique combinations of the col1 and col2.
I've tried melting and spreading the dataframe using tidyr, without any luck
> pivot_longer(df, c(col1, col2), names_to= "key", values_to = "val") %>% pivot_wider("grp", names_from = c("key", "val"), values_from = 1, values_fn = sum)
Error in `stop_subscript()`:
! Can't subset columns that don't exist.
x Column `grp` doesn't exist.
I can find plenty of solutions that work for the case where I have 1 group column and 1 value column, but I can't figure out how to generalize them to more columns.
CodePudding user response:
You can stack col1
& col2
together, count the number of each combination, and then transform the table to a wide form.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(col1:col2) %>%
count(grp, name, value) %>%
pivot_wider(grp, names_from = c(name, value), names_sort = TRUE,
values_from = n, values_fill = 0)
# A tibble: 3 x 6
grp col1_A col1_B col2_B col2_C col2_D
<chr> <int> <int> <int> <int> <int>
1 a 1 2 2 0 1
2 b 2 0 0 2 0
3 c 1 2 0 2 1
A base
solution (Thank @GKi to refine the code):
table(cbind(df["grp"], col=do.call(paste0, stack(df[-1])[2:1])))
col
grp col1A col1B col2B col2C col2D
a 1 2 2 0 1
b 2 0 0 2 0
c 1 2 0 2 1
CodePudding user response:
Use recast
from reshape2
package:
reshape2::recast(df, grp~variable value,id.var = 'grp', fun = length)
grp col1_A col1_B col2_B col2_C col2_D
1 a 1 2 2 0 1
2 b 2 0 0 2 0
3 c 1 2 0 2 1
In base R you could do:
with(df, cbind(table(grp, paste0('col1_', col1)), table(grp, paste0('col2_', col2))))
col1_A col1_B col2_B col2_C col2_D
a 1 2 2 0 1
b 2 0 0 2 0
c 1 2 0 2 1
If you have many columns consider doing:
do.call(cbind, Map(function(x, y) table(df$grp, paste(x,y, sep = '_')),
names(df)[-1], df[,-1]))
col1_A col1_B col2_B col2_C col2_D
a 1 2 2 0 1
b 2 0 0 2 0
c 1 2 0 2 1
You can then turn this to a dataframe
CodePudding user response:
You were on the right track with melt
and spread
. Here's a tidyverse solution. I first use pivot_longer
to generalise to an arbitrary number of columns and then pivot_wider
to return to the desired output format. The order of columns in the output data frame is data dependent. If this is an issue, simply append a select
to the end of the pipe to obtain the desired order. (Or use names_sort
as in @DarrenTsai's answer.)
library(tidyverse)
d %>%
pivot_longer(
starts_with("col"),
names_to="Column",
values_to="Value"
) %>%
group_by(grp, Column, Value) %>%
summarise(N=n(), .groups="drop") %>%
group_by(grp) %>%
pivot_wider(
id_cols=grp,
values_from=N,
names_from=c(Column, Value),
names_sep="",
values_fill=0
) %>%
ungroup()
# A tibble: 3 × 6
grp col1A col1B col2B col2D col2C
<chr> <int> <int> <int> <int> <int>
1 a 1 2 2 1 0
2 b 2 0 0 0 2
3 c 1 2 0 1 2
CodePudding user response:
Another possible solution, based on a tidyr::pivot_longer
followed by a tidyr::pivot_wider
and using values_fn = length
:
library(tidyverse)
df %>%
pivot_longer(c(col1, col2)) %>%
mutate(name = str_c(name, value)) %>%
pivot_wider(grp, values_fn = length, values_fill = 0, names_sort = T)
#> # A tibble: 3 x 6
#> grp col1A col1B col2B col2C col2D
#> <chr> <int> <int> <int> <int> <int>
#> 1 a 1 2 2 0 1
#> 2 b 2 0 0 2 0
#> 3 c 1 2 0 2 1
CodePudding user response:
In data.table
, we can use dcast
melt
like below
dcast(
melt(setDT(df), id.vars = "grp")[
, value := paste(variable, value, sep = "_")
], grp ~ value
)
to produce
grp col1_A col1_B col2_B col2_C col2_D
1: a 1 2 2 0 1
2: b 2 0 0 2 0
3: c 1 2 0 2 1