Home > Mobile >  Create new columns based on count of occurrence in another data frame in r
Create new columns based on count of occurrence in another data frame in r

Time:03-23

I want to create new columns in df1 based on count of occurrence of columns in df2

df1:

df2:

[![enter image description here][2]][2]

For ID 100 in data frame 1 no RM assigned in data frame 2, for ID 103 there are 2 RM in dataframe 2 and for 108 there are 3 RM

So my final data frame is

enter image description here

I tried merge function using left join but not sure how to count the numbers of non empty cells

Here is the sample dataset

  df1 <- data.frame(Id = c(100,101,103,105,108,109),
               channel = c("A","C","C","C","D","D"),
               duration = c(12,23,56,89,73,76))

  df2 <- data.frame(ID=c(100,103,109,105,101,108),
                RM1= c("","john","","Miller","","Maddy"),
                RM2 = c("","Ryan","","","","sean"),
                RM3 = c("","","","","","Arvind"))

CodePudding user response:

Add the aggregate in df2 and merge

df2$Total=rowSums(df2[,-1]!="")

merge(
  df1,
  df2[c("ID","Total")],
  by="ID",
  all.x=T
)

   ID channel duration Total
1 100       A       12     0
2 101       C       23     0
3 103       C       56     2
4 105       C       89     1
5 108       D       73     3
6 109       D       76     0

CodePudding user response:

Here is a tidyverse solution. Join by ID, reshape to long format and count the values not left blank.

df1 <- data.frame(ID = c(100,101,103,105,108,109),
                  channel = c("A","C","C","C","D","D"),
                  duration = c(12,23,56,89,73,76))

df2 <- data.frame(ID=c(100,103,109,105,101,108),
                  RM1= c("","john","","Miller","","Maddy"),
                  RM2 = c("","Ryan","","","","sean"),
                  RM3 = c("","","","","","Arvind"))

suppressPackageStartupMessages(library(dplyr))
library(tidyr)

inner_join(df1, df2) %>%
  pivot_longer(cols = starts_with("RM")) %>%
  group_by(ID, channel, duration) %>%
  summarise(Total = sum(value != ""), .groups = "drop")
#> Joining, by = "ID"
#> # A tibble: 6 x 4
#>      ID channel duration Total
#>   <dbl> <chr>      <dbl> <int>
#> 1   100 A             12     0
#> 2   101 C             23     0
#> 3   103 C             56     2
#> 4   105 C             89     1
#> 5   108 D             73     3
#> 6   109 D             76     0

Created on 2022-03-23 by the reprex package (v2.0.1)

  •  Tags:  
  • r
  • Related