Home > OS >  Summarizing dataset in r to calculate unique values
Summarizing dataset in r to calculate unique values

Time:06-24

In the following dataframe I want to calculate unique household_id - individual_id combination and Average weights and Total Duration after summarizing at country, state and date column.

enter image description here

Household_id 100 have two unique individual(1and 2) and househld_id 101 have three unique individual(1,2,3). So total unique is 5 after summarizing.

The Average weights I want to calculate of this 5 unique individuals i.e. (100 50 200 200 200)/5 =150

Final dataset:

enter image description here

what I did is

data %>% group_by(country,state,date) %>% 
summarise(total_unique = n_distinct(household_id,individual_id),
Tot_Duration = sum(duration))

But not able to calculate the Average_weights. Any help is highly appreciated.

Sample Dataset

  library(dplyr)
  data <- data.frame(country = c("US","US","US","US","US","US","IND","IND"),
               state = c("TX","TX","TX","TX","TX","TX","AP","AP"),
               date = c(20220601,20220601,20220601,20220601,20220601,20220601,20220601,20220601),
               household_id = c(100,100,100,101,101,101,102,102),
               individual_id=c(1,2,1,1,2,3,1,1),
               weights = c(100,50,100,200,200,200,100,100),
               duration = c(10,20,30,40,50,60,70,80))

EDIT Apologies for not putting the right dataset which I realized later. Two update in Dataset

  1. Different individuals may have same weight as in household_id 101
  2. Duration column is added with solution 1 above distinct will not work and with solution 2 unique will not work. Please suggest

I have updated the sample dataset

CodePudding user response:

Building on to your code, you could add an extra statement in summarise:

library(tidyverse)

data %>%
  group_by(country,state,date) %>%
  summarise(total_unique = n_distinct(household_id,individual_id), 
            Average_weights = sum(unique(weights), na.rm = T)/total_unique)

Output

  country state     date total_unique Average_weights
  <chr>   <chr>    <dbl>        <int>           <dbl>
1 IND     AP    20220601            1             100
2 US      TX    20220601            5             210

CodePudding user response:

You may try

library(dplyr)
data %>%
  group_by(country, state, date) %>%
  distinct() %>%
  summarize(total_unique = n(),
            average_Weights = sum(weights)/total_unique)

  country state     date total_unique average_Weights
  <chr>   <chr>    <dbl>        <int>           <dbl>
1 IND     AP    20220601            1             100
2 US      TX    20220601            5             210
  •  Tags:  
  • r
  • Related