Home > Software design >  Counting Number of Unique Column Values Per Group
Counting Number of Unique Column Values Per Group

Time:12-03

I have a dataset that looks something like this:

name = c("john", "john", "john", "alex","alex", "tim", "tim", "tim", "ralph", "ralph")
year = c(2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012, 2014, 2016)
my_data = data.frame(name, year)

    name year
1   john 2010
2   john 2011
3   john 2012
4   alex 2011
5   alex 2012
6    tim 2010
7    tim 2011
8    tim 2012
9  ralph 2014
10 ralph 2016

I want to count the two following things in this dataset:

    1. Groups based on all years
    1. And of these groups, the number of groups with at least one non-consecutive year

As an example for 1):

# sample output for 1)

              year count
1 2010, 2011, 2012     2
2       2011, 2012     1
3       2014, 2016     1

And as an example of 2) - only row 3 (in the above data frame) contains a missing year (i.e. 2014 to 2016 without 2015). Thus, the output would look something like this:

# sample output for 2)

              year count
1       2014, 2016     1

Can someone please show me how to do this in R? And is there a way to make sure that (2011, 2012) is considered the same as (2012, 2011) ? 

CodePudding user response:

Here are base R solutions.

# 1.
agg <- aggregate(year ~ name, my_data, paste, collapse = ", ")
as.data.frame(table(agg$year))
#>               Var1 Freq
#> 1 2010, 2011, 2012    2
#> 2       2011, 2012    1
#> 3       2014, 2016    1

# 2.
agg <- aggregate(year ~ name, my_data, c)
agg <- agg$year[sapply(agg$year, \(y) any(diff(y) != 1))]
as.data.frame(table(sapply(agg, paste, collapse = ", ")))
#>         Var1 Freq
#> 1 2014, 2016    1

# final clean up
rm(agg)  

Created on 2022-12-03 with reprex v2.0.2


Edit

Answering to the comment/request,

Is there a way to make sure that (2011, 2012) is considered the same as (2012, 2011) ?

a way is to, in each group of name, first sort the data by year. Then run the code above.

my_data <- my_data[order(my_data$name, my_data$year), ]

CodePudding user response:

Here is solution with dplyr and tidyr:

library(dplyr)
library(tidyr)

### 1.
my_data %>% 
  group_by(name) %>%  
  mutate(year = toString(year)) %>% 
  distinct(year) %>% 
  ungroup() %>% 
  count(year, name="count")

year             count
<chr>            <int>
1 2010, 2011, 2012     2
2 2011, 2012           1
3 2014, 2016           1

### 2. 
my_data %>% 
  group_by(name) %>% 
  mutate(x = lead(year) - year) %>% 
  fill(x, .direction = "down") %>% 
  ungroup () %>% 
  filter(x >= max(x)) %>% 
  mutate(year = toString(year)) %>% 
  distinct(year) %>% 
  ungroup() %>% 
  count(year, name="count")

year       count
<chr>      <int>
1 2014, 2016     1

CodePudding user response:

Using toString and sort in by.

by(my_data$year, my_data$name, \(x) toString(sort(x))) |> table() |> as.data.frame()
#               Var1 Freq
# 1 2010, 2011, 2012    2
# 2       2011, 2012    1
# 3       2014, 2016    1

Order doesn't matter:

set.seed(42)
my_data <-  my_data[sample(nrow(my_data)), ]
by(my_data$year, my_data$name, \(x) toString(sort(x))) |> table() |> as.data.frame()
#               Var1 Freq
# 1 2010, 2011, 2012    2
# 2       2011, 2012    1
# 3       2014, 2016    1

Data:

my_data <- structure(list(name = c("john", "john", "john", "alex", "alex", 
"tim", "tim", "tim", "ralph", "ralph"), year = c(2010, 2011, 
2012, 2011, 2012, 2010, 2011, 2012, 2014, 2016)), class = "data.frame", row.names = c(NA, 
-10L))
  • Related