I have a dataset that looks something like this:
name = c("john", "john", "john", "alex","alex", "tim", "tim", "tim", "ralph", "ralph")
year = c(2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012, 2014, 2016)
my_data = data.frame(name, year)
name year
1 john 2010
2 john 2011
3 john 2012
4 alex 2011
5 alex 2012
6 tim 2010
7 tim 2011
8 tim 2012
9 ralph 2014
10 ralph 2016
I want to count the two following things in this dataset:
-
- Groups based on all years
-
- And of these groups, the number of groups with at least one non-consecutive year
As an example for 1):
# sample output for 1)
year count
1 2010, 2011, 2012 2
2 2011, 2012 1
3 2014, 2016 1
And as an example of 2) - only row 3 (in the above data frame) contains a missing year (i.e. 2014 to 2016 without 2015). Thus, the output would look something like this:
# sample output for 2)
year count
1 2014, 2016 1
Can someone please show me how to do this in R? And is there a way to make sure that (2011, 2012) is considered the same as (2012, 2011) ?
CodePudding user response:
Here are base R solutions.
# 1.
agg <- aggregate(year ~ name, my_data, paste, collapse = ", ")
as.data.frame(table(agg$year))
#> Var1 Freq
#> 1 2010, 2011, 2012 2
#> 2 2011, 2012 1
#> 3 2014, 2016 1
# 2.
agg <- aggregate(year ~ name, my_data, c)
agg <- agg$year[sapply(agg$year, \(y) any(diff(y) != 1))]
as.data.frame(table(sapply(agg, paste, collapse = ", ")))
#> Var1 Freq
#> 1 2014, 2016 1
# final clean up
rm(agg)
Created on 2022-12-03 with reprex v2.0.2
Edit
Answering to the comment/request,
Is there a way to make sure that (2011, 2012) is considered the same as (2012, 2011) ?
a way is to, in each group of name
, first sort the data by year. Then run the code above.
my_data <- my_data[order(my_data$name, my_data$year), ]
CodePudding user response:
Here is solution with dplyr
and tidyr
:
library(dplyr)
library(tidyr)
### 1.
my_data %>%
group_by(name) %>%
mutate(year = toString(year)) %>%
distinct(year) %>%
ungroup() %>%
count(year, name="count")
year count
<chr> <int>
1 2010, 2011, 2012 2
2 2011, 2012 1
3 2014, 2016 1
### 2.
my_data %>%
group_by(name) %>%
mutate(x = lead(year) - year) %>%
fill(x, .direction = "down") %>%
ungroup () %>%
filter(x >= max(x)) %>%
mutate(year = toString(year)) %>%
distinct(year) %>%
ungroup() %>%
count(year, name="count")
year count
<chr> <int>
1 2014, 2016 1
CodePudding user response:
Using toString
and sort
in by
.
by(my_data$year, my_data$name, \(x) toString(sort(x))) |> table() |> as.data.frame()
# Var1 Freq
# 1 2010, 2011, 2012 2
# 2 2011, 2012 1
# 3 2014, 2016 1
Order doesn't matter:
set.seed(42)
my_data <- my_data[sample(nrow(my_data)), ]
by(my_data$year, my_data$name, \(x) toString(sort(x))) |> table() |> as.data.frame()
# Var1 Freq
# 1 2010, 2011, 2012 2
# 2 2011, 2012 1
# 3 2014, 2016 1
Data:
my_data <- structure(list(name = c("john", "john", "john", "alex", "alex",
"tim", "tim", "tim", "ralph", "ralph"), year = c(2010, 2011,
2012, 2011, 2012, 2010, 2011, 2012, 2014, 2016)), class = "data.frame", row.names = c(NA,
-10L))