I am working with the R programming language. I have the following dataset:
id = sample.int(50,1000, replace = TRUE)
years <- c("2010", "2011", "2013", "2014", "2015")
year<- sample(years, 1000, prob=c(0.2,0.2,0.2,0.2, 0.2), replace = TRUE)
my_data = data.frame(id, year)
my_data <- my_data[order(id),]
For example, patient # 1 visited the hospital twice in 2010, once in 2011, no times in 2012 ... patient # 2 visited the hospital no times in 2010, 5 times in 2011, 3 times in 2012, etc.
For this dataset, I want to find out the number of the number of times each combination of "years" appears (not including duplicates). For example - since there are 5 possible years, there will be 2^5 = 32 possible combinations :
# sample output of the final result
combinations frequency
1 2010 11
2 2011 9
3 2012 5
4 2013 1
5 2014 19
6 2015 11
7 2011,2012 9
8 2011, 2012, 2013 5
9 2013, 2015 1
10 2010,2011,2012,2013,2014,2015 19
This would be the equivalent of finding out:
- How many patients visited the hospital only in 2010?
- How many patients visited the hospital in 2010 and 2013?
- etc.
I tried to find a way to do this different ways:
# Method 1: Did not work
library(data.table)
final = setDT(my_data)[,list(Count=.N) ,names(my_data)]
# Method 2: Did not work
final = = aggregate(.~year,rev(aggregate(.~id,my_data,paste0,collapse = ",")),length)
# Method 3: Not sure if this is correct? There are 50 rows (instead of 32) and all frequency counts are 1?
library(dplyr)
library(stringr)
df = my_data %>%
group_by(id) %>%
summarise(years = str_c(year, collapse=",")) %>%
count(years)
My Question: Can someone please show me how to fix this?
Thanks!
CodePudding user response:
You need to first get the combinations of years for each patient and then count them
# with data.table
setDT(my_data)
my_data_agg<-my_data[,.(combinations=paste(unique(year), collapse=",")), by=.(id)]
my_data_agg[,.N, by=.(combinations)]
CodePudding user response:
I think you want something like this:
library(dplyr)
my_data %>%
group_by(id, year) %>%
summarise(frequency = n()) %>%
group_by(id) %>%
summarise(combinations = str_c(year, collapse=",")) %>%
count(combinations)
Output:
`summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
# A tibble: 2 × 2
combinations n
<chr> <int>
1 2010,2011,2013,2014,2015 49
2 2010,2011,2013,2015 1
This means that there is one patient, which does not went to the hospital in 2014. And it seems that all patient do not went to the hospital in 2012.
By using this code, we can check which patient did not went to the hospital in 2014:
my_data %>%
group_by(id) %>%
filter(!all(2014 %in% year)) %>%
ungroup %>%
distinct(id) %>%
pull(id)
Output:
[1] 12
So patient 12 did not went to hospital in 2014.
CodePudding user response:
In base R:
table(tapply(my_data$year, my_data$id,
function(x) paste(sort(unique(x)), collapse=',')))