Home > Software design >  Find duplicates in a df and count them separatedly with R
Find duplicates in a df and count them separatedly with R

Time:07-24

I have a dataframe as it follows

Id   |  Code  | year
1    |  ZZZ   | 2016
1    |  KKK   | 2016
1    |  A23   | 2018
2    |  A01   | 2018
2    |  KKK   | 2016
2    |  ddd   | 2017
3    |  KKK   | 2016
3    |  ZZZ   | 2016
4    |  A23   | 2018
4    |  000   | 2018
5    |  009   | 2018

What I need is a table where for each id the table has a count of how many codes has at least a duplicated value in the df, separated by year. This should be an example af the output based on the df shown above.

Id  |  2016  |  2017  |  2018  | 
1   |  2     |   0    |   1    | 
2   |  1     |   0    |   0    |
3   |  2     |   0    |   0    | 
4   |  0     |   0    |   1    | 

CodePudding user response:

dat <- structure(list(Id = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5), Code = c("ZZZ", 
"KKK", "A23", "A01", "KKK", "ddd", "KKK", "ZZZ", "A23", "000", 
"009"), year = c(2016L, 2016L, 2018L, 2018L, 2016L, 2017L, 2016L, 
2016L, 2018L, 2018L, 2018L)), row.names = c(NA, -11L), class = "data.frame")

You can use:

## this helps to keep 2017 column
dat$Year <- as.factor(dat$year)
## keep Code that appears more than once then form contingency table
table(subset(dat, duplicated(Code) | duplicated(Code, fromLast = T), c(Id, Year)))
#   Year
#Id  2016 2017 2018
#  1    2    0    1
#  2    1    0    0
#  3    2    0    0
#  4    0    0    1

CodePudding user response:

You could use xtabs function:

dat$year = factor(dat$year)
# tabulate Id and Year for duplicated Code values
xtabs( ~Id   year, dat, subset=duplicated(Code) | duplicated(Code, fromLast=TRUE))

   year
Id  2016 2017 2018
  1    2    0    1
  2    1    0    0
  3    2    0    0
  4    0    0    1
  • Related