Home > Software design >  Rank a dataframe based on two columns with same values in R
Rank a dataframe based on two columns with same values in R

Time:09-19

I have a database of animals that have been tested. Animals are grouped in herds. Each herd can be tested multiple times. I would like to make a new column that will tell me if it was the first time or the second that this herd has been tested.

This is an example of my database:

df <- data.frame(
        animal = c("Animal1", "Animal2", "Animal3", "Animal1", "Animal2", "Animal3", "Animal4", "Animal5", "Animal6", "Animal4", "Animal5", "Animal6"),
        herd = c("Herd1","Herd1","Herd1", "Herd1","Herd1","Herd1","Herd2","Herd2", "Herd2","Herd2","Herd2","Herd2"),
        date = c("2017-01-01", "2017-01-01", "2017-01-01", "2018-07-01" , "2018-07-01", "2018-07-01", "2017-05-01", "2017-05-01", "2017-05-01", "2019-07-01", "2019-07-01", "2019-07-01"))

So, i would like it to look like this

     animal herd    date        testing
1   Animal1 Herd1   2017-01-01  1 
2   Animal2 Herd1   2017-01-01  1
3   Animal3 Herd1   2017-01-01  1
4   Animal1 Herd1   2018-07-01  2
5   Animal2 Herd1   2018-07-01  2
6   Animal3 Herd1   2018-07-01  2
7   Animal4 Herd2   2017-05-01  1
8   Animal5 Herd2   2017-05-01  1
9   Animal6 Herd2   2017-05-01  1
10  Animal4 Herd2   2019-07-01  2
11  Animal5 Herd2   2019-07-01  2
12  Animal6 Herd2   2019-07-01  2

I've tried this but is not exactly what I would like, and gets really messy with the whole database

df <- df %>%
  group_by(herd) %>%
  mutate(testing = rank(date))

> df
# A tibble: 12 x 4
# Groups:   herd [2]
   animal  herd  date       testing
   <fct>   <fct> <fct>        <dbl>
 1 Animal1 Herd1 2017-01-01       2
 2 Animal2 Herd1 2017-01-01       2
 3 Animal3 Herd1 2017-01-01       2
 4 Animal1 Herd1 2018-07-01       5
 5 Animal2 Herd1 2018-07-01       5
 6 Animal3 Herd1 2018-07-01       5
 7 Animal4 Herd2 2017-05-01       2
 8 Animal5 Herd2 2017-05-01       2
 9 Animal6 Herd2 2017-05-01       2
10 Animal4 Herd2 2019-07-01       5
11 Animal5 Herd2 2019-07-01       5
12 Animal6 Herd2 2019-07-01       5

Thanks for your help!

CodePudding user response:

You can use dplyr::dense_rank:

df %>% 
  group_by(herd) %>% 
  mutate(testing = dense_rank(date))

output

   animal  herd  date       testing
   <chr>   <chr> <chr>        <int>
 1 Animal1 Herd1 2017-01-01       1
 2 Animal2 Herd1 2017-01-01       1
 3 Animal3 Herd1 2017-01-01       1
 4 Animal1 Herd1 2018-07-01       2
 5 Animal2 Herd1 2018-07-01       2
 6 Animal3 Herd1 2018-07-01       2
 7 Animal4 Herd2 2017-05-01       1
 8 Animal5 Herd2 2017-05-01       1
 9 Animal6 Herd2 2017-05-01       1
10 Animal4 Herd2 2019-07-01       2
11 Animal5 Herd2 2019-07-01       2
12 Animal6 Herd2 2019-07-01       2

CodePudding user response:

A one-liner using seq in ave.

transform(df, testing=ave(date, herd, animal, FUN=seq.int))
#     animal  herd       date testing
# 1  Animal1 Herd1 2017-01-01       1
# 2  Animal2 Herd1 2017-01-01       1
# 3  Animal3 Herd1 2017-01-01       1
# 4  Animal1 Herd1 2018-07-01       2
# 5  Animal2 Herd1 2018-07-01       2
# 6  Animal3 Herd1 2018-07-01       2
# 7  Animal4 Herd2 2017-05-01       1
# 8  Animal5 Herd2 2017-05-01       1
# 9  Animal6 Herd2 2017-05-01       1
# 10 Animal4 Herd2 2019-07-01       2
# 11 Animal5 Herd2 2019-07-01       2
# 12 Animal6 Herd2 2019-07-01       2

CodePudding user response:

You could use data.table like this:

library("data.table")
setDT(df)

df[,testing := .N, by = list(animal,herd)][,testing := seq_along(testing), by = list(animal,herd)]

Or in another way:

library("data.table")
setDT(df)

df[,testing := NA][,testing := seq_along(testing), by = list(animal,herd)]

In both cases you use the data.table package, its dt[i,j,by] syntax and its chaining usage dt[first][second]

  • Related