Home > other >  How to add column with number of occurances in dataset?
How to add column with number of occurances in dataset?

Time:06-06

I have a dataframe in R with many rows that are duplicate:

header1 header2
tuna apple
orange salmon
orange trout
blue trout
orange salmon
tuna apple

As you can see, rows 1 & 6 and 2 & 5 are exact duplicates of each other.

I would like to remove the duplicates, but create a third column where it lists the number of occurrences:

header1 header2 NEW: Number of Occurances
tuna apple 2
orange salmon 2
orange trout 1
blue trout 1

Does anyone have an idea for how I can do this? I'm really stuck. Any help is much appreciated.


---------
[I did take multiple shots at it. I don't think they're close to being correct, but here's what I was trying]:
countx = 1
county = 2

while (county <= 6){
  if ((df[countx,] == df[county,]) == T & T){
    print("true")
  } else {
    print ("false")
  }
  county <- county 1

}

However this will return "true" even if only one of the columns matches another one – for instance, "orange" in row 2 and "orange" in row 3, even though the other cell in the row doesn't match.
I was also trying an if else, but don't really know where to take it:

if(duplicated(df[1,])==T){
  xx
}else{
  xx
}

As you can tell, I'm a self-taught newbie trying to use R for my thesis and getting a bit in over my head. Would really appreciate any help! TIA!

CodePudding user response:

A possible solution:

library(dplyr)

df %>% 
  group_by(header1, header2) %>% 
  summarise(n = n(), .groups = "drop")

#> # A tibble: 4 × 3
#>   header1 header2     n
#>   <chr>   <chr>   <int>
#> 1 blue    trout       1
#> 2 orange  salmon      2
#> 3 orange  trout       1
#> 4 tuna    apple       2

CodePudding user response:

The easiest is to simply use count from dplyr:

library(dplyr)

df %>% 
  count(header1, header2)

Output

  header1 header2 n
1    blue   trout 1
2  orange  salmon 2
3  orange   trout 1
4    tuna   apple 2

Or with tally:

df %>%
  group_by(header1, header2) %>%
  tally() %>% 
  ungroup

Or another option with data.table:

library(data.table)
dt <- as.data.table(df)

dt[, list(count =.N), by=list(header1, header2)]

Or you can use ddply from plyr:

plyr::ddply(df, c("header1", "header2"), nrow)
  • Related