I have a dataframe in R with many rows that are duplicate:
header1 | header2 |
---|---|
tuna | apple |
orange | salmon |
orange | trout |
blue | trout |
orange | salmon |
tuna | apple |
As you can see, rows 1 & 6 and 2 & 5 are exact duplicates of each other.
I would like to remove the duplicates, but create a third column where it lists the number of occurrences:
header1 | header2 | NEW: Number of Occurances |
---|---|---|
tuna | apple | 2 |
orange | salmon | 2 |
orange | trout | 1 |
blue | trout | 1 |
Does anyone have an idea for how I can do this? I'm really stuck. Any help is much appreciated.
---------
[I did take multiple shots at it. I don't think they're close to being correct, but here's what I was trying]:
countx = 1
county = 2
while (county <= 6){
if ((df[countx,] == df[county,]) == T & T){
print("true")
} else {
print ("false")
}
county <- county 1
}
However this will return "true" even if only one of the columns matches another one – for instance, "orange" in row 2 and "orange" in row 3, even though the other cell in the row doesn't match.
I was also trying an if else, but don't really know where to take it:
if(duplicated(df[1,])==T){
xx
}else{
xx
}
As you can tell, I'm a self-taught newbie trying to use R for my thesis and getting a bit in over my head. Would really appreciate any help! TIA!
CodePudding user response:
A possible solution:
library(dplyr)
df %>%
group_by(header1, header2) %>%
summarise(n = n(), .groups = "drop")
#> # A tibble: 4 × 3
#> header1 header2 n
#> <chr> <chr> <int>
#> 1 blue trout 1
#> 2 orange salmon 2
#> 3 orange trout 1
#> 4 tuna apple 2
CodePudding user response:
The easiest is to simply use count
from dplyr
:
library(dplyr)
df %>%
count(header1, header2)
Output
header1 header2 n
1 blue trout 1
2 orange salmon 2
3 orange trout 1
4 tuna apple 2
Or with tally
:
df %>%
group_by(header1, header2) %>%
tally() %>%
ungroup
Or another option with data.table
:
library(data.table)
dt <- as.data.table(df)
dt[, list(count =.N), by=list(header1, header2)]
Or you can use ddply
from plyr
:
plyr::ddply(df, c("header1", "header2"), nrow)