Identify duplicated rows based on two columns and reassign-CodePudding

So I have the following data frame with columns "Longitude", "Latitude" and "Prob", for probability. Prob is always either 0, 1 or 2 (note you can't see a row with Prob=2 in the example below).

    Longitude  Latitude Prob
1    28.00403 -26.50431    1
2    28.01236 -26.50431    1
3    28.02069 -26.50431    1
4    28.02903 -26.50431    0
5    28.03736 -26.50431    0
6    28.04569 -26.50431    0
7    28.05403 -26.50431    0
8    28.06236 -26.50431    1
9    28.07069 -26.50431    1

What I want to do is identify rows that have the same Longitude and Latitude values. If, for example, rows 5 and 7 show the same Longitude and Latitude values, and row 5 has Prob 1 and row 7 has probability 2, I want to remove one of those rows and make Prob equal 3 for the row that remains.

I.e., I want to select duplicated rows based on two columns, Longitude and Latitude. If two rows have the same longitude and latitude values, and if they show Prob 1 and 2, I want to "merge" these rows and make Prob 3.

I hope this makes sense. Thanks.

CodePudding user response：

If I understood you correctly, you just want to group by Long, Lat and take the sum of Prob.

df %>%
  group_by(Longitude, Latitude) %>%
  summarise(Prob = sum(Prob))

CodePudding user response：

You should try to solve the problem and post your challenges here, rather than just the starting question. This page might also help.

Anyway, something like this might work:

#fake data
df = data.frame(
  x = c(1,2,3,3,5),
  y = c(2,3,3,3,4),
  prob = rep('a',5)
)

# find duplicates
index_both = duplicated(df) | duplicated(df, fromLast=TRUE)
index_drop = duplicated(df)

# assign new value
df[index_both,'prob'] = 'b'

# drop duplicate
df = df[!index_drop,]