Identify duplicated rows based on two columns and reassign

Time:03-25

So I have the following data frame with columns "Longitude", "Latitude" and "Prob", for probability. Prob is always 0, 1, or 2 (no row with Prob = 2 happens to appear in the example below).

    Longitude  Latitude Prob
1    28.00403 -26.50431    1
2    28.01236 -26.50431    1
3    28.02069 -26.50431    1
4    28.02903 -26.50431    0
5    28.03736 -26.50431    0
6    28.04569 -26.50431    0
7    28.05403 -26.50431    0
8    28.06236 -26.50431    1
9    28.07069 -26.50431    1

What I want to do is identify rows that have the same Longitude and Latitude values. If, for example, rows 5 and 7 have the same Longitude and Latitude values, and row 5 has Prob 1 and row 7 has Prob 2, I want to remove one of those rows and set Prob to 3 for the row that remains.

I.e., I want to find duplicated rows based on two columns, Longitude and Latitude. If two rows have the same Longitude and Latitude values, and they have Prob 1 and Prob 2, I want to "merge" them into a single row with Prob 3.

I hope this makes sense. Thanks.

CodePudding user response:

If I understood you correctly, you just want to group by Longitude and Latitude and take the sum of Prob.

library(dplyr)

df %>%
  group_by(Longitude, Latitude) %>%
  summarise(Prob = sum(Prob), .groups = "drop")
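For example, on a small data frame shaped like yours (a minimal sketch; the coordinate values below are made up, with one duplicated pair carrying Prob 1 and Prob 2), the duplicated pair collapses into one row whose Prob is the sum, 3:

```r
library(dplyr)

# Hypothetical data: the second and third rows share the same coordinates
df <- data.frame(
  Longitude = c(28.004, 28.012, 28.012),
  Latitude  = c(-26.504, -26.504, -26.504),
  Prob      = c(0, 1, 2)
)

df %>%
  group_by(Longitude, Latitude) %>%
  summarise(Prob = sum(Prob), .groups = "drop")
# The duplicated coordinate pair becomes a single row with Prob = 3;
# unique rows keep their original Prob (0 + nothing = 0)
```

Note that summing also means a pair of rows both having Prob 1 would yield Prob 2, so this only matches your rule exactly if duplicated pairs always carry Prob 1 and 2.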

CodePudding user response:

You should try to solve the problem yourself and post the specific difficulties you run into, rather than just the starting question.

Anyway, something like this might work:

# fake data: rows 3 and 4 share the same (x, y) pair
df = data.frame(
  x = c(1, 2, 3, 3, 5),
  y = c(2, 3, 3, 3, 4),
  prob = rep('a', 5)
)

# find duplicates based only on the two key columns
keys = df[, c('x', 'y')]
index_both = duplicated(keys) | duplicated(keys, fromLast = TRUE)  # every row of a duplicated pair
index_drop = duplicated(keys)                                      # all but the first of each pair

# assign the new value to every duplicated row
df[index_both, 'prob'] = 'b'

# drop the redundant copies
df = df[!index_drop, ]
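Applied to the original Longitude/Latitude question, the same flag-and-drop idea can be written directly against those columns (a sketch using made-up coordinates, base R only, with the Prob-3 rule hard-coded rather than summed):

```r
# Hypothetical data with one duplicated coordinate pair (Prob 1 and 2)
df <- data.frame(
  Longitude = c(28.004, 28.012, 28.012),
  Latitude  = c(-26.504, -26.504, -26.504),
  Prob      = c(0, 1, 2)
)

keys <- df[, c('Longitude', 'Latitude')]
index_both <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

# set Prob to 3 on every row of a duplicated pair, then keep only the first
df[index_both, 'Prob'] <- 3
df <- df[!duplicated(keys), ]
```

Unlike the group-and-sum answer, this sets Prob to 3 for any duplicated pair regardless of the two original Prob values.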