Find columns with different values in duplicate rows-CodePudding

I have a data set that has some duplicate records. For those records, most of the column values are the same, but a few ones are different.

I need to identify the columns where the values are different, and then subset those columns.

This would be a sample of my dataset:

library(data.table)

dat <- "ID location date status observationID observationRep observationVal latitude longitude setSource
FJX8KL loc1 2018-11-17 open 445 1 17.6 -52.7 -48.2 XF47
FJX8KL loc2 2018-11-17 open 445 2 1.9  -52.7 -48.2 LT12"

dat <- setDT(read.table(textConnection(dat), header=T))

And this is the output I would expect:

   observationRep observationVal setSource
1:              1           17.6      XF47
2:              2            1.9      LT12

One detail is: my original dataset has 189 columns, so I need to check all of them.

How to achieve this?

CodePudding user response：

Two issues, first, use text= argument rather than textConnection, second, use as.data.table, since seDT modifies object in place, but it yet isn't there.

dat1 <- data.table::as.data.table(read.table(text=dat, header=TRUE))
dat1[, c('observationRep', 'observationVal', 'setSource')]
#    observationRep observationVal setSource
# 1:              1           17.6      XF47
# 2:              2            1.9      LT12

CodePudding user response：

I think your code would look something like this.

import pandas as pd
    
# load the dataframe
df = pd.read_csv("data.csv")
    
    # find the duplicate records
    duplicates = df[df.duplicated()]
    
    # subset the dataframe to only include the columns where the values are different
    differences = df[duplicates.columns].loc[duplicates]
    
    # print the resulting dataframe
    print(differences)