I have a data set that has some duplicate records. For those records, most of the column values are the same, but a few are different.
I need to identify the columns where the values are different, and then subset those columns.
This would be a sample of my dataset:
library(data.table)
dat <- "ID location date status observationID observationRep observationVal latitude longitude setSource
FJX8KL loc1 2018-11-17 open 445 1 17.6 -52.7 -48.2 XF47
FJX8KL loc2 2018-11-17 open 445 2 1.9 -52.7 -48.2 LT12"
dat <- setDT(read.table(textConnection(dat), header=T))
And this is the output I would expect:
observationRep observationVal setSource
1: 1 17.6 XF47
2: 2 1.9 LT12
One detail is: my original dataset has 189 columns, so I need to check all of them.
How to achieve this?
CodePudding user response:
Two issues: first, use the text= argument rather than textConnection; second, use as.data.table rather than setDT, since setDT modifies an object in place, and here that object does not exist yet.
dat1 <- data.table::as.data.table(read.table(text=dat, header=TRUE))
dat1[, c('observationRep', 'observationVal', 'setSource')]
# observationRep observationVal setSource
# 1: 1 17.6 XF47
# 2: 2 1.9 LT12
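Typing the three names by hand will not scale to 189 columns, so here is a minimal sketch of finding them programmatically. It assumes duplicates are the rows that share an ID; counts, value_cols, differs, diff_cols and result are just illustrative names, not anything from the question. Note that in the posted sample location also differs (loc1 vs loc2), so it is picked up as well, even though it is not in the expected output.

```r
library(data.table)

dat <- "ID location date status observationID observationRep observationVal latitude longitude setSource
FJX8KL loc1 2018-11-17 open 445 1 17.6 -52.7 -48.2 XF47
FJX8KL loc2 2018-11-17 open 445 2 1.9 -52.7 -48.2 LT12"
dat1 <- as.data.table(read.table(text = dat, header = TRUE))

# Number of distinct values per column within each ID (assumption: ID defines a duplicate)
counts <- dat1[, lapply(.SD, uniqueN), by = ID]

# Flag a column if any ID has more than one distinct value in it
value_cols <- setdiff(names(counts), "ID")
differs <- vapply(value_cols, function(col) any(counts[[col]] > 1L), logical(1))
diff_cols <- value_cols[differs]

# Keep only the differing columns
result <- dat1[, ..diff_cols]
print(result)
# columns for the sample: location, observationRep, observationVal, setSource
```

If location should be left out, drop it from diff_cols with setdiff() before subsetting.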
CodePudding user response:
I think your code would look something like this, using pandas in Python.
import pandas as pd
# load the dataframe
df = pd.read_csv("data.csv")
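# find every row whose ID occurs more than once (keep=False keeps all copies)
duplicates = df[df.duplicated(subset="ID", keep=False)]
# within each ID, flag the columns that hold more than one distinct value
differs = duplicates.groupby("ID").nunique().gt(1).any()
# subset the dataframe to only include the columns where the values are different
differences = duplicates[differs[differs].index]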
# print the resulting dataframe
print(differences)