Compare two pandas DataFrames of different lengths and get the names of the columns whose values changed

I have two dataframes of unequal length. Some records are present in both, and some may be present in only one of them.

import pandas as pd

data = [['1001', 4573, 'Test1'], ['1002', 4574, 'Test2'], ['1003', 4575, 'Test3']]
df1 = pd.DataFrame(data, columns=['id', 'addressid', 'address'])

data = [['1001', 4573, 'Test1'], ['1002', 4574, 'Test22']]
df2 = pd.DataFrame(data, columns=['id', 'addressid', 'address'])

I need a good way to get the names of the columns whose values differ. For example, here the address for id 1002 changed from Test2 to Test22, so I need that column name returned as output.

Expected output: ['address']

CodePudding user response:

If you set the id column as the index of both dataframes, you can compare values on the rows and columns they share (similar to an inner merge):

# set id as index
df1 = df1.set_index('id')
df2 = df2.set_index('id')

# find inner join of index and columns
rows = df1.index.intersection(df2.index)
cols = df1.columns.intersection(df2.columns)

# get columns where at least one value is different
out = df2.loc[rows, cols].ne(df1.loc[rows, cols]).any() \
         .loc[lambda x: x].index.tolist()
print(out)

# Output
['address']
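
The same idea can be wrapped in a small reusable function. The following is only a sketch: the names `changed_columns` and `unmatched_ids` are hypothetical helpers, not pandas functions, and it assumes the values in the key column are unique within each dataframe. The second helper covers the part of the question about records that exist in only one dataframe.

```python
import pandas as pd


def changed_columns(old, new, key="id"):
    """Names of shared columns that differ on records present in both frames."""
    old = old.set_index(key)
    new = new.set_index(key)
    # restrict the comparison to ids and columns found in both frames
    rows = old.index.intersection(new.index)
    cols = old.columns.intersection(new.columns)
    differs = new.loc[rows, cols].ne(old.loc[rows, cols])
    # keep the columns where at least one row differs
    return differs.columns[differs.any().to_numpy()].tolist()


def unmatched_ids(old, new, key="id"):
    """Ids found only in `old`, and ids found only in `new`."""
    old_ids = pd.Index(old[key])
    new_ids = pd.Index(new[key])
    return old_ids.difference(new_ids).tolist(), new_ids.difference(old_ids).tolist()


df1 = pd.DataFrame(
    [['1001', 4573, 'Test1'], ['1002', 4574, 'Test2'], ['1003', 4575, 'Test3']],
    columns=['id', 'addressid', 'address'],
)
df2 = pd.DataFrame(
    [['1001', 4573, 'Test1'], ['1002', 4574, 'Test22']],
    columns=['id', 'addressid', 'address'],
)

print(changed_columns(df1, df2))  # ['address']
print(unmatched_ids(df1, df2))    # (['1003'], [])
```

Note that `ne` treats a missing value as different from another missing value, so a column where both dataframes hold NaN for the same record would be reported as changed.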

CodePudding user response:

Here is a generalized approach that is not specific to column names and assumes that multiple columns and multiple records may have changed.

Note that I set "id" as the index, so that I have an identity on which to match records that are expected to be the same entity.

Rather than printing a single column name for a single record, this generates a dataframe containing every record that changed, along with the name (or list of names) of the columns that changed.

import pandas as pd

data = [['1001', 4573, 'Test1'], ['1002', 4574, 'Test2'], ['1003', 4575, 'Test3']]
df1 = pd.DataFrame(data, columns=['id', 'addressid', 'address'])
data = [['1001', 4573, 'Test1'], ['1002', 4574, 'Test22']]
df2 = pd.DataFrame(data, columns=['id', 'addressid', 'address'])


def getColChangedName(row):
    # collect the names of the columns whose "_diff" flag is set for this row
    colsChanged = []
    for c in df1.columns.values:
        if row[c + "_diff"]:
            colsChanged.append(c)
    return ", ".join(colsChanged)


df1 = df1.set_index("id")
df2 = df2.set_index("id")

# inner join on id: only records present in both dataframes are compared
diffs = df1.merge(df2, left_index=True, right_index=True, suffixes=('_old', '_new'))

diff_cols = [c + "_diff" for c in df1.columns.values]

for c in df1.columns.values:
    diffs[c + "_diff"] = diffs[c + "_old"] != diffs[c + "_new"]

# a record has changed if any of its columns differ
diffs["Record_Changed"] = diffs[diff_cols].any(axis=1)

# keep only the changed records; copy to avoid SettingWithCopyWarning
diffs = diffs[diffs["Record_Changed"]].copy()
diffs["Cols_Changed"] = diffs.apply(getColChangedName, axis=1)

diffs.head()

On the test data, the resulting diffs DataFrame has a single row, for id 1002, with address_diff set to True and Cols_Changed set to address.

And, more simply, if you just want the list of columns changed for each record:

print(diffs[["Cols_Changed"]])

     Cols_Changed
id               
1002      address
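
As a possible alternative, pandas 1.1 and later provide `DataFrame.compare`, which can replace the manual `_old` / `_new` / `_diff` bookkeeping. This is a sketch rather than a drop-in replacement: `compare` requires identically labelled frames, so the two dataframes are first aligned on their shared ids and columns, and the names `changes`, `changed_cols` and `per_record` are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [['1001', 4573, 'Test1'], ['1002', 4574, 'Test2'], ['1003', 4575, 'Test3']],
    columns=['id', 'addressid', 'address'],
).set_index('id')
df2 = pd.DataFrame(
    [['1001', 4573, 'Test1'], ['1002', 4574, 'Test22']],
    columns=['id', 'addressid', 'address'],
).set_index('id')

# align both frames on the ids and columns they share
rows = df1.index.intersection(df2.index)
cols = df1.columns.intersection(df2.columns)

# compare() keeps only the rows and columns that contain a difference;
# its columns are a MultiIndex of (column name, 'self' / 'other')
changes = df1.loc[rows, cols].compare(df2.loc[rows, cols])

# names of all columns that changed in at least one record
changed_cols = changes.columns.get_level_values(0).unique().tolist()

# changed column names per record; unchanged cells are NaN in `changes`
per_record = {
    idx: row.dropna().index.get_level_values(0).unique().tolist()
    for idx, row in changes.iterrows()
}

print(changed_cols)  # ['address']
print(per_record)    # {'1002': ['address']}
```

Unlike `!=`, `compare` treats two missing values in the same position as equal. The `dropna` step assumes a value never changes to NaN; if that can happen in your data, the per-record mapping would need a different test.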