Assuming that I have a df like this:
import pandas as pd
data = {"name": ["nameA", "nameB", "nameC", "nameD", "nameA", "nameB", "nameC", "nameD", "nameE", "nameA", "nameC", "nameD"],
"id": [100, 200, 300, 400, 100, 200, 300, 400, 500, 100, 300, 400],
"value": ["v1", "v2", "v3", "v4", "v1", "v2", "v105", "v107", "v5", "v1", "v3615", "v4"]}
df = pd.DataFrame(data)
df
I want to group the rows that have duplicated values in the name and id columns but different values in the value column. The idea is that if name and id are the same, the rows must have the same value, which is not always true, and I want to extract the offending indexes. For example, I want an output like this:
(2, 6, 10), (3, 7, 11)
where name and id are the same but the value in one of the duplicated rows is different, like ("v4", "v107", "v4").
I can find the duplicated rows with duplicated, but that gives me all of the duplicated rows:
df[df.duplicated(["name", "id"])]
This results in:
     name   id  value
4   nameA  100     v1
5   nameB  200     v2
6   nameC  300   v105
7   nameD  400   v107
9   nameA  100     v1
10  nameC  300  v3615
11  nameD  400     v4
The values under indexes 4 and 9 are the same, so I don't care about that group.
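Passing keep=False to duplicated at least marks every row of a duplicated group, including the first occurrences (indexes 2 and 3), but it still includes the groups whose values all match:
df[df.duplicated(["name", "id"], keep=False)]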
I guess this is really simple, but I have been stuck on it for at least an hour and can't figure it out.
CodePudding user response:
Try groupby with filter, keeping the groups whose value column is not constant:
# keep (name, id) groups that contain more than one distinct value
out = df.groupby(["name", "id"]).filter(lambda x: x["value"].nunique() > 1)
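If you also want the indexes as tuples, like the output shown in the question, one possible follow-up (a sketch; idx_tuples is just an illustrative name) is to regroup the filtered rows and collect each group's index:
# regroup the offending rows and collect their row labels as tuples
idx_tuples = [tuple(g.index) for _, g in out.groupby(["name", "id"])]
print(idx_tuples)
# [(2, 6, 10), (3, 7, 11)]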
CodePudding user response:
Use groupby() and agg() with nunique() to extract the lists of indexes for the groups that have 2 or more unique values, as follows:
# keep each group's index labels only when the group has >= 2 distinct values;
# indexing the index with a scalar boolean yields an empty selection for False
idx_lst_all = df.groupby(['name', 'id'])['value'].agg(lambda x: x.index[x.nunique() >= 2]).values
# the non-empty entries are 2-D arrays, so take the first (only) row of each
idx_lst_selected = [tuple(x[0]) for x in idx_lst_all if x.size > 0]
Result:
print(idx_lst_selected)
[(2, 6, 10), (3, 7, 11)]
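Note that indexing an index with a scalar boolean relies on a numpy quirk that newer pandas versions deprecate, so the same idea can also be written more explicitly (a sketch using the question's column names):
# collect the index tuple of every (name, id) group whose value
# column contains more than one distinct value
idx_lst_selected = [
    tuple(g.index)
    for _, g in df.groupby(['name', 'id'])['value']
    if g.nunique() >= 2
]
print(idx_lst_selected)
# [(2, 6, 10), (3, 7, 11)]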