Remove duplicate values from a column of json objects in Pandas Dataframe?

Time:09-28

I am trying to remove the duplicates in the previous_location column, which currently holds a list of JSON objects in each row. After removing the duplicates, I'd like each row's value to remain a list of JSON objects.

import pandas as pd

# Initialize the list of lists
data = [['tom', 20, [{"location":"USA", "State":"CA"}, {"location":"USA", "State":"CA"}, {"location":"USA", "State":"TX"}]],
        ['nick', 35, [{"location":"USA", "State":"PA"}, {"location":"USA", "State":"PA"}, {"location":"USA", "State":"ME"}]],
        ['julie', 29, [{"location":"USA", "State":"WA"}, {"location":"USA", "State":"WA"}, {"location":"USA", "State":"HI"}]]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['name', 'age', 'previous_location'])

# Print the DataFrame
print(df)

Current output:

name age previous_location
tom 20 [{"State": "CA", "location": "USA"}, {"State": "CA", "location": "USA"}, {"State": "TX", "location": "USA"}]
nick 35 [{"State": "PA", "location": "USA"}, {"State": "PA", "location": "USA"}, {"State": "ME", "location": "USA"}]
julie 29 [{"State": "WA", "location": "USA"}, {"State": "WA", "location": "USA"}, {"State": "HI", "location": "USA"}]

Expected output:

name age previous_location
tom 20 [{"State": "CA", "location": "USA"}, {"State": "TX", "location": "USA"}]
nick 35 [{"State": "PA", "location": "USA"}, {"State": "ME", "location": "USA"}]
julie 29 [{"State": "WA", "location": "USA"}, {"State": "HI", "location": "USA"}]

CodePudding user response:

Try converting each dict to a hashable tuple of its sorted items and deduplicating with a set:

df["previous_location"] = df["previous_location"].apply(
    lambda x: [dict(d) for d in set(tuple(sorted(d.items())) for d in x)]
)

print(df)

Prints:

    name  age                                                         previous_location
0    tom   20  [{'State': 'CA', 'location': 'USA'}, {'State': 'TX', 'location': 'USA'}]
1   nick   35  [{'State': 'ME', 'location': 'USA'}, {'State': 'PA', 'location': 'USA'}]
2  julie   29  [{'State': 'WA', 'location': 'USA'}, {'State': 'HI', 'location': 'USA'}]
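
To make the one-liner easier to follow, here is a minimal sketch of the same steps applied to a single list outside pandas. The variable names (`locations`, `as_tuples`, `unique_tuples`, `deduped`) are only illustrative and do not appear in the answer itself.

```python
# One row's worth of data, copied from the question.
locations = [
    {"location": "USA", "State": "CA"},
    {"location": "USA", "State": "CA"},
    {"location": "USA", "State": "TX"},
]

# Dicts are unhashable, so each one is turned into a sorted tuple of
# (key, value) pairs; sorting makes the original key order irrelevant.
as_tuples = [tuple(sorted(d.items())) for d in locations]

# A set drops the repeated tuple, but it does not promise any particular order.
unique_tuples = set(as_tuples)

# Each tuple of pairs converts straight back into a dict.
deduped = [dict(t) for t in unique_tuples]

print(len(deduped))  # 2
```

Because the result passes through a set, the order of the remaining items may differ from the input, which is what happened in nick's row above.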

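EDIT: A set does not preserve order (nick's row above came back as ME, PA instead of PA, ME). To keep the items in their original order, track the ones already seen: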

def fn(x):
    out, seen = [], set()
    for dct in x:
        t = tuple(sorted(dct.items()))
        if t not in seen:
            out.append(dct)
            seen.add(t)
    return out


df["previous_location"] = df["previous_location"].apply(fn)

print(df)

Prints:

    name  age                                                         previous_location
0    tom   20  [{'location': 'USA', 'State': 'CA'}, {'location': 'USA', 'State': 'TX'}]
1   nick   35  [{'location': 'USA', 'State': 'PA'}, {'location': 'USA', 'State': 'ME'}]
2  julie   29  [{'location': 'USA', 'State': 'WA'}, {'location': 'USA', 'State': 'HI'}]
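
One caveat with both versions: `tuple(sorted(dct.items()))` is only hashable when every value in the dict is hashable, so a nested list or dict would raise a `TypeError` when the tuple is added to the set. As a rough sketch of one possible workaround, assuming all values are JSON-serializable, a `json.dumps(..., sort_keys=True)` string can serve as the lookup key instead. The function name `dedupe_dicts` and the `cities` field are hypothetical, added only for illustration.

```python
import json


def dedupe_dicts(items):
    """Drop repeated dicts from a list, keeping the first occurrence of each."""
    out, seen = [], set()
    for dct in items:
        # sort_keys=True gives equal dicts the same string regardless of key order.
        key = json.dumps(dct, sort_keys=True)
        if key not in seen:
            seen.add(key)
            out.append(dct)
    return out


rows = [
    {"location": "USA", "State": "CA", "cities": ["LA", "SF"]},  # hypothetical nested field
    {"State": "CA", "location": "USA", "cities": ["LA", "SF"]},  # same content, different key order
    {"location": "USA", "State": "TX", "cities": []},
]

result = dedupe_dicts(rows)
print(len(result))  # 2
```

It could be applied the same way as `fn`, for example `df["previous_location"].apply(dedupe_dicts)`.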