I am trying to remove the duplicates in the previous_location column, which currently holds a JSON object (a list of dicts) in each row. I'd like each row to remain a JSON object after the duplicates are removed.
import pandas as pd

# initialize list of lists
data = [['tom', 20, [{"location":"USA", "State":"CA"}, {"location":"USA", "State":"CA"}, {"location":"USA", "State":"TX"}]],
        ['nick', 35, [{"location":"USA", "State":"PA"}, {"location":"USA", "State":"PA"}, {"location":"USA", "State":"ME"}]],
        ['julie', 29, [{"location":"USA", "State":"WA"}, {"location":"USA", "State":"WA"}, {"location":"USA", "State":"HI"}]]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['name', 'age', 'previous_location'])

# print dataframe.
print(df)
name | age | previous_location |
---|---|---|
tom | 20 | [{"State": "CA", "location": "USA"}, {"State": "CA", "location": "USA"}, {"State": "TX", "location": "USA"}] |
nick | 35 | [{"State": "PA", "location": "USA"}, {"State": "PA", "location": "USA"}, {"State": "ME", "location": "USA"}] |
julie | 29 | [{"State": "WA", "location": "USA"}, {"State": "WA", "location": "USA"}, {"State": "HI", "location": "USA"}] |
Expected output:
name | age | previous_location |
---|---|---|
tom | 20 | [{"State": "CA", "location": "USA"}, {"State": "TX", "location": "USA"}] |
nick | 35 | [{"State": "PA", "location": "USA"}, {"State": "ME", "location": "USA"}] |
julie | 29 | [{"State": "WA", "location": "USA"}, {"State": "HI", "location": "USA"}] |
CodePudding user response:
Try:
df["previous_location"] = df["previous_location"].apply(
    lambda x: [dict(d) for d in set(tuple(sorted(d.items())) for d in x)]
)
print(df)
Prints:
name age previous_location
0 tom 20 [{'State': 'CA', 'location': 'USA'}, {'State': 'TX', 'location': 'USA'}]
1 nick 35 [{'State': 'ME', 'location': 'USA'}, {'State': 'PA', 'location': 'USA'}]
2 julie 29 [{'State': 'WA', 'location': 'USA'}, {'State': 'HI', 'location': 'USA'}]
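To see what the lambda does for a single row, here is a rough plain-Python sketch. The names `row`, `keys` and `deduped` are illustrative only and not part of the answer above. Dicts are not hashable, so each one is first turned into a sorted tuple of its items. Since a set has no guaranteed order, the result may come back reordered, as in nick's row above (ME before PA).

```python
row = [
    {"location": "USA", "State": "CA"},
    {"State": "CA", "location": "USA"},  # same content, different key order
    {"location": "USA", "State": "TX"},
]

# dicts are unhashable, so putting them in a set directly raises TypeError
try:
    set(row)
    hashable = True
except TypeError:
    hashable = False

# sorted (key, value) tuples are hashable and ignore key insertion order
keys = {tuple(sorted(d.items())) for d in row}

# rebuild the dicts; the iteration order of a set is not guaranteed
deduped = [dict(t) for t in keys]
print(len(deduped))  # 2
```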
EDIT: To keep the original order of the items:
def fn(x):
    out, seen = [], set()
    for dct in x:
        t = tuple(sorted(dct.items()))
        if t not in seen:
            out.append(dct)
            seen.add(t)
    return out
df["previous_location"] = df["previous_location"].apply(fn)
print(df)
Prints:
name age previous_location
0 tom 20 [{'location': 'USA', 'State': 'CA'}, {'location': 'USA', 'State': 'TX'}]
1 nick 35 [{'location': 'USA', 'State': 'PA'}, {'location': 'USA', 'State': 'ME'}]
2 julie 29 [{'location': 'USA', 'State': 'WA'}, {'location': 'USA', 'State': 'HI'}]
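One caveat: `tuple(sorted(dct.items()))` only works while every value in the dict is hashable. If the dicts might contain nested lists or dicts, one possible variation is to use a canonical JSON string as the key instead. This is only a sketch: `dedupe_json` is a hypothetical helper name, and it assumes all values are JSON-serializable.

```python
import json


def dedupe_json(items):
    """Drop duplicate dicts from a list, keeping first occurrences in order."""
    out, seen = [], set()
    for dct in items:
        # canonical string: key order is ignored, nested values are allowed
        key = json.dumps(dct, sort_keys=True)
        if key not in seen:
            seen.add(key)
            out.append(dct)
    return out


example = [
    {"location": "USA", "State": "CA", "cities": ["LA"]},
    {"State": "CA", "location": "USA", "cities": ["LA"]},  # duplicate
    {"location": "USA", "State": "TX", "cities": []},
]
print(dedupe_json(example))
```

It should plug into the DataFrame the same way as `fn`: `df["previous_location"] = df["previous_location"].apply(dedupe_json)`.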