I'm new to pandas and I have a question.
I have a dataframe like
Code Keywords
A Real estate, loan, building, office, land, warehouse
B Real Estate Lease , Real Estate, building, Office, Warehouse, rental, Tenant, broker advisor, Real Estate Lease , Lease and rent
C Transport Air freight, shift, cargo, truck, insurance, Transport Insurance, Transport
D Transport, shift, cargo, truck, insurance, Transport Insurance
and I should remove duplicates on "Keywords" column, no matter if the duplicates are on the same row or on 3 different rows. No matter if it is written "warehouse" or "Warehouse" Everything value duplicated is removed
The result should look like this:
Code Keywords
A loan, land
B Real Estate Lease, rental, Tenant, broker advisor, Real Estate Lease , Lease and rent
C Transport Air freight
D
For instance, column "D" will not have keywords at all, because all of them have duplicates on other rows
Thank you
CodePudding user response:
One way using pandas.Series.str.split
with explode
:
m = df["Keywords"].str.split("\s*,\s*").explode()
m = m[~m.str.lower().duplicated(False)]
df["Keywords"] = m.groupby(m.index).apply(", ".join)
df = df.fillna("")
Output:
Code Keywords
0 A loan, land
1 B rental, Tenant, broker advisor, Lease and rent
2 C Transport Air freight
3 D