Home > database >  Unique keywords on the column
Unique keywords on the column

Time:11-02

I'm new to pandas and I have a question.

I have a dataframe like

Code    Keywords
A       Real estate, loan, building, office, land, warehouse
B       Real Estate Lease , Real Estate, building, Office, Warehouse, rental, Tenant, broker advisor, Real Estate Lease , Lease and rent
C       Transport Air freight, shift, cargo, truck, insurance, Transport Insurance, Transport
D       Transport, shift, cargo, truck, insurance, Transport Insurance

and I should remove duplicates on "Keywords" column, no matter if the duplicates are on the same row or on 3 different rows. No matter if it is written "warehouse" or "Warehouse" Everything value duplicated is removed

The result should look like this:

Code    Keywords
A       loan, land
B       Real Estate Lease, rental, Tenant, broker advisor, Real Estate Lease , Lease and rent
C       Transport Air freight
D   

For instance, column "D" will not have keywords at all, because all of them have duplicates on other rows

Thank you

CodePudding user response:

One way using pandas.Series.str.split with explode:

m = df["Keywords"].str.split("\s*,\s*").explode()
m = m[~m.str.lower().duplicated(False)]
df["Keywords"] = m.groupby(m.index).apply(", ".join)
df = df.fillna("")

Output:

  Code                                        Keywords
0    A                                      loan, land
1    B  rental, Tenant, broker advisor, Lease and rent
2    C                           Transport Air freight
3    D                                                
  • Related