Home > Blockchain >  Dataframe Check in each columns if anything is repeating, create new column and add. Remove unnecess
Dataframe Check in each columns if anything is repeating, create new column and add. Remove unnecess

Time:08-18

In the first row Zip is repeating in same so remove repeated data and in content column remove "-" and then match If anything is repeating then add in 2nd column like zip -> zip2

Take care about EMPTY value like id 4 column Contact

id  Zip          Contact     
1   12345,12345  201-201-2012, 2012012012
2   67890        801-801-8012, 456-456-4567
3   12345,67890  7897897897
4   89898

Output

id Zip    Zip2   Contact      Contact2
1  12345         2012012012
2  67890         8018018012   4564564567
3  12345  67890  7897897897
4  89898

CodePudding user response:

You can do like below and it should work with any number of zip, contact values.

df_zip = (
    df["Zip"]
    .apply(lambda x: ",".join(list(OrderedDict.fromkeys(x.split(",")))))
    .str.split(",", expand=True)
)
df_zip.columns = df_zip.columns.map("zip_{}".format)

df_contact = df["Contact"].astype(str).str.replace("-", "")
df_contact = df_contact.apply(
    lambda x: ",".join(list(OrderedDict.fromkeys([e.strip() for e in x.split(",")])))
).str.split(",", expand=True)
df_contact.columns = df_contact.columns.map("Contact_{}".format)

out = pd.concat([df["id"], df_zip, df_contact], axis=1).fillna("")

print(out):

   id  zip_0  zip_1   Contact_0   Contact_1
0   1  12345         2012012012            
1   2  67890         8018018012  4564564567
2   3  12345  67890  7897897897            
3   4  89898    

You see I used OrderedDict to preserve the order of the keys

  • Related