In the first row Zip is repeating in same so remove repeated data and in content column remove "-" and then match If anything is repeating then add in 2nd column like zip -> zip2
Take care about EMPTY value like id 4 column Contact
id Zip Contact
1 12345,12345 201-201-2012, 2012012012
2 67890 801-801-8012, 456-456-4567
3 12345,67890 7897897897
4 89898
Output
id Zip Zip2 Contact Contact2
1 12345 2012012012
2 67890 8018018012 4564564567
3 12345 67890 7897897897
4 89898
CodePudding user response:
You can do like below and it should work with any number of zip, contact values.
df_zip = (
df["Zip"]
.apply(lambda x: ",".join(list(OrderedDict.fromkeys(x.split(",")))))
.str.split(",", expand=True)
)
df_zip.columns = df_zip.columns.map("zip_{}".format)
df_contact = df["Contact"].astype(str).str.replace("-", "")
df_contact = df_contact.apply(
lambda x: ",".join(list(OrderedDict.fromkeys([e.strip() for e in x.split(",")])))
).str.split(",", expand=True)
df_contact.columns = df_contact.columns.map("Contact_{}".format)
out = pd.concat([df["id"], df_zip, df_contact], axis=1).fillna("")
print(out):
id zip_0 zip_1 Contact_0 Contact_1
0 1 12345 2012012012
1 2 67890 8018018012 4564564567
2 3 12345 67890 7897897897
3 4 89898
You see I used OrderedDict to preserve the order of the keys