Removing unwanted strings, signs and spaces from a pandas DataFrame column

I have a Department column in a pandas DataFrame, as follows:

Date	Department
Friday, 1 April 2022	S220- Department of Transport
Thursday, 26 August 2021	S220 Department of Transport
Friday, 1 April 2022	S221- Department of Land, Water, Planning
Thursday, 26 August 2021	S221 Department of Land, Water, Planning

The Department column in the source data is inconsistent. For example, S220- Department of Transport and S220 Department of Transport refer to the same department, but when I pivot this data I get two Department of Transport columns where I expect one. I am currently using find and replace, but there are hundreds of agencies.

How can I get the data in the following format?

Date	Department
Friday, 1 April 2022	Department of Transport
Thursday, 26 August 2021	Department of Transport
Friday, 1 April 2022	Department of Land, Water, Planning
Thursday, 26 August 2021	Department of Land, Water, Planning

The string should start at the D and take in everything to the right of the D, including any ','s. I appreciate your kind suggestions.

CodePudding user response：

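A possible solution is to apply a function to each row of Department that drops the first character and then removes every digit and hyphen, as below. This is not very scalable, and it also removes any digit or hyphen that appears inside the department name itself, but it might be sufficient.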

import pandas as pd
dep = ["S220- Department of Transport", "S220 Department of Transport" ,"S221- Department of Land, Water, Planning", "S221 Department of Land, Water, Planning"]
df = pd.DataFrame({"Department": dep})

df["Department"] = df["Department"].apply(lambda x: "".join(c for c in x[1:] if not c.isdigit() and c != "-").strip())

print(df.head())
                            Department
0              Department of Transport
1              Department of Transport
2  Department of Land, Water, Planning
3  Department of Land, Water, Planning
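
The same cleanup could also be done without a per-row lambda, using pandas' vectorized string methods. The sketch below assumes every code has the shape seen in the sample (one uppercase letter, then digits, then an optional hyphen); the `prefix` pattern is an illustration of that assumption, not something confirmed for the full list of agencies. Because the pattern is anchored at the start of the string, digits and hyphens later in the name are left alone.

```python
import pandas as pd

dep = [
    "S220- Department of Transport",
    "S220 Department of Transport",
    "S221- Department of Land, Water, Planning",
    "S221 Department of Land, Water, Planning",
]
df = pd.DataFrame({"Department": dep})

# Assumed prefix shape: one uppercase letter, digits, an optional hyphen,
# plus any surrounding whitespace. Anchored with ^ so only the start is touched.
prefix = r"^\s*[A-Z]\d+\s*-?\s*"

df["Department"] = df["Department"].str.replace(prefix, "", regex=True).str.strip()

print(df["Department"].unique())
```

If real codes look different (for example, two letters or no digits), the pattern would need adjusting.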

CodePudding user response：

Here is one way to do it, given this requirement from the question:

The string format should start at the D and take in all including ',' s to the right of D

df['Department'] = df['Department'].str.extract(r'(D.*)')
df

    Date                        Department
0   Friday, 1 April 2022        Department of Transport
1   Thursday, 26 August 2021    Department of Transport
2   Friday, 1 April 2022        Department of Land, Water, Planning
3   Thursday, 26 August 2021    Department of Land, Water, Planning
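
To check that this removes the duplicate columns described in the question, one option is to compare the pivoted column count before and after cleaning. The sketch below rebuilds the sample data and uses `pd.crosstab` as a stand-in for the pivot, since the question does not show a values column; `before` and `after` are illustrative names. Note that `(D.*)` matches from the first capital D anywhere in the string, so it relies on every agency name starting with "D" and on the code itself containing no "D".

```python
import pandas as pd

df = pd.DataFrame({
    "Date": [
        "Friday, 1 April 2022",
        "Thursday, 26 August 2021",
        "Friday, 1 April 2022",
        "Thursday, 26 August 2021",
    ],
    "Department": [
        "S220- Department of Transport",
        "S220 Department of Transport",
        "S221- Department of Land, Water, Planning",
        "S221 Department of Land, Water, Planning",
    ],
})

# Row counts per Date x Department, before cleaning.
before = pd.crosstab(df["Date"], df["Department"])

# expand=False makes str.extract return a Series for a single capture group.
df["Department"] = df["Department"].str.extract(r"(D.*)", expand=False)

# Same table after cleaning: the two spellings of each department now coincide.
after = pd.crosstab(df["Date"], df["Department"])

print(before.shape[1], "columns before,", after.shape[1], "columns after")
```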