I have a Department column in a pandas DataFrame, as follows:
Date | Department |
---|---|
Friday, 1 April 2022 | S220- Department of Transport |
Thursday, 26 August 2021 | S220 Department of Transport |
Friday, 1 April 2022 | S221- Department of Land, Water, Planning |
Thursday, 26 August 2021 | S221 Department of Land, Water, Planning |
In the source data, the Department column is inconsistent. For example, S220- Department of Transport and S220 Department of Transport refer to the same department, but when I pivot this data I get two Department of Transport columns. I expect a single Department of Transport column. Currently I am using find and replace, but there are hundreds of agencies.
How can I get the data in the following format?
Date | Department |
---|---|
Friday, 1 April 2022 | Department of Transport |
Thursday, 26 August 2021 | Department of Transport |
Friday, 1 April 2022 | Department of Land, Water, Planning |
Thursday, 26 August 2021 | Department of Land, Water, Planning |
The string should start at the D and include everything to its right, including any commas. I appreciate your kind suggestions.
CodePudding user response:
A possible solution is to apply a function to each value in the Department column and remove the unwanted leading characters, as below. This is not very scalable, but it might be sufficient.
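import pandas as pd
dep = ["S220- Department of Transport", "S220 Department of Transport", "S221- Department of Land, Water, Planning", "S221 Department of Land, Water, Planning"]
df = pd.DataFrame({"Department": dep})
# Skip the first character (the leading letter), drop every digit and hyphen, then trim whitespace.
# Caveat: this also removes digits and hyphens that appear inside a department name.
df["Department"] = df["Department"].apply(lambda x: "".join(c for c in x[1:] if not c.isdigit() and c != "-").strip())
print(df.head())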
Department
0 Department of Transport
1 Department of Transport
2 Department of Land, Water, Planning
3 Department of Land, Water, Planning
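If the per-character filter is too aggressive (it also strips digits and hyphens inside a name), a vectorized alternative is to remove only the leading code with an anchored regular expression. This is a minimal sketch, assuming every code looks like one letter followed by digits and an optional hyphen; the `prefix` pattern is my assumption about the data, not something confirmed by the question.

```python
import pandas as pd

dep = [
    "S220- Department of Transport",
    "S220 Department of Transport",
    "S221- Department of Land, Water, Planning",
    "S221 Department of Land, Water, Planning",
]
df = pd.DataFrame({"Department": dep})

# Assumed prefix shape: one letter, one or more digits, an optional hyphen,
# with optional whitespace around it. Anchored at the start of the string,
# so digits or hyphens later in the name are left alone.
prefix = r"^\s*[A-Za-z]\d+\s*-?\s*"

df["Department"] = (
    df["Department"].str.replace(prefix, "", regex=True).str.strip()
)

print(df["Department"].tolist())
```

Because the pattern is anchored with `^`, values that do not start with such a code pass through unchanged, which makes it easy to spot agencies that follow a different naming scheme.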
CodePudding user response:
Here is one way to do it, keeping everything from the first D onwards, commas included:
# expand=False returns a Series rather than a one-column DataFrame
df['Department'] = df['Department'].str.extract(r'(D.*)', expand=False)
df
Date Department
0 Friday, 1 April 2022 Department of Transport
1 Thursday, 26 August 2021 Department of Transport
2 Friday, 1 April 2022 Department of Land, Water, Planning
3 Thursday, 26 August 2021 Department of Land, Water, Planning
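To check that the cleanup fixes the pivot problem from the question, the cleaned column can be pivoted directly. This is only a sketch: the `Amount` column and its values are hypothetical, added so there is something to aggregate, and `aggfunc="sum"` is an assumption about how the duplicates should be combined.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": [
        "Friday, 1 April 2022",
        "Thursday, 26 August 2021",
        "Friday, 1 April 2022",
        "Thursday, 26 August 2021",
    ],
    "Department": [
        "S220- Department of Transport",
        "S220 Department of Transport",
        "S221- Department of Land, Water, Planning",
        "S221 Department of Land, Water, Planning",
    ],
    "Amount": [10, 20, 30, 40],  # hypothetical values, not from the question
})

# Keep everything from the first "D" onwards; expand=False yields a Series.
df["Department"] = df["Department"].str.extract(r"(D.*)", expand=False)

# After cleaning, each department appears as a single column in the pivot.
wide = df.pivot_table(
    index="Date", columns="Department", values="Amount", aggfunc="sum"
)

print(wide.columns.tolist())
```

Note that `(D.*)` starts at the first capital D anywhere in the string, so it relies on the code prefix never containing a D and on every name starting with one.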