Removing unwanted strings, signs and space from a pandas dataframe list

Time:07-02

I have a Department column in a pandas DataFrame, as follows:

Date                       Department
Friday, 1 April 2022       S220- Department of Transport
Thursday, 26 August 2021   S220 Department of Transport
Friday, 1 April 2022       S221- Department of Land, Water, Planning
Thursday, 26 August 2021   S221 Department of Land, Water, Planning

The Department column in the source data is inconsistent. For example, S220- Department of Transport and S220 Department of Transport refer to the same department, but when I pivot the data I get two Department of Transport columns where I expect one. I am currently using find and replace, but there are hundreds of agencies.

How can I get the data into the following format?

Date                       Department
Friday, 1 April 2022       Department of Transport
Thursday, 26 August 2021   Department of Transport
Friday, 1 April 2022       Department of Land, Water, Planning
Thursday, 26 August 2021   Department of Land, Water, Planning

Each string should start at the D of Department and keep everything to its right, including any commas. I appreciate your kind suggestions.

CodePudding user response:

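A possible solution is to apply a function to each row of Department and remove the unwanted leading characters, as below. It drops the first character and then every digit and hyphen, so it would also strip digits or hyphens that belong to a department name. It is not very scalable, but it might be sufficient.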

import pandas as pd
dep = ["S220- Department of Transport", "S220 Department of Transport" ,"S221- Department of Land, Water, Planning", "S221 Department of Land, Water, Planning"]
df = pd.DataFrame({"Department": dep})

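# Drop the first character (the code's leading letter), remove every digit and hyphen, then trim whitespace.
df["Department"] = df["Department"].apply(lambda x: "".join(c for c in x[1:] if not c.isdigit() and c != "-").strip())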

print(df.head())
                            Department
0              Department of Transport
1              Department of Transport
2  Department of Land, Water, Planning
3  Department of Land, Water, Planning
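
A vectorized alternative, as a rough sketch: if every code has the shape of one capital letter followed by digits and an optional hyphen (an assumption based only on the four sample rows), a single anchored regex can strip just the prefix and leave any digits or hyphens inside the department name untouched. The pattern and the variable name prefix are illustrative, not something from the question.

```python
import pandas as pd

dep = [
    "S220- Department of Transport",
    "S220 Department of Transport",
    "S221- Department of Land, Water, Planning",
    "S221 Department of Land, Water, Planning",
]
df = pd.DataFrame({"Department": dep})

# Assumed prefix shape: one capital letter, digits, optional hyphen, spaces.
# The ^ anchor means only the start of each string is touched.
prefix = r"^\s*[A-Z]\d+\s*-?\s*"
df["Department"] = df["Department"].str.replace(prefix, "", regex=True).str.strip()

print(df["Department"].unique())
```

Because the pattern is anchored at the start, rows that do not begin with such a code are left as they are, which makes it easy to spot agencies that need a different rule.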

CodePudding user response:

Here is one way to get what the question asks for:

The string format should start at the D and take in all including ',' s to the right of D

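# Capture from the first capital D to the end of the string; expand=False returns a Series.
df['Department'] = df['Department'].str.extract(r'(D.*)', expand=False)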
df
    Date                        Department
0   Friday, 1 April 2022        Department of Transport
1   Thursday, 26 August 2021    Department of Transport
2   Friday, 1 April 2022        Department of Land, Water, Planning
3   Thursday, 26 August 2021    Department of Land, Water, Planning
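
Note that a pattern starting at the first capital D returns NaN for any agency whose name contains no capital D. A possible variation, sketched under the assumption that each value is a code (capital letter plus digits), an optional hyphen, and then the name: split the column into the code and the name with named groups, and keep the original text wherever the pattern does not match. The column name Code and the variable parts are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": [
        "Friday, 1 April 2022",
        "Thursday, 26 August 2021",
        "Friday, 1 April 2022",
        "Thursday, 26 August 2021",
    ],
    "Department": [
        "S220- Department of Transport",
        "S220 Department of Transport",
        "S221- Department of Land, Water, Planning",
        "S221 Department of Land, Water, Planning",
    ],
})

# Assumed layout: code, optional hyphen, then the department name.
# Named groups become the column names of the extracted DataFrame.
parts = df["Department"].str.extract(
    r"^\s*(?P<Code>[A-Z]\d+)\s*-?\s*(?P<Name>.+?)\s*$"
)

# Rows that do not match the assumed layout keep their original text.
df["Code"] = parts["Code"]
df["Department"] = parts["Name"].fillna(df["Department"])

print(df)
```

Keeping the code in its own column also gives a stable key to pivot or group on if two agencies ever share a name.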