Dropping unnecessary text from data using regex and applying it entire dataframe-CodePudding

i have table where it has dates in multiple formats. with that it also has some unwanted text which i want to drop so that i could process this date strings

Data :

sr.no.           col_1                                col_2
1              'xper may 2022 - nov 2022'          'derh 06/2022 - 07/2022 ubj'
2              'sp@ 2021 - 2022'                   'zpt May 2022 - December 2022'

Expected Output :

sr.no.           col_1                                col_2
1              'may 2022 - nov 2022'           '06/2022 - 07/2022'
2              '2021 - 2022'                   'May 2022 - December 2022'

def keep_valid_characters(string):
    return re.sub(r'(?i)\b(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?)\b|[^a-z0-9/-]', '', string)

i am using the above pattern to drop but stuck. any other approach.?

CodePudding user response：

You can try to split the pattern construction to multiple strings in complicated case like this:

months = r"jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?"
pat = rf"(?i)((?:{months})?\s*[\d/] \s*-\s*(?:{months})?\s*[\d/] )"

df[["col_1", "col_2"]] = df[["col_1", "col_2"]].transform(lambda x: x.str.extract(pat)[0])
print(df)

Prints:

   sr.no.                col_1                     col_2
0       1  may 2022 - nov 2022         06/2022 - 07/2022
1       2          2021 - 2022  May 2022 - December 2022