Method to remove unwanted date characters in pandas dataframe

Time:11-22

I have collected some data in a pandas DataFrame. The Date column is not usable in its current form. I was able to remove the parenthesized text, which held various unimportant information, with:

df['Date'] = df['Date'].str.replace(r"\(.*\)", "", regex=True)

However, there is still a lot of information I don't want. I just need the day, month, and year. Here is some example output from df['Date']:

14 September 1932 14:40

september 6, 1943 6:06 p.m.

15 november 1991 about 16:00 local time

July 9 1981 01:47 CST 17:47 UTC

24 may 1847 ~06:50 am

I thought about converting each month to a number and then keeping only a fixed-width date such as dd/mm/yyyy, deleting everything after it. I'm just not sure how to convert every month name to an integer and then truncate the string. Also, the fixed-width approach might not be the best, given that not every date starts with a two-digit day. Any better suggestions?

Thank you for your time.

CodePudding user response:

I would suggest the following function, which tries to convert the string to a pandas datetime and, if that fails, cuts off the last word and tries again.

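import pandas as pd

def extract_date(text):
    """Return the leading date in `text`, or pd.NaT if nothing parses."""
    words = text.split()
    while words:
        try:
            # Try to parse what is left; return on the first success
            return pd.to_datetime(' '.join(words)).date()
        except (ValueError, OverflowError):
            words.pop()  # drop the last "word" and try again
    return pd.NaT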

You can apply the function to your Series like this:

df['Date'] = df['Date'].apply(extract_date)

But please be aware that this is not 100% reliable: the function only trims words from the end, so extraneous text at the beginning of a string will prevent it from parsing. Also, the order of day and month can vary, which may affect the results.
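
To illustrate the day/month caveat, here is a minimal sketch using an all-numeric date (a made-up input, not one of the examples from the question). By default pandas reads such an ambiguous string month-first; the dayfirst argument of pd.to_datetime switches the interpretation:

```python
import pandas as pd

ambiguous = "03/04/2021"  # hypothetical input: 4 March or 3 April?

# Default behaviour: the first number is taken as the month
month_first = pd.to_datetime(ambiguous)

# dayfirst=True: the first number is taken as the day
day_first = pd.to_datetime(ambiguous, dayfirst=True)

print(month_first.date(), day_first.date())
```

Dates written with month names, like the examples in the question, are not affected by this ambiguity.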

CodePudding user response:

Depending on how willing you are to risk a few inaccuracies, these patterns correctly extract the year, month, and day for all the examples given.

year = df['Date'].str.extract(r'\b(\d{4})\b')
month = df['Date'].str.extract(r'([a-zA-Z]+)')
day = df['Date'].str.extract(r'\b(\d{1,2})\b')
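
If a single date column is wanted, the three pieces can be recombined. The following is a rough sketch rather than a definitive solution. It assumes full English month names and that the first alphabetic word in each string is the month. The MONTHS mapping and the Date_clean column name are introduced here purely for illustration.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({"Date": [
    "14 September 1932 14:40",
    "september 6, 1943 6:06 p.m.",
    "15 november 1991 about 16:00 local time",
    "July 9 1981 01:47 CST 17:47 UTC",
    "24 may 1847 ~06:50 am",
]})

# Hypothetical lookup table: lowercase month name -> zero-padded month number
MONTHS = {
    "january": "01", "february": "02", "march": "03", "april": "04",
    "may": "05", "june": "06", "july": "07", "august": "08",
    "september": "09", "october": "10", "november": "11", "december": "12",
}

# expand=False returns a Series instead of a one-column DataFrame
year = df["Date"].str.extract(r"\b(\d{4})\b", expand=False)
month = df["Date"].str.extract(r"([a-zA-Z]+)", expand=False)
day = df["Date"].str.extract(r"\b(\d{1,2})\b", expand=False)

# Build YYYY-MM-DD strings; a missing piece makes the whole string missing
iso = year + "-" + month.str.lower().map(MONTHS) + "-" + day.str.zfill(2)

# errors="coerce" turns anything unparseable into NaT instead of raising
df["Date_clean"] = pd.to_datetime(iso, format="%Y-%m-%d", errors="coerce")
print(df["Date_clean"])
```

Mapping month names through an explicit dictionary also answers the original question of how to turn each month into a number, and it avoids depending on the system locale.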