Home > OS >  Extract date from string in a pandas dataframe column
Extract date from string in a pandas dataframe column

Time:11-18

I am trying to extract date from a DF column containing strings and store in another column.

from dateutil.parser import parse
 
extract = parse("January 24, 1976", fuzzy_with_tokens=True)
print(str(extract[0]))

The above code extracts: 1976-01-24 00:00:00

I would like this to be done to all strings in a column in a DF.

The below is what I am trying but is not working:

df['Dates'] = df.apply(lambda x: parse(x['Column to extract'], fuzzy_with_tokens=True), axis=1)

Things to note:

  1. If there are multiple dates, need to join them with some delimiter
  2. There can be strings without date. In that case parser returns an error "ParserError: String does not contain a date". This needs to be handled.

CodePudding user response:

See ents

We can apply the spacy logic to a dataframe

import pandas as pd  # 1.5.1


# some fake data
df = pd.DataFrame({
    "text": ["today is january 26, 2016. Tomorrow is january 27, 2016",
             "today is january 26, 2016.",
              "Tomorrow is january 27, 2016"]
})

# convert text to spacy docs
docs = nlp.pipe(df.text.to_numpy())

# unpack the generator into a series
doc_series = pd.Series(docs, index=df.index, name="docs")

df = df.join(doc_series)

# extract entities
df["entities"] = df.docs.apply(lambda x: x.ents)

# explode to one entity per row
df = df.explode(column="entities")

# build dictionary of ent type and ent text
df["entities"] = df.entities.apply(lambda ent: {ent.label_: ent.text})

# join back with df
df = df.join(df["entities"].apply(pd.Series))

# convert all DATE entities to datetime
df["dates"] = pd.to_datetime(df.DATE, errors="coerce")

# back to one row per original text and a container of datetimes
df = df.groupby("text").dates.unique().to_frame().reset_index()

print(df)
                                                text                                              dates
0                       Tomorrow is january 27, 2016               [NaT, 2016-01-27T00:00:00.000000000]
1                         today is january 26, 2016.  [2022-11-17T11:42:49.607705000, 2016-01-26T00:...
2  today is january 26, 2016. Tomorrow is january...  [2022-11-17T11:42:49.605705000, 2016-01-26T00:...

CodePudding user response:

If you want to use parse, you may need a customized function to handle exceptions:

def parse_date(row):
    try:
        date = parse(row, fuzzy_with_tokens=True)
        return date[0]
    except:
        return np.nan


df['dates'] = df['Column to extract'].apply(lambda x: parse_date(x))
  • Related