I am trying to extract date from a DF column containing strings and store in another column.
from dateutil.parser import parse
extract = parse("January 24, 1976", fuzzy_with_tokens=True)
print(str(extract[0]))
The above code extracts: 1976-01-24 00:00:00
I would like this to be done to all strings in a column in a DF.
The below is what I am trying but is not working:
df['Dates'] = df.apply(lambda x: parse(x['Column to extract'], fuzzy_with_tokens=True), axis=1)
Things to note:
- If there are multiple dates, need to join them with some delimiter
- There can be strings without date. In that case parser returns an error "ParserError: String does not contain a date". This needs to be handled.
CodePudding user response:
We can apply the spacy
logic to a dataframe
import pandas as pd # 1.5.1
# some fake data
df = pd.DataFrame({
"text": ["today is january 26, 2016. Tomorrow is january 27, 2016",
"today is january 26, 2016.",
"Tomorrow is january 27, 2016"]
})
# convert text to spacy docs
docs = nlp.pipe(df.text.to_numpy())
# unpack the generator into a series
doc_series = pd.Series(docs, index=df.index, name="docs")
df = df.join(doc_series)
# extract entities
df["entities"] = df.docs.apply(lambda x: x.ents)
# explode to one entity per row
df = df.explode(column="entities")
# build dictionary of ent type and ent text
df["entities"] = df.entities.apply(lambda ent: {ent.label_: ent.text})
# join back with df
df = df.join(df["entities"].apply(pd.Series))
# convert all DATE entities to datetime
df["dates"] = pd.to_datetime(df.DATE, errors="coerce")
# back to one row per original text and a container of datetimes
df = df.groupby("text").dates.unique().to_frame().reset_index()
print(df)
text dates
0 Tomorrow is january 27, 2016 [NaT, 2016-01-27T00:00:00.000000000]
1 today is january 26, 2016. [2022-11-17T11:42:49.607705000, 2016-01-26T00:...
2 today is january 26, 2016. Tomorrow is january... [2022-11-17T11:42:49.605705000, 2016-01-26T00:...
CodePudding user response:
If you want to use parse
, you may need a customized function to handle exceptions:
def parse_date(row):
try:
date = parse(row, fuzzy_with_tokens=True)
return date[0]
except:
return np.nan
df['dates'] = df['Column to extract'].apply(lambda x: parse_date(x))