I am trying to extract the birth year and death year from a short text like the following, stored in a column of a Pandas DataFrame:
firstname lastname (1937-2015)
I used this code to get the first year:
import re

def first_year(txt):  # function name assumed; only the body was shown
    data = re.findall(r'\d+', txt)
    if len(data) > 0:
        year = float(data[0])
        if 1800 <= year <= 2021:
            return year
    return None
but I can't extract the second year from the text. When I change data[0] to data[1], for example, I get the error message "list index out of range".
CodePudding user response:
You can use Series.str.extract with a generic regex that extracts the second year (from 1800 to 2021):
import pandas as pd
df = pd.DataFrame({'col':['firstname lastname (1937-2015)']})
yr = r'(?:1[89][0-9]{2}|20[01][0-9]|202[01])'
df['second_year'] = df['col'].str.extract(fr'(?s)(?<!\d){yr}(?!\d).*?({yr})(?!\d)')
# => df['second_year']
# 0 2015
# Name: second_year, dtype: object
Details:
(?s) - . now matches across lines
(?<!\d) - a left-hand numeric boundary
(?:1[89][0-9]{2}|20[01][0-9]|202[01]) - from 1800 to 2021
(?!\d) - a right-hand numeric boundary
.*? - any text, as few chars as possible
(1[89][0-9]{2}|20[01][0-9]|202[01]) - Group 1 (the actual return result of Series.str.extract): 1800 to 2021
(?!\d) - a right-hand numeric boundary
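To check which values the year alternation accepts, a minimal sketch along these lines may help. The helper name is_valid_year and the list of candidates are illustrative assumptions, not part of the answer above.

```python
import re

# Same year alternation as in the answer above.
yr = r'(?:1[89][0-9]{2}|20[01][0-9]|202[01])'

def is_valid_year(text):
    """Return True if text is exactly one year accepted by the pattern.

    Hypothetical helper, used here only to probe the pattern's range.
    """
    return re.fullmatch(yr, text) is not None

# Probe values around the lower and upper limits of the range.
for candidate in ['1799', '1800', '1999', '2021', '2022']:
    print(candidate, is_valid_year(candidate))
```

Only the values from 1800 through 2021 should be reported as accepted.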
In this concrete case, a simple
df['second_year'] = df['col'].str.extract(r'.*-(\d{4})')
will be enough: any text (as many chars other than line break chars as possible) and then a - and four digits captured into Group 1.
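As a rough sketch of how the simple pattern behaves on rows with and without a second year, assuming a second sample row that is not in the answer above and that a numeric column is wanted:

```python
import pandas as pd

# The second row is an assumed example of text without a year range.
df = pd.DataFrame({'col': ['firstname lastname (1937-2015)',
                           'firstname lastname 1900']})

# expand=False returns a Series when the pattern has a single capture group.
second = df['col'].str.extract(r'.*-(\d{4})', expand=False)

# Rows without a "-dddd" part yield a missing value, which to_numeric keeps.
df['second_year'] = pd.to_numeric(second)
print(df)
```

Converting with pd.to_numeric is optional; without it the column holds the year as text.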
CodePudding user response:
You can define two capture groups and validate each captured year:
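import pandas as pd

df = pd.DataFrame(
    {'txt': ['firstname lastname (1937-2015)', 'firstname lastname (1780-1820)',
             'firstname lastname 1900', 'firstname lastname (1980-2022)']})
# Capture both years, then blank out values outside 1800-2021.
# Note: applymap is deprecated since pandas 2.1 in favor of DataFrame.map.
df[['birth', 'death']] = (df['txt'].str.extract(r'(\d+)-(\d+)').astype(float)
                          .applymap(lambda x: x if 1800 <= x <= 2021 else None))
print(df)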
Output:
                              txt   birth   death
0  firstname lastname (1937-2015)  1937.0  2015.0
1  firstname lastname (1780-1820)     NaN  1820.0
2         firstname lastname 1900     NaN     NaN
3  firstname lastname (1980-2022)  1980.0     NaN
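The same range check can also be written without an element-wise lambda. The following is only a sketch of one alternative, using named capture groups and DataFrame.where; the group names birth and death are chosen here to match the column names above.

```python
import pandas as pd

df = pd.DataFrame({'txt': ['firstname lastname (1937-2015)',
                           'firstname lastname (1780-1820)',
                           'firstname lastname 1900',
                           'firstname lastname (1980-2022)']})

# Named groups become the column names of the extracted frame.
years = df['txt'].str.extract(r'(?P<birth>\d+)-(?P<death>\d+)').astype(float)

# Keep values in [1800, 2021]; anything else (including NaN) becomes NaN.
years = years.where((years >= 1800) & (years <= 2021))

df = df.join(years)
print(df)
```

This should give the same birth and death columns as the output above.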
CodePudding user response:
Use a regular expression to find the "from-to" year span, split it on the hyphen, and take the second element. You can use this with a DataFrame apply call to assign the result to a column.
import re

txt = "firstname lastname (1937-2015)"
pattern = r'(\d{4}-\d{4})'  # two four-digit years separated by a hyphen
matches = re.findall(pattern, txt)
print(matches[0].split('-')[1])
Output:
2015
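A minimal sketch of the apply step mentioned above could look like this. The helper name second_year, the column name death, and the second sample row are assumptions for illustration; the guard for rows without a match is an addition, not part of the snippet above.

```python
import re
import pandas as pd

def second_year(txt):
    """Return the second year of a 'dddd-dddd' span as a string, or None.

    Hypothetical helper wrapping the findall-and-split approach.
    """
    matches = re.findall(r'\d{4}-\d{4}', txt)
    # Guard against rows with no year range to avoid an IndexError.
    return matches[0].split('-')[1] if matches else None

df = pd.DataFrame({'txt': ['firstname lastname (1937-2015)',
                           'firstname lastname 1900']})
df['death'] = df['txt'].apply(second_year)
print(df)
```

The result is kept as text here; it can be converted afterwards if a numeric column is needed.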