How to extract the second year from text in Python?

I tried to extract the birth year and death year from a short text like the following, stored in a column of a Pandas DataFrame:

firstname lastname (1937-2015)

I used this code to get the first year:

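import re

def first_year(txt):  # function name is illustrative; only the body was posted
    # Collect every run of digits in the text
    data = re.findall(r'\d+', txt)
    if len(data) > 0:
        data = float(data[0])
        if 1800 <= data <= 2021:
            return data
    return None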

but I can't extract the second year from the text. When I change data[0] to data[1], for example, I get the error message "list index out of range".

CodePudding user response：

A generic regex solution to extract the second year (from 1800 to 2021) in Pandas is to use Series.str.extract:

import pandas as pd
df = pd.DataFrame({'col':['firstname lastname (1937-2015)']})
yr = r'(?:1[89][0-9]{2}|20[01][0-9]|202[01])'
df['second_year'] = df['col'].str.extract(fr'(?s)(?<!\d){yr}(?!\d).*?({yr})(?!\d)')
# => df['second_year']
#   0    2015
#   Name: second_year, dtype: object

Details:

(?s) - . now matches across lines
(?<!\d) - a left-hand numeric boundary
(?:1[89][0-9]{2}|20[01][0-9]|202[01]) - from 1800 to 2021
(?!\d) - a right-hand numeric boundary
.*? - any text, as few chars as possible
(1[89][0-9]{2}|20[01][0-9]|202[01]) - Group 1 (the actual return result of Series.str.extract): 1800 to 2021
(?!\d) - a right-hand numeric boundary
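
As a rough check of how these pieces behave together, here is a minimal sketch that runs the same pattern with the plain re module on a few strings. The names YR, SECOND_YEAR and second_year are hypothetical and only serve the illustration; the sample strings other than the first are made up.

```python
import re

# Same year alternative as in the pattern above: 1800 to 2021.
YR = r'(?:1[89][0-9]{2}|20[01][0-9]|202[01])'
# Hypothetical name; mirrors the pattern passed to Series.str.extract.
SECOND_YEAR = re.compile(fr'(?s)(?<!\d){YR}(?!\d).*?({YR})(?!\d)')

def second_year(text):
    """Return the second in-range year found in text, or None."""
    m = SECOND_YEAR.search(text)
    return m.group(1) if m else None

print(second_year('firstname lastname (1937-2015)'))  # 2015
print(second_year('id 119370 (1937-2015)'))           # 2015, the long number is skipped
print(second_year('firstname lastname 1900'))         # None, only one year
print(second_year('firstname lastname (1780-1820)'))  # None, 1780 is out of range
```

Note that the pattern looks for the second year inside the 1800 to 2021 range, not the second number in the text, which is why the last string yields no match.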

In this concrete case, a simple

df['second_year'] = df['col'].str.extract(r'.*-(\d{4})')

will be enough: any text (as many chars other than line break chars as possible) and then a - and four digits captured into Group 1.
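
A small sketch of that greedy behavior, assuming a Series with a few made-up sample names (only the first string comes from the question). Passing expand=False makes Series.str.extract return a Series instead of a one-column DataFrame.

```python
import pandas as pd

s = pd.Series(['firstname lastname (1937-2015)',
               'Jean-Paul lastname (1905-1980)',   # hyphen inside the name
               'firstname lastname 1900'])         # no hyphen at all

# Greedy .* backtracks to the LAST hyphen that is followed by four digits.
second = s.str.extract(r'.*-(\d{4})', expand=False)
print(second.tolist())
```

The hyphen in "Jean-Paul" does not get in the way because it is not followed by four digits, and the row without a hyphen produces a missing value.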


CodePudding user response：

You can define two capture groups and validate each captured year:

import pandas as pd

df = pd.DataFrame(
    {'txt': ['firstname lastname (1937-2015)', 'firstname lastname (1780-1820)',
             'firstname lastname 1900', 'firstname lastname (1980-2022)']})

# Capture both years as floats, then keep only values between 1800 and 2021
years = df['txt'].str.extract(r'(\d+)-(\d+)').astype(float)
df[['birth', 'death']] = years.where((years >= 1800) & (years <= 2021))
print(df)

Output:

                              txt   birth   death
0  firstname lastname (1937-2015)  1937.0  2015.0
1  firstname lastname (1780-1820)     NaN  1820.0
2         firstname lastname 1900     NaN     NaN
3  firstname lastname (1980-2022)  1980.0     NaN
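
Row 2 ends up with NaN in both columns because the pattern requires two numbers joined by a hyphen. If a lone year should count as the birth year, one possible alternative is sketched below with plain re.findall. The helper name extract_years and its lo/hi parameters are hypothetical, and treating a single year as the birth year is an assumption, not something stated in the question.

```python
import re

def extract_years(text, lo=1800, hi=2021):
    """Return (birth, death); a missing or out-of-range year becomes None."""
    # Standalone four-digit numbers only (not part of a longer digit run).
    found = [int(n) for n in re.findall(r'(?<!\d)\d{4}(?!\d)', text)]
    years = [y if lo <= y <= hi else None for y in found[:2]]
    years += [None] * (2 - len(years))  # pad so both positions always exist
    return tuple(years)

for row in ['firstname lastname (1937-2015)',
            'firstname lastname (1780-1820)',
            'firstname lastname 1900',
            'firstname lastname (1980-2022)']:
    print(extract_years(row))
# (1937, 2015)
# (None, 1820)
# (1900, None)
# (1980, None)
```

Padding the list to two entries is what avoids the "list index out of range" error from the question when a row contains fewer than two years.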

CodePudding user response：

Use a regular expression to find the "year-year" substring, then split it on the hyphen and take the second element. You can use this with DataFrame apply to assign the result to a column.

import re

txt = "firstname lastname (1937-2015)"
pattern = r'(\d{4}-\d{4})'

matches = re.findall(pattern, txt)
print(matches[0].split('-')[1])

output