How to correctly split a column that doesn't follow a pattern in Pandas?-CodePudding

I have a huge data frame that has a column that contains a date and location together and I want to extract just the year from this column. The problem is that it doesn't follow a pattern and I still couldn't figure out a way to do that. Here's a sample of the three different patterns I found in this table:

col
February 28, 2020 (United States)
April 1990 (United States)
1981 (United States)

Ideal output is:

col                                         yearcorrect
February 28, 2020 (United States)           2020
April 1990 (United States)                  1990
1981 (United States)                        1981

I managed to get the first and the third pattern right by doing this:

df['yearcorrect'] = df['col'].astype(str).str.split(', ').str[-1].astype(str).str[:4]

But I still have the problem with the middle pattern because when I do it, it returns "Apri". Any idea on how to get only the year of the 'col' and save it in a 'yearcorrect' column?

CodePudding user response：

I think a short regex would be more appropriate here:

df['yearcorrect'] = df['col'].str.extract(r'(\d{4})')

or to match before the parenthesis:

df['yearcorrect'] = df['col'].str.extract(r'(\d{4})\s*\(')

output:

                                 col yearcorrect
0  February 28, 2020 (United States)        2020
1         April 1990 (United States)        1990
2               1981 (United States)        1981