I have a huge data frame that has a column that contains a date and location together and I want to extract just the year from this column. The problem is that it doesn't follow a pattern and I still couldn't figure out a way to do that. Here's a sample of the three different patterns I found in this table:
col
February 28, 2020 (United States)
April 1990 (United States)
1981 (United States)
Ideal output is:
col yearcorrect
February 28, 2020 (United States) 2020
April 1990 (United States) 1990
1981 (United States) 1981
I managed to get the first and the third pattern right by doing this:
df['yearcorrect'] = df['col'].astype(str).str.split(', ').str[-1].astype(str).str[:4]
But I still have the problem with the middle pattern because when I do it, it returns "Apri". Any idea on how to get only the year of the 'col' and save it in a 'yearcorrect' column?
CodePudding user response:
I think a short regex would be more appropriate here:
df['yearcorrect'] = df['col'].str.extract(r'(\d{4})')
or to match before the parenthesis:
df['yearcorrect'] = df['col'].str.extract(r'(\d{4})\s*\(')
output:
col yearcorrect
0 February 28, 2020 (United States) 2020
1 April 1990 (United States) 1990
2 1981 (United States) 1981