Series split column with condition

Time:07-20

My pandas series contains year values. They're not formatted consistently. For example,

df['Year']

1994-1996
circa 1990
1995-1998
circa 2010

I'd like to grab the year from the string.

df['Year'] = df['Year'].astype(str)
df['Year'] = df['Year'].str[:4]

This doesn't work for rows with "circa", because the first four characters are "circ" rather than the year.
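A minimal sketch of the problem, using the four sample values from the question. The names `years` and `first_four` are made up for illustration:

```python
import pandas as pd

# Sample values copied from the question
years = pd.Series(["1994-1996", "circa 1990", "1995-1998", "circa 2010"])

# Positional slicing keeps the first four characters, whatever they are
first_four = years.astype(str).str[:4]
print(first_four.tolist())  # ['1994', 'circ', '1995', 'circ']
```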

I'd like to handle the rows with circa and grab only the year if it exists.

df['Year'] 

1994
1990
1995
2010

CodePudding user response:

# expand=False returns a Series holding the first 4-digit run found in each string
df['Year_Only'] = df['Year'].str.extract(r'(\d{4})', expand=False)

CodePudding user response:

You can use str.extract, then convert the result to pd.Int16Dtype:

df['Year'] = df['Year'].str.extract(r'(\d{4})', expand=False).astype(pd.Int16Dtype())
print(df)

# Output
   Year
0  1994
1  1990
2  1995
3  2010
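
For a self-contained run, here is a sketch that rebuilds the sample data and applies the same extraction. It makes a few assumptions that are not in the question: the extra "unknown" row is hypothetical and stands in for a value with no year, the result goes into a separate `Year_Only` column, and `pd.to_numeric` is used as an intermediate conversion step before casting to the nullable `Int16` dtype.

```python
import pandas as pd

# Sample values from the question, plus one hypothetical row with no year
df = pd.DataFrame(
    {"Year": ["1994-1996", "circa 1990", "1995-1998", "circa 2010", "unknown"]}
)

# Pull out the first run of exactly four digits; rows without one become NaN
extracted = df["Year"].astype(str).str.extract(r"(\d{4})", expand=False)

# Convert to numbers, then to a nullable integer dtype so missing years show as <NA>
df["Year_Only"] = pd.to_numeric(extracted).astype("Int16")

print(df)
```

Rows without a four-digit run end up as `<NA>` instead of raising an error, which matches "grab only the year if it exists".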