My pandas Series contains year values, but they're not formatted consistently. For example:
df['Year']
1994-1996
circa 1990
1995-1998
circa 2010
I'd like to grab the year from each string. I tried:
df['Year'] = df['Year'].astype(str)
df['Year'] = df['Year'].str[:4]
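This doesn't work for rows with circa, because the first four characters of those strings are "circ" rather than the year.
I'd like to handle the rows with circa as well and grab only the year, if it exists. Expected result: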
df['Year']
1994
1990
1995
2010
CodePudding user response:
df['Year_Only'] = df['Year'].str.extract(r'(\d{4})', expand=False)
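As a minimal, self-contained sketch of the idea above: the sample DataFrame below is reconstructed from the values shown in the question, and the variable name sliced is only illustrative. It compares the fixed-width slice from the question with a regex that captures the first run of four digits.

```python
import pandas as pd

# Sample data reconstructed from the question (an assumption about the real frame)
df = pd.DataFrame({"Year": ["1994-1996", "circa 1990", "1995-1998", "circa 2010"]})

# Fixed-width slicing takes the first four characters, whatever they are
sliced = df["Year"].str[:4]
print(sliced.tolist())  # ['1994', 'circ', '1995', 'circ']

# str.extract captures the first four-digit run anywhere in the string;
# expand=False returns a Series instead of a one-column DataFrame
df["Year_Only"] = df["Year"].str.extract(r"(\d{4})", expand=False)
print(df["Year_Only"].tolist())  # ['1994', '1990', '1995', '2010']
```

Note that the extracted values are still strings at this point; converting them to numbers is a separate step.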
CodePudding user response:
You can use str.extract, then convert the result to pd.Int16Dtype:
df['Year'] = df['Year'].str.extract(r'(\d{4})', expand=False).astype(pd.Int16Dtype())
print(df)
# Output
   Year
0  1994
1  1990
2  1995
3  2010
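Since the question asks for the year only "if it exists", here is a hedged sketch of how rows without any four-digit year might behave. The extra "unknown" and None entries are made up for illustration and are not in the original data. This variant also routes the strings through pd.to_numeric before the cast, which is a swapped-in, more conservative step than the direct astype shown above.

```python
import pandas as pd

# Hypothetical input: two rows contain no four-digit year at all
years = pd.Series(["1994-1996", "circa 1990", "unknown", None, "circa 2010"])

# Rows without a match come back as missing values
extracted = years.str.extract(r"(\d{4})", expand=False)

# Convert the digit strings to numbers, then to a nullable integer dtype
# so that missing rows stay missing instead of forcing a float column
result = pd.to_numeric(extracted, errors="coerce").astype("Int16")

print(result)
```

The nullable Int16 dtype keeps the matched years as integers and represents the unmatched rows as <NA>, which can then be dropped or filled as needed.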