Hello, I have been trying to standardize the values in the Year column: I want to get rid of the decimals and the inconsistent formats and keep only the years.
Is there an efficient way to do this in Pandas?
CodePudding user response:
Setup
import pandas as pd # 1.5.1
so = pd.DataFrame({
    "Countries": [*["Canada"]*5, *["Brazil"]*5],
    "Year": [1990.0, 1991.0, 1992.0, 1993.0, 1994.0, 2020.0, 2021.0, 2021.0, "2011-21", 2021.0],
    "Value": 1,  # placeholder
})
print(so)
  Countries     Year  Value
0    Canada   1990.0      1
1    Canada   1991.0      1
2    Canada   1992.0      1
3    Canada   1993.0      1
4    Canada   1994.0      1
5    Brazil   2020.0      1
6    Brazil   2021.0      1
7    Brazil   2021.0      1
8    Brazil  2011-21      1
9    Brazil   2021.0      1
Explanation
Inspecting the .dtype of so.Year, we get object, because the column mixes floats with a string:
print(so.Year.dtype)
object
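A quick way to see why the dtype is object is to count the Python types stored in the column. This is only a sketch: `year` is a shortened, hypothetical stand-in for so.Year, and `type_counts` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical, shortened stand-in for so.Year: two floats and one string.
year = pd.Series([1990.0, 2021.0, "2011-21"])

# Map every element to the name of its Python type, then count each name.
type_counts = year.map(lambda v: type(v).__name__).value_counts()

print(year.dtype)
print(type_counts)
```

A column holding more than one Python type like this falls back to object, which is why the numeric-looking years are not stored as a float or integer dtype.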
I'm assuming that every value in so.Year starts with a 4-digit year, so I convert the column to str and keep only the first four characters:
so["NewYear"] = so.Year.astype(str).str[:4]
print(so)
  Countries     Year  Value NewYear
0    Canada   1990.0      1    1990
1    Canada   1991.0      1    1991
2    Canada   1992.0      1    1992
3    Canada   1993.0      1    1993
4    Canada   1994.0      1    1994
5    Brazil   2020.0      1    2020
6    Brazil   2021.0      1    2021
7    Brazil   2021.0      1    2021
8    Brazil  2011-21      1    2011
9    Brazil   2021.0      1    2021
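If the 4-digit assumption might not hold for every row, one possible alternative (not part of the approach above) is to pull the year out with a regular expression via Series.str.extract. The sketch below rebuilds the same frame so that it runs on its own; the pattern is my assumption about what a year looks like in this data.

```python
import pandas as pd

so = pd.DataFrame({
    "Countries": ["Canada"] * 5 + ["Brazil"] * 5,
    "Year": [1990.0, 1991.0, 1992.0, 1993.0, 1994.0,
             2020.0, 2021.0, 2021.0, "2011-21", 2021.0],
    "Value": 1,  # placeholder
})

# Take the first four consecutive digits found in each value.
# Values containing no such run become NaN instead of a truncated string.
so["NewYear"] = so["Year"].astype(str).str.extract(r"(\d{4})", expand=False)

print(so)
```

On this data the result matches the slicing approach. The difference only shows up for values that do not begin with four digits, where slicing would silently keep whatever the first four characters happen to be.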
Now you can either use the NewYear column as-is, or convert it to some other dtype.
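As a rough illustration of the conversion step, here are two common options. The Series `years` and `messy` are hypothetical stand-ins for a NewYear column of year strings, with `messy` containing an entry that is not a year.

```python
import pandas as pd

# Hypothetical stand-in for a clean NewYear column (year strings only).
years = pd.Series(["1990", "2021", "2011"], name="NewYear")

# Option 1: plain integers. This raises if any value is not a valid integer.
years_int = years.astype(int)

# Hypothetical column with one entry that cannot be parsed as a number.
messy = pd.Series(["1990", "n/a", "2011"], name="NewYear")

# Option 2: coerce unparseable values to missing, then use the nullable
# integer dtype "Int64" so the remaining years stay integers.
messy_int = pd.to_numeric(messy, errors="coerce").astype("Int64")

print(years_int)
print(messy_int)
```

Which one fits depends on whether missing or malformed years can occur: astype(int) is the simplest when the column is known to be clean, while the to_numeric route tolerates bad entries.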