I can't find a way to add the month in a long format dataframe For example for each country, and each year I would like to have 12 values, once by month. And the value will be the same as the last one. It means that my column month should be range(1,13) for each country-year.
country_full year month rank rank_date total_points
Zimbabwe 2021 8 108 2021-08-12 1171.88
Zimbabwe 2021 10 108 2021-09-12 1171.88
Zimbabwe 2022 01 108 2022-01-12 1171.88
Germany 1994 01 10 1994-01-10 1171.88
Germany 1994 02 10 1994-02-09 1327.8
Germany 1994 04 10 1994-04-07 1459.9
The desired output would be :
country_full year month rank rank_date total_points
Zimbabwe 2021 8 108 2021-08-12 1171.88
Zimbabwe 2021 9 108 2021-08-12 1171.88
Zimbabwe 2021 10 108 2021-10-12 1171.88
Zimbabwe 2021 11 108 2021-11-12 1171.88
Zimbabwe 2021 12 108 2021-12-12 1171.88
Zimbabwe 2022 01 108 2022-01-12 1171.88
Germany 1994 01 10 1994-01-10 1171.88
Germany 1994 02 10 1994-02-10 1327.8
Germany 1994 03 10 1994-03-09 1459.9
Germany 1994 04 10 1994-03-07 1459.9
For each country and each year having 12 months. And the missing values for the total point column will be the one from the last month.
Any idea of how we can do that with pandas ?
CodePudding user response:
Assuming that all rank_date
are on the day 12
, you can use reindex
with MS
frequency and then use pd.offsets
to add days:
new_indexes = ( pd.date_range(start=f"{df['year'].min()}",
end=f"{df['year'].max() 1}", freq='MS', inclusive='left') pd.offsets.Day(11) )
# forward and backward fills and reindexing the columns to the actual format
new_df = ( df.set_index('rank_date').reindex(new_indexes).ffill().bfill().
reset_index().rename({'index':'rank_date'}, axis=1).reindex(df.columns, axis=1) )
# we need to update year and month columns
new_df['year'] = new_df['rank_date'].dt.year
new_df['month'] = new_df['rank_date'].dt.month
Output:
country_full year month rank rank_date total_points
0 Zimbabwe 2021 1 108.0 2021-01-12 1171.88
1 Zimbabwe 2021 2 108.0 2021-02-12 1171.88
2 Zimbabwe 2021 3 108.0 2021-03-12 1171.88
3 Zimbabwe 2021 4 108.0 2021-04-12 1171.88
4 Zimbabwe 2021 5 108.0 2021-05-12 1171.88
5 Zimbabwe 2021 6 108.0 2021-06-12 1171.88
6 Zimbabwe 2021 7 108.0 2021-07-12 1171.88
7 Zimbabwe 2021 8 108.0 2021-08-12 1171.88
8 Zimbabwe 2021 9 108.0 2021-09-12 1171.88
9 Zimbabwe 2021 10 108.0 2021-10-12 1171.88
10 Zimbabwe 2021 11 108.0 2021-11-12 1171.88
11 Zimbabwe 2021 12 108.0 2021-12-12 1171.88
12 Zimbabwe 2022 1 108.0 2022-01-12 1171.88
13 Zimbabwe 2022 2 108.0 2022-02-12 1171.88
14 Zimbabwe 2022 3 108.0 2022-03-12 1171.88
15 Zimbabwe 2022 4 108.0 2022-04-12 1171.88
16 Zimbabwe 2022 5 108.0 2022-05-12 1171.88
17 Zimbabwe 2022 6 108.0 2022-06-12 1171.88
18 Zimbabwe 2022 7 108.0 2022-07-12 1171.88
19 Zimbabwe 2022 8 108.0 2022-08-12 1171.88
20 Zimbabwe 2022 9 108.0 2022-09-12 1171.88
21 Zimbabwe 2022 10 108.0 2022-10-12 1171.88
22 Zimbabwe 2022 11 108.0 2022-11-12 1171.88
23 Zimbabwe 2022 12 108.0 2022-12-12 1171.88
CodePudding user response:
Here is how I would do it. As far as I can see it produces the desired output. (Used Input at the bottom of the answer).
Create the new time index with MS
(see documentation with the Offset aliases
and combine it to a Multiindex with the country.
(credits to @Nuri Taş on how to define start and end of the date_range
, I just copied that part)
df['date'] = pd.to_datetime(df[['year', 'month']].assign(day=1))
t_index = pd.DatetimeIndex(pd.date_range(start=f"{df['year'].min()}", end=f"{df['year'].max() 1}", freq="MS"))
new_index = pd.MultiIndex.from_product([t_index, df['country_full'].unique()])
Set the relevant columns as index and reindex with the new created Multiindex, sort the data, groupby the country to use ffill
and bfill
df = df.set_index(['date', 'country_full'])
out = (
df
.reindex(new_index)
.sort_index(level=1)
.groupby(level=1)
.ffill()
.bfill()
)
out['year'] = out.index.get_level_values(0).year
out['month'] = out.index.get_level_values(0).month
Output:
year month rank rank_date total_points
2020-01-01 Germany 2020 1 10.0 1994-01-12 1171.88
2020-02-01 Germany 2020 2 10.0 1994-02-12 1327.80
2020-03-01 Germany 2020 3 10.0 1994-02-12 1327.80
2020-04-01 Germany 2020 4 10.0 1994-04-12 1459.90
2020-05-01 Germany 2020 5 10.0 1994-04-12 1459.90
2020-06-01 Germany 2020 6 10.0 1994-04-12 1459.90
2020-07-01 Germany 2020 7 10.0 1994-04-12 1459.90
2020-08-01 Germany 2020 8 10.0 1994-04-12 1459.90
2020-09-01 Germany 2020 9 10.0 1994-04-12 1459.90
2020-10-01 Germany 2020 10 10.0 1994-04-12 1459.90
2020-11-01 Germany 2020 11 10.0 1994-04-12 1459.90
2020-12-01 Germany 2020 12 10.0 1994-04-12 1459.90
2021-01-01 Germany 2021 1 10.0 1994-04-12 1459.90
2021-02-01 Germany 2021 2 10.0 1994-04-12 1459.90
2021-03-01 Germany 2021 3 10.0 1994-04-12 1459.90
2021-04-01 Germany 2021 4 10.0 1994-04-12 1459.90
2021-05-01 Germany 2021 5 10.0 1994-04-12 1459.90
2021-06-01 Germany 2021 6 10.0 1994-04-12 1459.90
2021-07-01 Germany 2021 7 10.0 1994-04-12 1459.90
2021-08-01 Germany 2021 8 10.0 1994-04-12 1459.90
2021-09-01 Germany 2021 9 10.0 1994-04-12 1459.90
2021-10-01 Germany 2021 10 10.0 1994-04-12 1459.90
2021-11-01 Germany 2021 11 10.0 1994-04-12 1459.90
2021-12-01 Germany 2021 12 10.0 1994-04-12 1459.90
2022-01-01 Germany 2022 1 10.0 1994-04-12 1459.90
2022-02-01 Germany 2022 2 10.0 1994-04-12 1459.90
2022-03-01 Germany 2022 3 10.0 1994-04-12 1459.90
2022-04-01 Germany 2022 4 10.0 1994-04-12 1459.90
2022-05-01 Germany 2022 5 10.0 1994-04-12 1459.90
2022-06-01 Germany 2022 6 10.0 1994-04-12 1459.90
2022-07-01 Germany 2022 7 10.0 1994-04-12 1459.90
2022-08-01 Germany 2022 8 10.0 1994-04-12 1459.90
2022-09-01 Germany 2022 9 10.0 1994-04-12 1459.90
2022-10-01 Germany 2022 10 10.0 1994-04-12 1459.90
2022-11-01 Germany 2022 11 10.0 1994-04-12 1459.90
2022-12-01 Germany 2022 12 10.0 1994-04-12 1459.90
2023-01-01 Germany 2023 1 10.0 1994-04-12 1459.90
2020-01-01 Zimbabwe 2020 1 108.0 2021-08-12 1171.88
2020-02-01 Zimbabwe 2020 2 108.0 2021-08-12 1171.88
2020-03-01 Zimbabwe 2020 3 108.0 2021-08-12 1171.88
2020-04-01 Zimbabwe 2020 4 108.0 2021-08-12 1171.88
2020-05-01 Zimbabwe 2020 5 108.0 2021-08-12 1171.88
2020-06-01 Zimbabwe 2020 6 108.0 2021-08-12 1171.88
2020-07-01 Zimbabwe 2020 7 108.0 2021-08-12 1171.88
2020-08-01 Zimbabwe 2020 8 108.0 2021-08-12 1171.88
2020-09-01 Zimbabwe 2020 9 108.0 2021-08-12 1171.88
2020-10-01 Zimbabwe 2020 10 108.0 2021-08-12 1171.88
2020-11-01 Zimbabwe 2020 11 108.0 2021-08-12 1171.88
2020-12-01 Zimbabwe 2020 12 108.0 2021-08-12 1171.88
2021-01-01 Zimbabwe 2021 1 108.0 2021-08-12 1171.88
2021-02-01 Zimbabwe 2021 2 108.0 2021-08-12 1171.88
2021-03-01 Zimbabwe 2021 3 108.0 2021-08-12 1171.88
2021-04-01 Zimbabwe 2021 4 108.0 2021-08-12 1171.88
2021-05-01 Zimbabwe 2021 5 108.0 2021-08-12 1171.88
2021-06-01 Zimbabwe 2021 6 108.0 2021-08-12 1171.88
2021-07-01 Zimbabwe 2021 7 108.0 2021-08-12 1171.88
2021-08-01 Zimbabwe 2021 8 108.0 2021-08-12 1171.88
2021-09-01 Zimbabwe 2021 9 108.0 2021-08-12 1171.88
2021-10-01 Zimbabwe 2021 10 108.0 2021-09-12 1171.88
2021-11-01 Zimbabwe 2021 11 108.0 2021-09-12 1171.88
2021-12-01 Zimbabwe 2021 12 108.0 2021-09-12 1171.88
2022-01-01 Zimbabwe 2022 1 108.0 2022-01-12 1120.00
2022-02-01 Zimbabwe 2022 2 108.0 2022-01-12 1120.00
2022-03-01 Zimbabwe 2022 3 108.0 2022-01-12 1120.00
2022-04-01 Zimbabwe 2022 4 108.0 2022-01-12 1120.00
2022-05-01 Zimbabwe 2022 5 108.0 2022-01-12 1120.00
2022-06-01 Zimbabwe 2022 6 108.0 2022-01-12 1120.00
2022-07-01 Zimbabwe 2022 7 108.0 2022-01-12 1120.00
2022-08-01 Zimbabwe 2022 8 108.0 2022-01-12 1120.00
2022-09-01 Zimbabwe 2022 9 108.0 2022-01-12 1120.00
2022-10-01 Zimbabwe 2022 10 108.0 2022-01-12 1120.00
2022-11-01 Zimbabwe 2022 11 108.0 2022-01-12 1120.00
2022-12-01 Zimbabwe 2022 12 108.0 2022-01-12 1120.00
2023-01-01 Zimbabwe 2023 1 108.0 2022-01-12 1120.00
Data I used as Input:
df = pd.DataFrame({
'country_full': ['Zimbabwe','Zimbabwe','Zimbabwe','Germany','Germany','Germany'],
'year': [2021, 2021, 2022, 2020, 2020, 2020],
'month': [8, 10, 1, 1, 2, 4],
'rank': [108, 108, 108, 10, 10, 10],
'rank_date': ['2021-08-12','2021-09-12','2022-01-12','1994-01-12','1994-02-12','1994-04-12'],
'total_points': [1171.88, 1171.88, 1120.0, 1171.88, 1327.8, 1459.9]
})