Home > OS >  pandas dataframe in a long format missing months
pandas dataframe in a long format missing months

Time:10-24

I can't find a way to add the month in a long format dataframe For example for each country, and each year I would like to have 12 values, once by month. And the value will be the same as the last one. It means that my column month should be range(1,13) for each country-year.

country_full    year    month        rank   rank_date   total_points    
Zimbabwe       2021        8         108    2021-08-12  1171.88
Zimbabwe       2021        10        108    2021-09-12  1171.88
Zimbabwe       2022        01        108    2022-01-12  1171.88
Germany        1994        01        10     1994-01-10  1171.88
Germany        1994        02        10     1994-02-09  1327.8
Germany        1994        04        10     1994-04-07  1459.9

The desired output would be :

country_full    year    month        rank   rank_date   total_points    
    Zimbabwe       2021        8         108    2021-08-12  1171.88
    Zimbabwe       2021        9         108    2021-08-12  1171.88
    Zimbabwe       2021        10        108    2021-10-12  1171.88
    Zimbabwe       2021        11        108    2021-11-12  1171.88
    Zimbabwe       2021        12        108    2021-12-12  1171.88
    Zimbabwe       2022        01        108    2022-01-12  1171.88
    Germany        1994        01        10     1994-01-10  1171.88
    Germany        1994        02        10     1994-02-10  1327.8
    Germany        1994        03        10     1994-03-09  1459.9
    Germany        1994        04        10     1994-03-07  1459.9

For each country and each year having 12 months. And the missing values for the total point column will be the one from the last month.

Any idea of how we can do that with pandas ?

CodePudding user response:

Assuming that all rank_date are on the day 12, you can use reindex with MS frequency and then use pd.offsets to add days:

new_indexes = ( pd.date_range(start=f"{df['year'].min()}", 
  end=f"{df['year'].max()   1}", freq='MS', inclusive='left')   pd.offsets.Day(11) )

# forward and backward fills and reindexing the columns to the actual format
new_df = ( df.set_index('rank_date').reindex(new_indexes).ffill().bfill().
      reset_index().rename({'index':'rank_date'}, axis=1).reindex(df.columns, axis=1) )

# we need to update year and month columns                                                                                                                           
new_df['year'] = new_df['rank_date'].dt.year
new_df['month'] = new_df['rank_date'].dt.month

Output:

country_full    year    month   rank    rank_date   total_points
0   Zimbabwe    2021    1   108.0   2021-01-12  1171.88
1   Zimbabwe    2021    2   108.0   2021-02-12  1171.88
2   Zimbabwe    2021    3   108.0   2021-03-12  1171.88
3   Zimbabwe    2021    4   108.0   2021-04-12  1171.88
4   Zimbabwe    2021    5   108.0   2021-05-12  1171.88
5   Zimbabwe    2021    6   108.0   2021-06-12  1171.88
6   Zimbabwe    2021    7   108.0   2021-07-12  1171.88
7   Zimbabwe    2021    8   108.0   2021-08-12  1171.88
8   Zimbabwe    2021    9   108.0   2021-09-12  1171.88
9   Zimbabwe    2021    10  108.0   2021-10-12  1171.88
10  Zimbabwe    2021    11  108.0   2021-11-12  1171.88
11  Zimbabwe    2021    12  108.0   2021-12-12  1171.88
12  Zimbabwe    2022    1   108.0   2022-01-12  1171.88
13  Zimbabwe    2022    2   108.0   2022-02-12  1171.88
14  Zimbabwe    2022    3   108.0   2022-03-12  1171.88
15  Zimbabwe    2022    4   108.0   2022-04-12  1171.88
16  Zimbabwe    2022    5   108.0   2022-05-12  1171.88
17  Zimbabwe    2022    6   108.0   2022-06-12  1171.88
18  Zimbabwe    2022    7   108.0   2022-07-12  1171.88
19  Zimbabwe    2022    8   108.0   2022-08-12  1171.88
20  Zimbabwe    2022    9   108.0   2022-09-12  1171.88
21  Zimbabwe    2022    10  108.0   2022-10-12  1171.88
22  Zimbabwe    2022    11  108.0   2022-11-12  1171.88
23  Zimbabwe    2022    12  108.0   2022-12-12  1171.88

CodePudding user response:

Here is how I would do it. As far as I can see it produces the desired output. (Used Input at the bottom of the answer).

Create the new time index with MS (see documentation with the Offset aliases and combine it to a Multiindex with the country. (credits to @Nuri Taş on how to define start and end of the date_range, I just copied that part)

df['date'] = pd.to_datetime(df[['year', 'month']].assign(day=1))
t_index = pd.DatetimeIndex(pd.date_range(start=f"{df['year'].min()}", end=f"{df['year'].max()   1}", freq="MS"))

new_index = pd.MultiIndex.from_product([t_index, df['country_full'].unique()])

Set the relevant columns as index and reindex with the new created Multiindex, sort the data, groupby the country to use ffill and bfill

df = df.set_index(['date', 'country_full'])

out = (
    df
    .reindex(new_index)
    .sort_index(level=1)
    .groupby(level=1)
    .ffill()
    .bfill()
)

out['year'] = out.index.get_level_values(0).year
out['month'] = out.index.get_level_values(0).month

Output:

                     year  month   rank   rank_date  total_points
2020-01-01 Germany   2020      1   10.0  1994-01-12       1171.88
2020-02-01 Germany   2020      2   10.0  1994-02-12       1327.80
2020-03-01 Germany   2020      3   10.0  1994-02-12       1327.80
2020-04-01 Germany   2020      4   10.0  1994-04-12       1459.90
2020-05-01 Germany   2020      5   10.0  1994-04-12       1459.90
2020-06-01 Germany   2020      6   10.0  1994-04-12       1459.90
2020-07-01 Germany   2020      7   10.0  1994-04-12       1459.90
2020-08-01 Germany   2020      8   10.0  1994-04-12       1459.90
2020-09-01 Germany   2020      9   10.0  1994-04-12       1459.90
2020-10-01 Germany   2020     10   10.0  1994-04-12       1459.90
2020-11-01 Germany   2020     11   10.0  1994-04-12       1459.90
2020-12-01 Germany   2020     12   10.0  1994-04-12       1459.90
2021-01-01 Germany   2021      1   10.0  1994-04-12       1459.90
2021-02-01 Germany   2021      2   10.0  1994-04-12       1459.90
2021-03-01 Germany   2021      3   10.0  1994-04-12       1459.90
2021-04-01 Germany   2021      4   10.0  1994-04-12       1459.90
2021-05-01 Germany   2021      5   10.0  1994-04-12       1459.90
2021-06-01 Germany   2021      6   10.0  1994-04-12       1459.90
2021-07-01 Germany   2021      7   10.0  1994-04-12       1459.90
2021-08-01 Germany   2021      8   10.0  1994-04-12       1459.90
2021-09-01 Germany   2021      9   10.0  1994-04-12       1459.90
2021-10-01 Germany   2021     10   10.0  1994-04-12       1459.90
2021-11-01 Germany   2021     11   10.0  1994-04-12       1459.90
2021-12-01 Germany   2021     12   10.0  1994-04-12       1459.90
2022-01-01 Germany   2022      1   10.0  1994-04-12       1459.90
2022-02-01 Germany   2022      2   10.0  1994-04-12       1459.90
2022-03-01 Germany   2022      3   10.0  1994-04-12       1459.90
2022-04-01 Germany   2022      4   10.0  1994-04-12       1459.90
2022-05-01 Germany   2022      5   10.0  1994-04-12       1459.90
2022-06-01 Germany   2022      6   10.0  1994-04-12       1459.90
2022-07-01 Germany   2022      7   10.0  1994-04-12       1459.90
2022-08-01 Germany   2022      8   10.0  1994-04-12       1459.90
2022-09-01 Germany   2022      9   10.0  1994-04-12       1459.90
2022-10-01 Germany   2022     10   10.0  1994-04-12       1459.90
2022-11-01 Germany   2022     11   10.0  1994-04-12       1459.90
2022-12-01 Germany   2022     12   10.0  1994-04-12       1459.90
2023-01-01 Germany   2023      1   10.0  1994-04-12       1459.90
2020-01-01 Zimbabwe  2020      1  108.0  2021-08-12       1171.88
2020-02-01 Zimbabwe  2020      2  108.0  2021-08-12       1171.88
2020-03-01 Zimbabwe  2020      3  108.0  2021-08-12       1171.88
2020-04-01 Zimbabwe  2020      4  108.0  2021-08-12       1171.88
2020-05-01 Zimbabwe  2020      5  108.0  2021-08-12       1171.88
2020-06-01 Zimbabwe  2020      6  108.0  2021-08-12       1171.88
2020-07-01 Zimbabwe  2020      7  108.0  2021-08-12       1171.88
2020-08-01 Zimbabwe  2020      8  108.0  2021-08-12       1171.88
2020-09-01 Zimbabwe  2020      9  108.0  2021-08-12       1171.88
2020-10-01 Zimbabwe  2020     10  108.0  2021-08-12       1171.88
2020-11-01 Zimbabwe  2020     11  108.0  2021-08-12       1171.88
2020-12-01 Zimbabwe  2020     12  108.0  2021-08-12       1171.88
2021-01-01 Zimbabwe  2021      1  108.0  2021-08-12       1171.88
2021-02-01 Zimbabwe  2021      2  108.0  2021-08-12       1171.88
2021-03-01 Zimbabwe  2021      3  108.0  2021-08-12       1171.88
2021-04-01 Zimbabwe  2021      4  108.0  2021-08-12       1171.88
2021-05-01 Zimbabwe  2021      5  108.0  2021-08-12       1171.88
2021-06-01 Zimbabwe  2021      6  108.0  2021-08-12       1171.88
2021-07-01 Zimbabwe  2021      7  108.0  2021-08-12       1171.88
2021-08-01 Zimbabwe  2021      8  108.0  2021-08-12       1171.88
2021-09-01 Zimbabwe  2021      9  108.0  2021-08-12       1171.88
2021-10-01 Zimbabwe  2021     10  108.0  2021-09-12       1171.88
2021-11-01 Zimbabwe  2021     11  108.0  2021-09-12       1171.88
2021-12-01 Zimbabwe  2021     12  108.0  2021-09-12       1171.88
2022-01-01 Zimbabwe  2022      1  108.0  2022-01-12       1120.00
2022-02-01 Zimbabwe  2022      2  108.0  2022-01-12       1120.00
2022-03-01 Zimbabwe  2022      3  108.0  2022-01-12       1120.00
2022-04-01 Zimbabwe  2022      4  108.0  2022-01-12       1120.00
2022-05-01 Zimbabwe  2022      5  108.0  2022-01-12       1120.00
2022-06-01 Zimbabwe  2022      6  108.0  2022-01-12       1120.00
2022-07-01 Zimbabwe  2022      7  108.0  2022-01-12       1120.00
2022-08-01 Zimbabwe  2022      8  108.0  2022-01-12       1120.00
2022-09-01 Zimbabwe  2022      9  108.0  2022-01-12       1120.00
2022-10-01 Zimbabwe  2022     10  108.0  2022-01-12       1120.00
2022-11-01 Zimbabwe  2022     11  108.0  2022-01-12       1120.00
2022-12-01 Zimbabwe  2022     12  108.0  2022-01-12       1120.00
2023-01-01 Zimbabwe  2023      1  108.0  2022-01-12       1120.00

Data I used as Input:

df = pd.DataFrame({
    'country_full': ['Zimbabwe','Zimbabwe','Zimbabwe','Germany','Germany','Germany'],
    'year': [2021, 2021, 2022, 2020, 2020, 2020],
    'month': [8, 10, 1, 1, 2, 4],
    'rank': [108, 108, 108, 10, 10, 10],
    'rank_date': ['2021-08-12','2021-09-12','2022-01-12','1994-01-12','1994-02-12','1994-04-12'],
    'total_points': [1171.88, 1171.88, 1120.0, 1171.88, 1327.8, 1459.9]
})
  • Related