Home > Software design >  Python pandas dataframe with daily data - keep first and last rows per month
Python pandas dataframe with daily data - keep first and last rows per month

Time:09-27

I have a Python pandas dataframe that looks like this:

print(dataframe1.head(30))
                    Date   Close* month_initial  month day  year
date_final                                                      
2022-09-23  Sep 23, 2022  3693.23           Sep    9.0  23  2022
2022-09-22  Sep 22, 2022  3757.99           Sep    9.0  22  2022
2022-09-21  Sep 21, 2022  3789.93           Sep    9.0  21  2022
2022-09-20  Sep 20, 2022  3855.93           Sep    9.0  20  2022
2022-09-19  Sep 19, 2022  3899.89           Sep    9.0  19  2022
2022-09-16  Sep 16, 2022  3873.33           Sep    9.0  16  2022
2022-09-15  Sep 15, 2022  3901.35           Sep    9.0  15  2022
2022-09-14  Sep 14, 2022  3946.01           Sep    9.0  14  2022
2022-09-13  Sep 13, 2022  3932.69           Sep    9.0  13  2022
2022-09-12  Sep 12, 2022  4110.41           Sep    9.0  12  2022
2022-09-09  Sep 09, 2022  4067.36           Sep    9.0  09  2022
2022-09-08  Sep 08, 2022  4006.18           Sep    9.0  08  2022
2022-09-07  Sep 07, 2022  3979.87           Sep    9.0  07  2022
2022-09-06  Sep 06, 2022  3908.19           Sep    9.0  06  2022
2022-09-02  Sep 02, 2022  3924.26           Sep    9.0  02  2022
2022-09-01  Sep 01, 2022  3966.85           Sep    9.0  01  2022
2022-08-31  Aug 31, 2022  3955.00           Aug    8.0  31  2022
2022-08-30  Aug 30, 2022  3986.16           Aug    8.0  30  2022
2022-08-29  Aug 29, 2022  4030.61           Aug    8.0  29  2022
2022-08-26  Aug 26, 2022  4057.66           Aug    8.0  26  2022
2022-08-25  Aug 25, 2022  4199.12           Aug    8.0  25  2022
2022-08-24  Aug 24, 2022  4140.77           Aug    8.0  24  2022
2022-08-23  Aug 23, 2022  4128.73           Aug    8.0  23  2022
2022-08-22  Aug 22, 2022  4137.99           Aug    8.0  22  2022
2022-08-19  Aug 19, 2022  4228.48           Aug    8.0  19  2022
2022-08-18  Aug 18, 2022  4283.74           Aug    8.0  18  2022
2022-08-17  Aug 17, 2022  4274.04           Aug    8.0  17  2022
2022-08-16  Aug 16, 2022  4305.20           Aug    8.0  16  2022
2022-08-15  Aug 15, 2022  4297.14           Aug    8.0  15  2022
2022-08-12  Aug 12, 2022  4280.15           Aug    8.0  12  2022

I want to keep the first and the last row per month. How can I do that? I tried using the following code:

import pandas as pd
dataframe1.set_index("date_final", inplace=True)
resultDf = dataframe1.groupby([dataframe1.index.year, dataframe1.index.month]).agg(["first", "last"])
resultDf.index.rename(["year", "month"], inplace=True)
resultDf.reset_index(inplace=True)
resultDf

but I don't get the results I want.

CodePudding user response:

pandas groupby operations don't sort each group prior to aggregation, which is why 'first' and 'last' are not selecting the correct rows for you.

Additionally, you can use .resample('M') instead of a groupby on year & month.

out = (
    df.set_index(df.index.astype('datetime64[ns]')) # copying in the data, I lost the datetime index
    .sort_index()  # sort ensures first and last work as expected
    .resample('M') # resample for a shorthand year/month grouping
    .agg(['first', 'last'])
)

print(out)
                    Date                 Close*          month_initial      month        day       year      
                   first          last    first     last         first last first last first last first  last
date_final                                                                                                   
2022-08-31  Aug 12, 2022  Aug 31, 2022  4280.15  3955.00           Aug  Aug   8.0  8.0    12   31  2022  2022
2022-09-30  Sep 01, 2022  Sep 23, 2022  3966.85  3693.23           Sep  Sep   9.0  9.0     1   23  2022  2022

This output doesn't have the most usable format, so we can use a quick .stack to remedy it:

out = out.stack()

print(out)
                          Date   Close* month_initial  month  day  year
date_final                                                             
2022-08-31 first  Aug 12, 2022  4280.15           Aug    8.0   12  2022
           last   Aug 31, 2022  3955.00           Aug    8.0   31  2022
2022-09-30 first  Sep 01, 2022  3966.85           Sep    9.0    1  2022
           last   Sep 23, 2022  3693.23           Sep    9.0   23  2022

CodePudding user response:

You can use the is_month_start and is_month_end functions

In your case, the following should work

dataframe1.set_index("date_final", inplace=True) 
dataframe1 = dataframe1[dataframe1.index.is_month_start | dataframe1.index.is_month_end]
  • Related