Keep only the rows with the first occurring date of each year and month in a pandas DataFrame

Time:11-19

For each year and month, I would like to keep only the row with the first date that occurs in the DataFrame and drop all other rows. Here is an example:

import pandas as pd

df = pd.DataFrame([['2016-02-05', 22], ['2016-02-15', 15], ['2016-05-03', 18], ['2016-05-20', 9], ['2017-03-02', 10], ['2018-04-01', 11], ['2018-04-02', 12]],
                  columns=['date', 'qty'])
         date  qty
0  2016-02-05   22
1  2016-02-15   15
2  2016-05-03   18
3  2016-05-20    9
4  2017-03-02   10
5  2018-04-01   11
6  2018-04-02   12

I want the above DataFrame to become:

         date  qty
0  2016-02-05   22
2  2016-05-03   18
4  2017-03-02   10
5  2018-04-01   11

I converted the 'date' column to datetime and tried to do this in a loop, but I couldn't get it to work, and I'm sure there is a more efficient way than looping. Thanks for your help!

CodePudding user response:

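Try with resample. It creates one row for every month in the date range, so months without data come back as NaN and are removed with dropna (this is also why qty becomes float):

# convert to datetime if needed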
df["date"] = pd.to_datetime(df["date"])
output = df.resample("M", on="date").first().dropna().reset_index(drop=True)

>>> output
        date   qty
0 2016-02-05  22.0
1 2016-05-03  18.0
2 2017-03-02  10.0
3 2018-04-01  11.0

If you want to keep your original index, you can do:

output = df.assign(m=df["date"].dt.to_period("M")).drop_duplicates("m").drop("m", axis=1)

>>> output
        date  qty
0 2016-02-05   22
2 2016-05-03   18
4 2017-03-02   10
5 2018-04-01   11
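
To see why this works, it can help to look at the intermediate year-month key. The following is a minimal, self-contained sketch using the data from the question; the names month_key, is_first and first_rows are only illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame(
    [["2016-02-05", 22], ["2016-02-15", 15], ["2016-05-03", 18], ["2016-05-20", 9],
     ["2017-03-02", 10], ["2018-04-01", 11], ["2018-04-02", 12]],
    columns=["date", "qty"],
)
df["date"] = pd.to_datetime(df["date"])

# One year-month key per row, e.g. 2016-02 for both 2016-02-05 and 2016-02-15
month_key = df["date"].dt.to_period("M")

# True for the first row seen with each key, False for later repeats
is_first = ~month_key.duplicated()

# Boolean indexing keeps the original index and dtypes
first_rows = df[is_first]
print(first_rows)
```

Because no helper column is added to the DataFrame, nothing needs to be dropped afterwards.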

CodePudding user response:

Another way, presumably quicker since it avoids adding a temporary column to the DataFrame:

df.filter(df['date'].dt.to_period('M').drop_duplicates().index, axis=0)
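
Note that the drop_duplicates-based approaches keep the first row encountered for each month, which is the earliest date only if the rows are already sorted by date, as they are in the question. If that is not guaranteed, a hedged sketch of two order-independent variants could look like this (the shuffled data and the names by_sort, earliest and by_group are made up for illustration):

```python
import pandas as pd

# Same rows as in the question, deliberately shuffled
df = pd.DataFrame(
    [["2016-02-15", 15], ["2018-04-02", 12], ["2016-02-05", 22], ["2017-03-02", 10],
     ["2016-05-20", 9], ["2018-04-01", 11], ["2016-05-03", 18]],
    columns=["date", "qty"],
)
df["date"] = pd.to_datetime(df["date"])

# Variant 1: sort by date first, then keep the first row of each year-month
by_sort = df.sort_values("date")
by_sort = by_sort[~by_sort["date"].dt.to_period("M").duplicated()]

# Variant 2: look up the index label of the earliest date in each year-month
earliest = df.groupby(df["date"].dt.to_period("M"))["date"].idxmin()
by_group = df.loc[earliest]

print(by_group)
```

Both variants select the same rows here; the second one avoids sorting the whole DataFrame but returns the rows ordered by month rather than in their original order.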