I would like to keep only the first row (the earliest date) for each year and month in a DataFrame, and drop all the other rows. Here is an example:
import pandas as pd
df = pd.DataFrame([['2016-02-05', 22], ['2016-02-15', 15], ['2016-05-03', 18], ['2016-05-20', 9], ['2017-03-02', 10], ['2018-04-01', 11], ['2018-04-02', 12]],
                  columns=['date', 'qty'])
         date  qty
0  2016-02-05   22
1  2016-02-15   15
2  2016-05-03   18
3  2016-05-20    9
4  2017-03-02   10
5  2018-04-01   11
6  2018-04-02   12
I want the above DataFrame to become:
         date  qty
0  2016-02-05   22
2  2016-05-03   18
4  2017-03-02   10
5  2018-04-01   11
I converted the 'date' column to datetime and tried to do this in a loop. However, I couldn't get it to work, and I'm sure there is a more efficient way than looping. Thanks for your help!
CodePudding user response:
Try with resample:
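# convert to datetime if needed
df["date"] = pd.to_datetime(df["date"])
# first row of each calendar month; dropna() removes the empty months that resample creates
# (pandas 2.2+ deprecates the "M" alias in favour of "ME")
output = df.resample("M", on="date").first().dropna().reset_index(drop=True)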
>>> output
        date   qty
0 2016-02-05  22.0
1 2016-05-03  18.0
2 2017-03-02  10.0
3 2018-04-01  11.0
If you want to keep your original index, you can do:
output = df.assign(m=df["date"].dt.to_period("M")).drop_duplicates("m").drop("m", axis=1)
>>> output
        date  qty
0 2016-02-05   22
2 2016-05-03   18
4 2017-03-02   10
5 2018-04-01   11
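Note that drop_duplicates keeps the first row in row order, so the snippet above assumes the frame is already sorted by date. A minimal sketch that sorts first could look like this. The helper name first_row_per_month is my own (not part of pandas), and it assumes the index labels are unique:

```python
import pandas as pd

def first_row_per_month(df, date_col="date"):
    """Hypothetical helper: keep the earliest-dated row of each (year, month).

    Assumes the index labels of df are unique. The original index is preserved.
    """
    dates = pd.to_datetime(df[date_col])
    # Sort by date so that "first" means earliest, even for unordered input.
    ordered = dates.sort_values(kind="mergesort")
    # Map each date to its year-month period, e.g. 2016-02.
    months = ordered.dt.to_period("M")
    # Index labels of the first row seen for each period.
    keep = months.drop_duplicates().index
    return df.loc[keep]

# Deliberately unordered input: the February rows are swapped.
df = pd.DataFrame(
    [["2016-02-15", 15], ["2016-02-05", 22], ["2016-05-03", 18], ["2016-05-20", 9]],
    columns=["date", "qty"],
)
result = first_row_per_month(df)
print(result)
```

The helper does not modify df, so the 'date' column keeps whatever dtype it had on input.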
CodePudding user response:
Another way, presumably quicker since it avoids adding a temporary column to the DataFrame:
df.filter(df['date'].dt.to_period('M').drop_duplicates().index, axis=0)
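For what it's worth, here is a self-contained sketch of that one-liner on the data from the question, next to a boolean-mask variant using duplicated(). The mask version is my own variation, not something from the answer above, and the names months, via_filter and via_mask are only for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    [["2016-02-05", 22], ["2016-02-15", 15], ["2016-05-03", 18], ["2016-05-20", 9],
     ["2017-03-02", 10], ["2018-04-01", 11], ["2018-04-02", 12]],
    columns=["date", "qty"],
)
df["date"] = pd.to_datetime(df["date"])

# Year-month period of every row, e.g. 2016-02.
months = df["date"].dt.to_period("M")

# filter-based version: select the index labels of the first row of each month.
via_filter = df.filter(months.drop_duplicates().index, axis=0)

# Boolean-mask variant: keep rows whose month has not appeared earlier.
via_mask = df[~months.duplicated()]

print(via_mask)
```

Both versions keep the original index (0, 2, 4, 5) and, like drop_duplicates, take the first row in row order, so they assume the frame is sorted by date.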