Home > Back-end >  When multiple symbols and days occur, how to only keep the first occurrence of the day and symbol?
When multiple symbols and days occur, how to only keep the first occurrence of the day and symbol?

Time:11-25

If I have a dataframe of daily data which contain symbols and different dates:

     level_0   index        date symbol    open  ...  volume_10_day  is_downtrending  is_downtrending_lookback  consolidating_10  consolidating_10_lookback
0       3608    3608  2022-10-26   CIFR  0.8600  ...         3883.2                0                         0                 0                          1
1      11367   11367  2022-09-12   CLVS  1.2800  ...        24749.8                0                         0                 0                          1
2      13031   13031  2022-10-06    CGC  3.0700  ...      3807474.9                0                         0                 0                          1
3      13044   13044  2022-10-25    CGC  2.4000  ...      4213340.1                0                         0                 0                          1
4      13864   13864  2022-09-02   CMCM  4.9100  ...         3560.0                0                         0                 0                          1
..       ...     ...         ...    ...     ...  ...            ...              ...                       ...               ...                        ...
353   684622  684622  2022-10-24   SOBR  3.2500  ...        65830.2                0                         0                 0                          1
354   685045  685045  2022-08-29   SNTG  2.6500  ...        12765.3                0                         1                 0                          1
355   685093  685093  2022-11-04   SNTG  4.6889  ...     17969582.7                0                         0                 0                          0
356   686851  686851  2022-10-11    WNW  0.8700  ...         5172.1                0                         0                 0                          1
357   688103  688103  2022-10-11    BHG  0.8750  ...         1489.5                0                         1                 0                          1

[358 rows x 18 columns]

Sometimes, there are multiplies of the same days but with different symbols. For example, on 2022-10-11 there are two symbols which occur: WNW, BHG.

356   686851  686851  2022-10-11    WNW  0.8700  ...         5172.1                0                         0                 0                          1
357   688103  688103  2022-10-11    BHG  0.8750  ...         1489.5                0                         1                 0                          1

When this happens, I only want the first instance to be returned (all other symbols occurring on the same day should be removed), something like:

     level_0   index        date symbol    open  ...  volume_10_day  is_downtrending  is_downtrending_lookback  consolidating_10  consolidating_10_lookback
0       3608    3608  2022-10-26   CIFR  0.8600  ...         3883.2                0                         0                 0                          1
1      11367   11367  2022-09-12   CLVS  1.2800  ...        24749.8                0                         0                 0                          1
2      13031   13031  2022-10-06    CGC  3.0700  ...      3807474.9                0                         0                 0                          1
3      13044   13044  2022-10-25    CGC  2.4000  ...      4213340.1                0                         0                 0                          1
4      13864   13864  2022-09-02   CMCM  4.9100  ...         3560.0                0                         0                 0                          1
..       ...     ...         ...    ...     ...  ...            ...              ...                       ...               ...                        ...
353   684622  684622  2022-10-24   SOBR  3.2500  ...        65830.2                0                         0                 0                          1
354   685045  685045  2022-08-29   SNTG  2.6500  ...        12765.3                0                         1                 0                          1
355   685093  685093  2022-11-04   SNTG  4.6889  ...     17969582.7                0                         0                 0                          0
356   686851  686851  2022-10-11    WNW  0.8700  ...         5172.1                0                         0                 0                          1

[357 rows x 18 columns]

Where in the duplicate of WNW, BHG, only the first one (WNW) is returned.

How can I do this? Something like:

df_filtered.drop_duplicates(subset=['date', 'symbol'], inplace=True)

Any help is much appreciated

CodePudding user response:

Per the discussion in the comments, this solution works:

df_filtered.drop_duplicates(subset=['date'], keep='first', inplace=True)
  • Related