pandas drop_duplicates on blocks of duplicates


I am trying to find an elegant solution to turn a complete time series of slowly moving data into a compressed form that I can then use with a pd.merge_asof.

For example, given the data frame below:

2023-01-01  A
2023-01-02  B
2023-01-03  B
2023-01-04  B
2023-01-05  C
2023-01-06  C
2023-01-07  A
2023-01-08  B

What I want to achieve is to keep the first record of a duplicate block, i.e.

2023-01-01  A
2023-01-02  B
2023-01-05  C
2023-01-07  A
2023-01-08  B

What comes to mind is df.sort_index().drop_duplicates(keep='first'), but that deduplicates globally, so it doesn't keep A and B when they show up again later.

Many thanks for your help :)
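For anyone who wants to reproduce this, a minimal sketch of the example frame (the column name value and the daily DatetimeIndex are assumptions):

import pandas as pd

# Build the sample series: one slowly moving value per day.
df = pd.DataFrame(
    {"value": ["A", "B", "B", "B", "C", "C", "A", "B"]},
    index=pd.date_range("2023-01-01", periods=8, freq="D"),
)
print(df)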

CodePudding user response:

Assuming your 2nd column is called name (for example), a short approach using the pandas.DataFrame.shift function to capture the first value of each consecutive group (walking down):

df[df['name'] != df['name'].shift(1)]

           name
2023-01-01    A
2023-01-02    B
2023-01-05    C
2023-01-07    A
2023-01-08    B
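For completeness, a sketch of how the compressed frame could then feed pd.merge_asof, which was the original goal (the name column and the query timestamps are assumptions for illustration):

import pandas as pd

# Slowly moving series, already compressed to the first row of each block.
compressed = pd.DataFrame(
    {"date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-05",
                             "2023-01-07", "2023-01-08"]),
     "name": ["A", "B", "C", "A", "B"]}
)

# Arbitrary query timestamps to look up against the compressed series.
queries = pd.DataFrame(
    {"date": pd.to_datetime(["2023-01-03 12:00", "2023-01-06 08:00"])}
)

# merge_asof takes the last compressed row at or before each query timestamp.
result = pd.merge_asof(queries, compressed, on="date")
print(result)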

CodePudding user response:

You can use the shift function to compare each row with the previous one and flag changes in the value column. Then use boolean indexing to select the rows where a change occurred.

df["date"] = pd.to_datetime(df["date"])
df["change"] = df["value"] != df["value"].shift()
df = df[df["change"]]

df =df.drop(columns=["change"])

Output:

        date value
0 2023-01-01     A
1 2023-01-02     B
4 2023-01-05     C
6 2023-01-07     A
7 2023-01-08     B
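If you ever need the block labels themselves, an equivalent sketch (assuming the original df with date/value columns, before any filtering) is to cumsum the change mask and take the first row of every block:

import pandas as pd

# Label each run of consecutive equal values, then keep the first row of every block.
block_id = (df["value"] != df["value"].shift()).cumsum()
first_of_block = df.groupby(block_id).head(1)
print(first_of_block)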