Consolidation of consecutive rows by condition with Python Pandas

Time:07-29

I am trying to solve the following data problem. I have a dataframe of values and their label lists (the problem is multi-class, so each row's labels are stored as a list).

The dataframe looks like:

     | value| labels
---------------------
row_1| A    |[label1]
row_2| B    |[label2]
row_3| C    |[label3, label4]
row_4| D    |[label4, label5]

I want to find all rows that have a specific label and then:

  1. First, concatenate its value with the next row's value, so that its string comes before the next row's string.
  2. Second, prepend its labels to the label list of the next row.

For example, if I want to do that for label2, the desired output will be:

     | value| labels
---------------------
row_1| A    |[label1]
row_3| BC   |[label2, label3, label4]
row_4| D    |[label4, label5]

The value "B" is joined before the next row's value, and the label "label2" is prepended to the next row's label list. The indexes are not relevant for me.

I would greatly appreciate help with this. I tried to use merge, join, shift, and cumsum, but without success so far.


The following code creates the data in the example:

import pandas as pd

data = {'row_1': ["A", ["label1"]], 'row_2': ["B", ["label2"]],
        'row_3': ["C", ["label3", "label4"]], 'row_4': ["D", ["label4", "label5"]]}
df = pd.DataFrame.from_dict(data, orient='index').rename(columns={0: "value", 1: "labels"})

CodePudding user response:

You could create a grouping variable and use it to aggregate the columns:

import pandas as pd
import numpy as np

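def my_combine(data, value):
    # Boolean mask: True where the row's label list contains `value`
    index = data['labels'].apply(lambda x: np.isin(value, x))
    if all(~index):
        return data  # no row carries the label, nothing to merge
    # Mark each matching row together with the row that follows it
    idx = (index | index.shift()).to_numpy()
    # Unmarked rows get a unique positive number; marked rows get 0
    vals = (np.arange(idx.size) + 1) * (~idx)
    # Last position of each run of equal values, used as the group key
    gr = np.r_[np.where(vals[1:] != vals[:-1])[0], vals.size - 1]
    groups = np.repeat(gr, np.diff(np.r_[-1, gr]))
    # Summing strings and lists concatenates them within each group
    return data.groupby(groups).agg(sum)

my_combine(df, 'label2')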

  value                    labels
0     A                  [label1]
2    BC  [label2, label3, label4]
3     D          [label4, label5]
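
Since the question mentions shift and cumsum, here is a minimal sketch of how those two could build the same kind of grouping variable. It is a different technique from the function above, not a rewrite of it. The name combine_with_next and its parameter names are hypothetical, and the sketch assumes every entry in the labels column is a list. The idea: a row starts a new group unless the row directly before it carries the label, so a cumulative sum over the "starts a new group" flags gives one id per group.

```python
import pandas as pd

def combine_with_next(df, label):
    # True for rows whose label list contains the target label
    mask = df["labels"].apply(lambda labels: label in labels).astype(bool)
    # A row starts a new group unless the previous row carries the label
    starts_group = ~mask.shift(1, fill_value=False).astype(bool)
    # The running count of group starts gives one id per group
    groups = starts_group.cumsum().to_numpy()
    return (
        df.groupby(groups)
          .agg(
              value=("value", "".join),  # concatenate the strings in row order
              labels=("labels", lambda s: [x for labels in s for x in labels]),
          )
          .reset_index(drop=True)  # the question says indexes are not relevant
    )

df = pd.DataFrame({
    "value": ["A", "B", "C", "D"],
    "labels": [["label1"], ["label2"], ["label3", "label4"], ["label4", "label5"]],
})
result = combine_with_next(df, "label2")
print(result)
```

With this grouping, a matching row is merged only into the row that follows it, so two separate matches with one row between them stay in separate groups. A matching row in the last position has no next row and is left unchanged.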