I'm trying to solve the following data problem. I have a dataframe of values and their label lists (this is multi-class, so each row's labels are a list).
The dataframe looks like:
      | value | labels
------------------------------------
row_1 | A     | [label1]
row_2 | B     | [label2]
row_3 | C     | [label3, label4]
row_4 | D     | [label4, label5]
I want to find all rows that have a specific label and then:
- First, concatenate its value with the next row's value: the string is placed before the next row's value.
- Second, prepend its labels to the next row's label list.
For example, if I want to do that for label2, the desired output will be:
      | value | labels
------------------------------------
row_1 | A     | [label1]
row_3 | BC    | [label2, label3, label4]
row_4 | D     | [label4, label5]
The value "B" is joined before the next row's value, and the label "label2" is added to the beginning of the next row's label list. The index values are not relevant for me.
I would greatly appreciate help with this. I tried to use merge, join, shift, and cumsum, but without success so far.
The following code creates the data in the example:
import pandas as pd

data = {'row_1': ["A", ["label1"]], 'row_2': ["B", ["label2"]],
        'row_3': ["C", ["label3", "label4"]], 'row_4': ["D", ["label4", "label5"]]}
df = pd.DataFrame.from_dict(data, orient='index').rename(columns={0: "value", 1: "labels"})
CodePudding user response:
You could create a grouping variable that puts each matching row in the same group as the row after it, and then use that to aggregate the columns:
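import pandas as pd
import numpy as np

def my_combine(data, value):
    # True for every row whose label list contains `value`
    index = data['labels'].apply(lambda x: value in x)
    if not index.any():
        return data
    # mark each matching row together with the row that follows it
    idx = (index | index.shift(fill_value=False)).to_numpy()
    # unmarked rows get a unique non-zero id; marked rows get 0
    vals = (np.arange(idx.size) + 1) * (~idx)
    # last position of each run of equal ids, used as the group key
    gr = np.r_[np.where(vals[1:] != vals[:-1])[0], vals.size - 1]
    groups = np.repeat(gr, np.diff(np.r_[-1, gr]))
    # summing strings concatenates them; summing lists joins them
    return data.groupby(groups).agg('sum')

my_combine(df, 'label2')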
  value                    labels
0     A                  [label1]
2    BC  [label2, label3, label4]
3     D          [label4, label5]
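
If the NumPy grouping is hard to follow, the same rule from the question (carry a matching row's value and labels forward into the next row) can also be sketched as a plain Python loop. This is only an illustration, not part of the approach above: `merge_into_next` is a hypothetical helper name, and the sketch assumes that a matching row at the very end, which has no next row, is simply kept as it is. As far as I can tell it also treats two back-to-back pairs (matching row, plain row, matching row, plain row) as two separate merges.

```python
def merge_into_next(values, labels, target):
    """Merge every row whose labels contain `target` into the row after it.

    `values` is an iterable of strings and `labels` an iterable of label lists.
    Returns a list of (value, labels) tuples.
    """
    merged = []
    pending = None  # (value, labels) carried forward from matching rows
    for value, row_labels in zip(values, labels):
        # test the row's own labels, not the labels it inherited
        has_target = target in row_labels
        row_labels = list(row_labels)
        if pending is not None:
            # carried value and labels go in front of the current row's
            value = pending[0] + value
            row_labels = pending[1] + row_labels
        if has_target:
            pending = (value, row_labels)
        else:
            merged.append((value, row_labels))
            pending = None
    if pending is not None:
        # assumption: a trailing matching row has nothing to merge into
        merged.append(pending)
    return merged


rows = merge_into_next(
    ["A", "B", "C", "D"],
    [["label1"], ["label2"], ["label3", "label4"], ["label4", "label5"]],
    "label2",
)
for value, row_labels in rows:
    print(value, row_labels)
```

With the dataframe from the question, the result could be turned back into a frame with `pd.DataFrame(merge_into_next(df["value"], df["labels"], "label2"), columns=["value", "labels"])`.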