Pandas, get first and last column index for row value-CodePudding

I have the following dataframe:

columns = pd.date_range(start="2022-05-21", end="2022-06-30")
data = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
]
df = pd.DataFrame(data, columns=columns)
 2022-05-21  2022-05-22  2022-05-23  ...  2022-06-28  2022-06-29  2022-06-30
0           0           0           0  ...           5           5           5
1           5           5           5  ...           1           1           1
2           5           5           5  ...           5           5           5

I have to take the first and last column index for every distinct value in the order they are. The correct output for this dataframe will be:

[
    [
        {'value': 0, 'start': '2022-05-21', 'end': '2022-05-31'}, 
        {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'}, 
        {'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}
    ],
    [
        {'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'},
        {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
        {'value': 1, 'start': '2022-06-20', 'end': '2022-06-30'}
    ],
    [
        {'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'},
        {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'}, 
        {'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}
    ]
]

My best approach for the moment is:

series_set = df.apply(frozenset, axis=1)
container = []
for index in range(len(df.index)):
    row = df.iloc[[index]]
    values = series_set.iloc[[index]]

    inner_container = []
    for value in values[index]:
        single_value_series = row[row.columns[row.isin([value]).all()]]
        dates = single_value_series.columns
        result = dict(value=value, start=dates[0].strftime("%Y-%m-%d"), end=dates[-1].strftime("%Y-%m-%d"))
        inner_container.append(result)

    container.append(inner_container)

The result is:

[
    [
        {'value': 0, 'start': '2022-05-21', 'end': '2022-05-31'}, 
        {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'}, 
        {'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}
    ],
    [
        {'value': 1, 'start': '2022-06-20', 'end': '2022-06-30'}, 
        {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'}, 
        {'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'}
    ],
    [
        {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'}, 
        {'value': 5, 'start': '2022-05-21', 'end': '2022-06-30'}
    ]
]

It has several problems, only the first array is correct :) When I convert dataframe to frozenset it is sorted and order is changed and also if some value appears more than once it is removed.

I will appreciate any idea and guidance. What I want to avoid is iterating the dataframe.

Thank you!

CodePudding user response：

You can first transpose DataFrame by DataFrame.T and then aggregate minimal and maximal index with convert values to strings by Series.dt.strftime, last convert to dictionaries by DataFrame.to_dict.

For get consecutive groups is compared shifted values with Series.cumsum.

df1 = df.T.reset_index()
L = [df1.groupby(df1[x].ne(df1[x].shift()).cumsum())
        .agg(value=(x, 'first'),
             start=('index', 'min'),
             end=('index', 'max'))
        .assign(start=lambda x: x['start'].dt.strftime('%Y-%m-%d'),
                end=lambda x: x['end'].dt.strftime('%Y-%m-%d'))
        .to_dict(orient='records') for x in df1.columns.drop('index')]
print (L)
[[{'value': 0, 'start': '2022-05-21', 'end': '2022-05-31'}, 
  {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'}, 
  {'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}],
 [{'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'}, 
  {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
  {'value': 1, 'start': '2022-06-20', 'end': '2022-06-30'}],
 [{'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'}, 
  {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'}, 
  {'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}]]