I have the following dataframe:
columns = pd.date_range(start="2022-05-21", end="2022-06-30")
data = [
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
]
df = pd.DataFrame(data, columns=columns)
2022-05-21 2022-05-22 2022-05-23 ... 2022-06-28 2022-06-29 2022-06-30
0 0 0 0 ... 5 5 5
1 5 5 5 ... 1 1 1
2 5 5 5 ... 5 5 5
I have to take the first and last column index for every distinct value in the order they are. The correct output for this dataframe will be:
[
[
{'value': 0, 'start': '2022-05-21', 'end': '2022-05-31'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}
],
[
{'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 1, 'start': '2022-06-20', 'end': '2022-06-30'}
],
[
{'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}
]
]
My best approach for the moment is:
series_set = df.apply(frozenset, axis=1)
container = []
for index in range(len(df.index)):
row = df.iloc[[index]]
values = series_set.iloc[[index]]
inner_container = []
for value in values[index]:
single_value_series = row[row.columns[row.isin([value]).all()]]
dates = single_value_series.columns
result = dict(value=value, start=dates[0].strftime("%Y-%m-%d"), end=dates[-1].strftime("%Y-%m-%d"))
inner_container.append(result)
container.append(inner_container)
The result is:
[
[
{'value': 0, 'start': '2022-05-21', 'end': '2022-05-31'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}
],
[
{'value': 1, 'start': '2022-06-20', 'end': '2022-06-30'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'}
],
[
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 5, 'start': '2022-05-21', 'end': '2022-06-30'}
]
]
It has several problems, only the first array is correct :) When I convert dataframe to frozenset it is sorted and order is changed and also if some value appears more than once it is removed.
I will appreciate any idea and guidance. What I want to avoid is iterating the dataframe.
Thank you!
CodePudding user response:
You can first transpose DataFrame
by DataFrame.T
and then aggregate minimal and maximal index with convert values to strings by Series.dt.strftime
, last convert to dictionaries by DataFrame.to_dict
.
For get consecutive groups is compared shifted values with Series.cumsum
.
df1 = df.T.reset_index()
L = [df1.groupby(df1[x].ne(df1[x].shift()).cumsum())
.agg(value=(x, 'first'),
start=('index', 'min'),
end=('index', 'max'))
.assign(start=lambda x: x['start'].dt.strftime('%Y-%m-%d'),
end=lambda x: x['end'].dt.strftime('%Y-%m-%d'))
.to_dict(orient='records') for x in df1.columns.drop('index')]
print (L)
[[{'value': 0, 'start': '2022-05-21', 'end': '2022-05-31'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}],
[{'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 1, 'start': '2022-06-20', 'end': '2022-06-30'}],
[{'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}]]