I have a dataframe as follows:
ID val1 sum 3.0 5.0 7.0 9.0 sum_3.0 sum_5.0 sum_7.0 sum_9.0
timestamp
2022-05-09 11:28:00 3.0 -11 100 NaN NaN -45 -67 80 123 0.98 345
2022-05-09 11:28:01 3.0 -7 0.5 NaN NaN -9 NaN 70 45 0.23 23
2022-05-09 11:28:02 5.0 -8 50 -35 NaN -20 NaN 12 23 0.0 14
2022-05-09 11:28:03 5.0 -7 NaN 30 NaN NaN NaN 67 87 0.90 13
From the above dataset, I want to create a new dataframe. In each row, the values for the column ID
, val1
, sum
remains as it is. The columns 3.0
5.0
7.0
9.0
is checked. If there is a value present in those columns we extract that value and corresponding value from the column sum_
and this becomes a new row in the dataframe keeping the timestamp as it is.
For example: In the first row in the columns 7.0 and 9.0 there are values present which are -45 and -67 respecively. The corresponding values for these columns which is sum_7.0 and sum_9.0 is also extracted. So the two new sets (7.0, -45, 0.98) and (9.0, -67, 345) becomes two new rows and is added with the same timestamp.
ID val1 sum
timestamp
2022-05-09 11:28:00 3.0 -11 100
2022-05-09 11:28:00 7.0 -45 0.98
2022-05-09 11:28:00 9.0 -67 345
2022-05-09 11:28:01 3.0 -7 0.5
2022-05-09 11:28:01 7.0 -9 0.23
2022-05-09 11:28:02 5.0 -8 50
2022-05-09 11:28:02 3.0 -35 12
2022-05-09 11:28:02 7.0 -20 0.0
2022-05-09 11:28:03 5.0 -7 NaN
2022-05-09 11:28:03 3.0 30 67
Is there a way to do this?
CodePudding user response:
You could try:
result = (
df
.melt(value_vars=df.columns[3:7], var_name="ID", ignore_index=False)
.rename(columns={"value": "val1"})
.assign(sum=df.melt(value_vars=df.columns[-4:], ignore_index=False)["value"])
.loc[lambda df: df["val1"].notna()]
)
result = pd.concat([df[["ID", "val1", "sum"]], result]).sort_index()
This is essentially
.melt()
-ingdf
twice to get aligned blocks of corresponding values,- masking out the
NaN
-parts (in the first block), - (re)naming the columns appropriately,
- appending the remaining rows to the relevant columns of
df
, and finally - sorting the the new dataframe
result
via its index to bring connected rows together.
Result for the sample:
ID val1 sum
timestamp
2022-05-09 11:28:00 3.0 -11.0 100.00
2022-05-09 11:28:00 7.0 -45.0 0.98
2022-05-09 11:28:00 9.0 -67.0 345.00
2022-05-09 11:28:01 3.0 -7.0 0.50
2022-05-09 11:28:01 7.0 -9.0 0.23
2022-05-09 11:28:02 5.0 -8.0 50.00
2022-05-09 11:28:02 3.0 -35.0 12.00
2022-05-09 11:28:02 7.0 -20.0 0.00
2022-05-09 11:28:03 5.0 -7.0 NaN
2022-05-09 11:28:03 3.0 30.0 67.00