Home > Back-end >  Create a new dataframe depending on the rows and columns of an other dataframe
Create a new dataframe depending on the rows and columns of an other dataframe

Time:08-31

I have a dataframe as follows:

                       ID   val1  sum   3.0   5.0   7.0   9.0  sum_3.0  sum_5.0  sum_7.0 sum_9.0
timestamp       
2022-05-09 11:28:00    3.0  -11   100   NaN   NaN   -45   -67    80       123     0.98     345         
2022-05-09 11:28:01    3.0  -7    0.5   NaN   NaN   -9    NaN    70       45      0.23     23
2022-05-09 11:28:02    5.0  -8    50    -35   NaN   -20   NaN    12       23      0.0      14
2022-05-09 11:28:03    5.0  -7    NaN    30   NaN    NaN  NaN    67       87      0.90     13

From the above dataset, I want to create a new dataframe. In each row, the values for the column ID , val1, sum remains as it is. The columns 3.0 5.0 7.0 9.0 is checked. If there is a value present in those columns we extract that value and corresponding value from the column sum_ and this becomes a new row in the dataframe keeping the timestamp as it is.

For example: In the first row in the columns 7.0 and 9.0 there are values present which are -45 and -67 respecively. The corresponding values for these columns which is sum_7.0 and sum_9.0 is also extracted. So the two new sets (7.0, -45, 0.98) and (9.0, -67, 345) becomes two new rows and is added with the same timestamp.

                       ID   val1  sum   
timestamp       
2022-05-09 11:28:00    3.0  -11   100  
2022-05-09 11:28:00    7.0  -45   0.98
2022-05-09 11:28:00    9.0  -67   345  
2022-05-09 11:28:01    3.0  -7    0.5   
2022-05-09 11:28:01    7.0  -9    0.23
2022-05-09 11:28:02    5.0  -8    50  
2022-05-09 11:28:02    3.0  -35   12
2022-05-09 11:28:02    7.0  -20   0.0
2022-05-09 11:28:03    5.0  -7    NaN    
2022-05-09 11:28:03    3.0   30   67

Is there a way to do this?

CodePudding user response:

You could try:

result = (
    df
    .melt(value_vars=df.columns[3:7], var_name="ID", ignore_index=False)
    .rename(columns={"value": "val1"})
    .assign(sum=df.melt(value_vars=df.columns[-4:], ignore_index=False)["value"])
    .loc[lambda df: df["val1"].notna()]
)
result = pd.concat([df[["ID", "val1", "sum"]], result]).sort_index()

This is essentially

  • .melt()-ing df twice to get aligned blocks of corresponding values,
  • masking out the NaN-parts (in the first block),
  • (re)naming the columns appropriately,
  • appending the remaining rows to the relevant columns of df, and finally
  • sorting the the new dataframe result via its index to bring connected rows together.

Result for the sample:

                      ID  val1     sum
timestamp                             
2022-05-09 11:28:00  3.0 -11.0  100.00
2022-05-09 11:28:00  7.0 -45.0    0.98
2022-05-09 11:28:00  9.0 -67.0  345.00
2022-05-09 11:28:01  3.0  -7.0    0.50
2022-05-09 11:28:01  7.0  -9.0    0.23
2022-05-09 11:28:02  5.0  -8.0   50.00
2022-05-09 11:28:02  3.0 -35.0   12.00
2022-05-09 11:28:02  7.0 -20.0    0.00
2022-05-09 11:28:03  5.0  -7.0     NaN
2022-05-09 11:28:03  3.0  30.0   67.00
  • Related