Home > Mobile >  How to put values to the new separate DataFrame columns based on the value of specific already exist
How to put values to the new separate DataFrame columns based on the value of specific already exist

Time:12-18

I have a DataFrame with the mixed data of the different sources, note, that there a portions of data obtined at the same timestamp:

 -------------------------------------- ------ ------------------- ----------------- --------------- ----------------------- 
|devicename                            |value |time               |one_type_id|another_type_id|write_time             |
 -------------------------------------- ------ ------------------- ----------------- --------------- ----------------------- 
|Real_Power_KPI                        |0.0   |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.129|
|Voltage_Sensor                        |243.93|2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.129|
|Current_Sensor                        |0.0   |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.129|
|Casing_Vibration_Sensor               |0.0   |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.369|
|Water_Temperature_Sensor              |17.0  |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.369|
|Environment_Ambient_Temperature_Sensor|17.0  |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.369|
|Pump_Vibration_Sensor                 |0.0   |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.369|
|Water_Level_Sensor                    |15.0  |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.369|
|Environment_Humidity_Sensor           |81.2  |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:36.369|
|Water_Temperature_Sensor              |17.0  |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Casing_Vibration_Sensor               |0.0   |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Pump_Vibration_Sensor                 |0.0   |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Environment_Ambient_Temperature_Sensor|17.0  |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Water_Level_Sensor                    |15.0  |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Environment_Humidity_Sensor           |81.2  |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Real_Power_KPI                        |0.0   |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Voltage_Sensor                        |245.01|2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Current_Sensor                        |0.0   |2021-03-24 07:06:35|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Real_Power_KPI                        |0.0   |2021-03-24 07:06:36|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Voltage_Sensor                        |244.31|2021-03-24 07:06:36|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |
|Current_Sensor                        |0.0   |2021-03-24 07:06:36|NP20100000       |NP20100000     |2021-03-24 07:06:37.01 |

so, what i want is to have separate columns for Real_Power_KPI, Voltage_Sensor, Current_Sensor with their corresponding values joined in one row, while having the same timestamp.

Something like

|timestamp          |Real_Power_KPI|Voltage_Sensor|Current_Sensor|
|2021-03-24 07:06:36|0.0           |244.31        |0.0           |

so how I can do this transpose operation the most optimal way?

UPD.

In 过过招's answer the Python code is proposed, below is the Scala for that:

val df = dailySensorData.filter("devicename in ('Real_Power_KPI', 'Voltage_Sensor', 'Current_Sensor')") 
    .groupBy("time", "devicename").agg(expr("sum(value) as total")) 
    .groupBy("time").pivot("devicename").agg(expr("first(total)"))
df.show(false)

CodePudding user response:

Group and summarize first, and then use pivot to turn rows into columns.

df = df.filter("devicename in ('Real_Power_KPI', 'Voltage_Sensor', 'Current_Sensor')") \
    .groupBy('time', 'devicename').agg(F.expr('sum(value) as total')) \
    .groupBy('time').pivot('devicename').agg(F.expr('first(total)'))
df.show(truncate=False)
  • Related