I have a weather forecasting dataset and I am interested in two columns:
period: takes the values 1, 2, 3.
temp2m: corresponds to a temperature 2 meters away from a weather station.
p1 = new_df.where(new_df.period == 1).select([c for c in new_df.columns if c in ['period', 'temp2m']])
p1.show()
This code for p1 gives the following (first 5 rows):
+------+------+
|period|temp2m|
+------+------+
|     1|    12|
|     1|    13|
|     1|    11|
|     1|    13|
|     1|    10|
+------+------+
I would like to store the results of temp2m as temp2m_p1 in the main DataFrame new_df. Similarly, I'd like to add temp2m_p2 and temp2m_p3 as well. However, I have trouble finding a solution to this problem on https://sparkbyexamples.com/pyspark/pyspark-add-new-column-to-dataframe/.
CodePudding user response:
Please always provide a toy example and expected result. Here is mine:
import pandas as pd

new_df = pd.DataFrame({'period': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'temp2m': [12, 13, 12, 20, 21, 22, 18, 18, 16]})
   period  temp2m
0       1      12
1       1      13
2       1      12
3       2      20
4       2      21
5       2      22
6       3      18
7       3      18
8       3      16
I believe you want:
# One new column per distinct period; rows from other periods become NaN.
for p in new_df['period'].unique():
    new_df[f'temp2m_p{p}'] = new_df['temp2m'].where(new_df['period'] == p)
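Series.where keeps the original value wherever the condition is True and fills NaN everywhere else, which is why each new column is only populated on its own period's rows.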
Which results in:
   period  temp2m  temp2m_p1  temp2m_p2  temp2m_p3
0       1      12       12.0        NaN        NaN
1       1      13       13.0        NaN        NaN
2       1      12       12.0        NaN        NaN
3       2      20        NaN       20.0        NaN
4       2      21        NaN       21.0        NaN
5       2      22        NaN       22.0        NaN
6       3      18        NaN        NaN       18.0
7       3      18        NaN        NaN       18.0
8       3      16        NaN        NaN       16.0
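Note that the new columns come out as float64 even though temp2m holds integers: NaN is a float, so pandas upcasts the kept values.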
EDIT after the comments:
df_transformed = pd.concat(
    (new_df[new_df['period'] == p]['temp2m']
         .rename(f'temp2m_{p}')
         .reset_index(drop=True)
     for p in new_df['period'].unique()),
    axis=1,
)
That gives:
   temp2m_1  temp2m_2  temp2m_3
0        12        20        18
1        13        21        18
2        12        22        16
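Since the new_df in the question is actually a PySpark DataFrame rather than a pandas one, here is a minimal sketch of the same idea in PySpark (untested against your data, and assuming the period values are exactly 1, 2, 3). F.when without an .otherwise() plays the role of Series.where above, leaving non-matching rows NULL:
from pyspark.sql import functions as F

# Sketch: add one column per period value; rows from other periods stay NULL,
# mirroring the NaN behaviour of Series.where in the pandas version above.
for p in [1, 2, 3]:
    new_df = new_df.withColumn(
        f'temp2m_p{p}',
        F.when(F.col('period') == p, F.col('temp2m'))
    )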