Add a column to a DataFrame after selecting rows based on column values

Time:05-05

I have a weather forecasting dataset and I am interested in the columns:

  • period (values: 1,2,3)
  • temp2m: corresponds to a temperature 2 meters away from a weather station.

p1 = new_df.where(new_df.period == 1).select('period', 'temp2m')
p1.show()

This code for p1 gives the following (first 5):

+------+------+
|period|temp2m|
+------+------+
|     0|    12|
|     0|    13|
|     0|    11|
|     0|    13|
|     0|    10|
+------+------+

I would like to store the results of temp2m as temp2m_p1 in the main DataFrame new_df. Similarly, I'd like to add temp2m_p2 and temp2m_p3 as well. However, I have trouble finding a solution to this problem on https://sparkbyexamples.com/pyspark/pyspark-add-new-column-to-dataframe/.

CodePudding user response:

Please always provide a toy example and expected result. Here is mine:

import pandas as pd
new_df = pd.DataFrame({'period': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'temp2m': [12, 13, 12, 20, 21, 22, 18, 18, 16]})
   period  temp2m
0       1      12
1       1      13
2       1      12
3       2      20
4       2      21
5       2      22
6       3      18
7       3      18
8       3      16

I believe you want:

for p in new_df['period'].unique():
    new_df[f'temp2m_p{p}'] = new_df['temp2m'].where(new_df['period'] == p)

Which results in:

   period  temp2m  temp2m_p1  temp2m_p2  temp2m_p3
0       1      12       12.0        NaN        NaN
1       1      13       13.0        NaN        NaN
2       1      12       12.0        NaN        NaN
3       2      20        NaN       20.0        NaN
4       2      21        NaN       21.0        NaN
5       2      22        NaN       22.0        NaN
6       3      18        NaN        NaN       18.0
7       3      18        NaN        NaN       18.0
8       3      16        NaN        NaN       16.0
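
If you would rather avoid the explicit loop, the same columns can probably be built in one go with pivot. This is only a sketch on the toy data above; the names wide and result are my own, and it assumes the row index is unique so that pivot can align on it:

```python
import pandas as pd

new_df = pd.DataFrame({'period': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'temp2m': [12, 13, 12, 20, 21, 22, 18, 18, 16]})

# One column per period value, aligned on the original row index;
# rows that belong to a different period are NaN.
wide = new_df.pivot(columns='period', values='temp2m').add_prefix('temp2m_p')

# Attach the new columns to the original frame.
result = new_df.join(wide)
print(result)
```

As with the loop, the new columns become float because NaN is used for the rows of the other periods.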

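EDIT after the comments: if you want one compact column per period instead, with the values lined up by their position within each period: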

df_transformed = pd.concat(
    (new_df[new_df['period'] == p]['temp2m']
         .rename(f'temp2m_{p}')
         .reset_index(drop=True)
     for p in new_df['period'].unique()),
    axis=1)

That gives:

   temp2m_1  temp2m_2  temp2m_3
0        12        20        18
1        13        21        18
2        12        22        16
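
Note that the concat version lines the values up by their position within each period, so a period with fewer rows than the others would be padded with NaN. A roughly equivalent sketch with groupby().cumcount() and pivot, in case it reads better to you (the helper column name row is hypothetical, not something from the question):

```python
import pandas as pd

new_df = pd.DataFrame({'period': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'temp2m': [12, 13, 12, 20, 21, 22, 18, 18, 16]})

# Number the rows within each period: 0, 1, 2, ...
row = new_df.groupby('period').cumcount()

# Use that position as the new index and spread the periods into columns.
df_transformed = (new_df.assign(row=row)
                        .pivot(index='row', columns='period', values='temp2m')
                        .add_prefix('temp2m_'))
print(df_transformed)
```

The index of the result is the within-period position rather than the original row label, which matches what reset_index(drop=True) does in the concat version.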