Apply value to a dataframe after filtering another dataframe based on conditions


My question: I have two dataframes; the info one is still being built. I want to filter the reference dataframe on a condition: when the key is b, take its value (2) and apply it to the info table as a whole new column. The Output dataframe below is the result I want.

Dataframe (info)

+-----+-----+
|  key|value|
+-----+-----+
|    a|   10|
|    b|   20|
|    c|   50|
|    d|   40|
+-----+-----+

Dataframe (reference)

+-----+-----+
|  key|value|
+-----+-----+
|    a|   42|
|    b|    2|
|    c|    9|
|    d|  100|
+-----+-----+

Below is the output I want: Dataframe (output)

+-----+-----+-----+
|  key|value|const|
+-----+-----+-----+
|    a|   10|    2|
|    b|   20|    2|
|    c|   50|    2|
|    d|   40|    2|
+-----+-----+-----+

I have tried several methods; below is the latest one, but the system warns me that PySpark does not have a loc function.

df_cal = (
    info
    .join(reference)
    .withColumn('const', reference.loc[reference['key']=='b', 'value'].iloc[0])
    .select('key', 'result', 'const')
)

df_cal.show()

And below is the warning the system gave me:

AttributeError: 'DataFrame' object has no attribute 'loc'

CodePudding user response:

This solves it:

from pyspark.sql.functions import lit

target = 'b'
# Collect the reference rows to the driver and keep the value(s) for the target key
const = [row['value'] for row in reference.collect() if row['key'] == target]
# Attach that value to every row of info as a literal column
df_cal = info.withColumn('const', lit(const[0]))
df_cal.show()

+---+-----+-----+
|key|value|const|
+---+-----+-----+
|  a|   10|    2|
|  b|   20|    2|
|  c|   50|    2|
|  d|   40|    2|
+---+-----+-----+