My question: There are two dataframes; the info one is still being built. What I want to do is filter the reference dataframe based on a condition: when key is b, take its value (2) and apply it to the info table as a whole new column. The output dataframe below is the final result I want.
Dataframe (info)
+----+-----+
| key|value|
+----+-----+
|   a|   10|
|   b|   20|
|   c|   50|
|   d|   40|
+----+-----+
Dataframe (Reference)
+----+-----+
| key|value|
+----+-----+
|   a|   42|
|   b|    2|
|   c|    9|
|   d|  100|
+----+-----+
Below is the output I want: Dataframe (Output)
+----+-----+-----+
| key|value|const|
+----+-----+-----+
|   a|   10|    2|
|   b|   20|    2|
|   c|   50|    2|
|   d|   40|    2|
+----+-----+-----+
I have tried several methods, and below is the latest one, but the system warns me that PySpark does not have a loc function.
df_cal = (
    info
    .join(reference)
    .withColumn('const', reference.loc[reference['key'] == 'b', 'value'].iloc[0])
    .select('key', 'result', 'const')
)
df_cal.show()
And below is the error raised by the system:
AttributeError: 'DataFrame' object has no attribute 'loc'
CodePudding user response:
This solves it:
from pyspark.sql.functions import lit

target = 'b'
# collect the reference rows (df2 in the original answer = reference, df1 = info)
# and keep only the value belonging to the target key
const = [row['value'] for row in reference.collect() if row['key'] == target]

# attach that value to every row of info as a new constant column
df_cal = info.withColumn('const', lit(const[0]))
df_cal.show()
+---+-----+-----+
|key|value|const|
+---+-----+-----+
|  a|   10|    2|
|  b|   20|    2|
|  c|   50|    2|
|  d|   40|    2|
+---+-----+-----+
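A variant that avoids collecting the whole reference DataFrame is to filter on the key and read the single matching value; a minimal sketch assuming the same info and reference DataFrames from the question:

from pyspark.sql.functions import col, lit

# keep only the reference row whose key matches, then pull its value
const_value = reference.filter(col('key') == 'b').first()['value']

# broadcast that single value to every row of info as a new constant column
df_cal = info.withColumn('const', lit(const_value))
df_cal.show()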