I am running this code and I'd like to get only some columns back, not all columns of all tables that participated in the join.
df_final = df.join(df1,(df['sbr_brand']==df1['sbr_brand'])\
&(df['sbr_number']==df1['sbr_number'])\
&(df['calendar_date']==df1['calendar_date'])\
&(df['check_number']==df1['check_number']))\
.join(df2,(df['sbr_brand']==df2['brand'])\
&(df['sbr_number']==df2['store_number'])\
&(df['calendar_date']==df2['date_of_business'])\
&(df['check_number']==df2['check_number']),'inner')\
.select(df['modifier_gross_amount'],df1['check_line_number','item_barcode','dining_option','item_quantity','item_gross_amount','item_net_amount'],df2['brand_id'])
I have an error:
Invalid argument, not a string or column: DataFrame[check_line_number: bigint, item_barcode: string, dining_option: string, item_quantity: double, item_gross_amount: decimal(38,6), item_net_amount: decimal(38,6)] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
I removed the select statement from the bottom and the code ran perfectly. I then ran the command below and it showed me all columns of all 3 dataframes.
display(df_final)
I also ran a separate command to see if it makes a difference:
df_final2 = df_final.select(df['modifier_gross_amount'],df1['check_line_number','item_barcode','dining_option','item_quantity','item_gross_amount','item_net_amount'],df2['brand_id'])
But I was given the same error. Not sure how to fix this. Please advise.
CodePudding user response:
Try using something as below -
Sample Input Dataframes
df1 = spark.createDataFrame(data=[(1,1,3),(2,1,1),(2,2,3),(1,2,3),(1,2,1)], schema=['id1', 'id2', 'value'])
df2 = spark.createDataFrame(data=[(1,1,3),(2,1,1),(2,2,3),(1,2,3),(1,2,1)], schema=['id1', 'id2', 'value'])
Output (Using Joins)
df1.join(df2, (df1["id1"] == df2["id1"]) & (df1["id2"] == df2["id2"]) & (df1["value"] == df2["value"])).select(df1["id1"], df2["id2"], df1["value"]).show()
--- --- -----
|id1|id2|value|
--- --- -----
| 1| 1| 3|
| 1| 2| 1|
| 1| 2| 3|
| 2| 1| 1|
| 2| 2| 3|
--- --- -----
CodePudding user response:
df['col1']
return a Column
, while df['col1', 'col2']
return a DataFrame
.
The argument of select
function must be a string or Column. So it should be:
df_final2 = df_final.select(df['modifier_gross_amount'],df1['check_line_number'],df1['item_barcode'],df1['dining_option'],df1['item_quantity'],df1['item_gross_amount'],df1['item_net_amount'],df2['brand_id'])