Pyspark code error: Invalid argument, not a string or column-CodePudding

I am running this code and I'd like to get only some columns back, not all columns of all tables that participated in the join.

df_final = df.join(df1,(df['sbr_brand']==df1['sbr_brand'])\
               &(df['sbr_number']==df1['sbr_number'])\
               &(df['calendar_date']==df1['calendar_date'])\
               &(df['check_number']==df1['check_number']))\
         .join(df2,(df['sbr_brand']==df2['brand'])\
               &(df['sbr_number']==df2['store_number'])\
               &(df['calendar_date']==df2['date_of_business'])\
               &(df['check_number']==df2['check_number']),'inner')\
         .select(df['modifier_gross_amount'],df1['check_line_number','item_barcode','dining_option','item_quantity','item_gross_amount','item_net_amount'],df2['brand_id'])

I have an error:

Invalid argument, not a string or column: DataFrame[check_line_number: bigint, item_barcode: string, dining_option: string, item_quantity: double, item_gross_amount: decimal(38,6), item_net_amount: decimal(38,6)] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

I removed the select statement from the bottom and the code ran perfectly. I then ran the command below and it showed me all columns of all 3 dataframes.

display(df_final)

I also ran a separate command to see if it makes a difference:

df_final2 = df_final.select(df['modifier_gross_amount'],df1['check_line_number','item_barcode','dining_option','item_quantity','item_gross_amount','item_net_amount'],df2['brand_id'])

But I was given the same error. Not sure how to fix this. Please advise.

CodePudding user response：

Try using something as below -

Sample Input Dataframes

df1 = spark.createDataFrame(data=[(1,1,3),(2,1,1),(2,2,3),(1,2,3),(1,2,1)], schema=['id1', 'id2', 'value'])
df2 = spark.createDataFrame(data=[(1,1,3),(2,1,1),(2,2,3),(1,2,3),(1,2,1)], schema=['id1', 'id2', 'value'])

Output (Using Joins)

df1.join(df2, (df1["id1"] == df2["id1"]) & (df1["id2"] == df2["id2"]) & (df1["value"] == df2["value"])).select(df1["id1"], df2["id2"], df1["value"]).show()

 --- --- ----- 
|id1|id2|value|
 --- --- ----- 
|  1|  1|    3|
|  1|  2|    1|
|  1|  2|    3|
|  2|  1|    1|
|  2|  2|    3|
 --- --- -----

CodePudding user response：

df['col1'] return a Column, while df['col1', 'col2'] return a DataFrame.

The argument of select function must be a string or Column. So it should be:

df_final2 = df_final.select(df['modifier_gross_amount'],df1['check_line_number'],df1['item_barcode'],df1['dining_option'],df1['item_quantity'],df1['item_gross_amount'],df1['item_net_amount'],df2['brand_id'])