df = spark.read.csv('data.csv', header=True, inferSchema=True)
rule_df = spark.read.csv('job_rules.csv', header=True)
query_df = spark.read.csv('rules.csv', header=True)
join_df = rule_df.join(query_df, rule_df.Rule == query_df.Rule, "inner").drop(rule_df.Rule).show()
print(join_df.collect().columns)
Here I have created three DataFrames: df, rule_df, and query_df. I've performed an inner join on rule_df and query_df and stored the resulting DataFrame in join_df. However, when I try to simply print the columns of join_df, I get the following error:
AttributeError: 'NoneType' object has no attribute 'columns'
The resulting DataFrame is not behaving like one; I'm not able to perform any DataFrame operations on it.
I'm guessing this error occurs when you call an attribute on an object that doesn't exist, but that shouldn't be the case here, since I'm able to view the resulting join_df.
Do I need to perform a different join to avoid this error? It might be a silly mistake, but I'm stumped trying to figure out what it is. Please help!
CodePudding user response:
You are making two mistakes. First, you assign the return value of .show() to join_df. .show() only prints the DataFrame to the console and returns None, so join_df is None, which is exactly what the AttributeError is telling you. Second, you then call .collect(), which returns a plain Python list containing all of the Rows of the DataFrame, and a list has no .columns attribute. You need to call .columns directly on the DataFrame.
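A minimal sketch to illustrate both return types, assuming the same spark session and the rule_df/query_df DataFrames from your snippet:

# .show() is an action that prints rows to the console and returns None
result = rule_df.join(query_df, rule_df.Rule == query_df.Rule, "inner").drop(rule_df.Rule).show()
print(type(result))   # <class 'NoneType'>

# .collect() returns a plain Python list of Row objects, not a DataFrame
rows = rule_df.join(query_df, rule_df.Rule == query_df.Rule, "inner").drop(rule_df.Rule).collect()
print(type(rows))     # <class 'list'>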
This should work:
# Build the joined DataFrame without calling .show(), then inspect its columns
join_df = rule_df.join(query_df, rule_df.Rule == query_df.Rule, "inner").drop(rule_df.Rule)
print(join_df.columns)
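As a side note (a variant, not required for the fix): since both DataFrames name the join column Rule, you can also pass the column name as a string. Spark then performs the equi-join and keeps a single Rule column in the result, so the .drop() becomes unnecessary:

# Joining on the column name deduplicates the join column automatically
join_df = rule_df.join(query_df, on="Rule", how="inner")
print(join_df.columns)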