Running unionByName on Spark 2.2.0

I am trying to run a unionByName command to combine two dataframes, but when I run my script, the log shows me that "DataFrame object has no attribute 'unionByName'".

df_new = old.unionByName(old2, allowMissingColumns=True)

I sense that it has to do with my Spark or Python version as union is working perfectly fine. The version is 2.2.0.cloudera1. How can I use a newer version of Spark or even use the unionByName command with my existing version?

I also see this in my log

File "/opt/cloudera/parcels/Anaconda-4.0.0/lib/python2.7/importlib/__init__.py", line 37, in import_module

So I sense that I am using Python 2.7?

Thanks!

CodePudding user response：

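unionByName is new in version 2.3, so it does not exist on Spark 2.2.0. The allowMissingColumns argument is newer still: it was added in version 3.1.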
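To check this up front rather than from a stack trace, something like the sketch below could work. The helper names `spark_version_tuple` and `supports_union_by_name` are made up for illustration and are not part of PySpark; the version thresholds are the ones from the PySpark docs (2.3 for unionByName, 3.1 for allowMissingColumns). It only looks at the major and minor numbers, so a vendor suffix such as ".cloudera1" is ignored.

```python
import re
import sys


def spark_version_tuple(version):
    """Parse a version string such as '2.2.0.cloudera1' into (major, minor)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised Spark version: %r" % version)
    return int(match.group(1)), int(match.group(2))


def supports_union_by_name(version, allow_missing_columns=False):
    """Hypothetical helper: is unionByName usable on this Spark version?

    unionByName was added in Spark 2.3; its allowMissingColumns
    argument was added in Spark 3.1.
    """
    required = (3, 1) if allow_missing_columns else (2, 3)
    return spark_version_tuple(version) >= required


# Shows which Python interpreter is running the script.
print("Python %d.%d.%d" % sys.version_info[:3])
# The version string from the question.
print(supports_union_by_name("2.2.0.cloudera1"))
```

In a real job you would pass `spark.version` (assuming your SparkSession is called `spark`) instead of the hard-coded string.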

CodePudding user response：

The difference between this function and union() is that this function resolves columns by name, not by position. So if you cannot change the Spark version, you can get the same result by selecting the columns of old2 in the same order as old and then calling union(). For example, if your DataFrames have these columns:

old ("col1","col2","col3")

old2("col3","col1","col2")

you can use something like this:

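from pyspark.sql.functions import col

# Reorder old2's columns to match old, because union() matches by position.
old3 = old2.select(col("col1"), col("col2"), col("col3"))

new_old = old.union(old3)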
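Since the question passes allowMissingColumns=True, the two DataFrames may not have the same set of columns, and a plain reorder is not enough in that case. A rough sketch of how that could be emulated on 2.2 is below. `plan_union_columns` and `union_by_name` are names I made up, not PySpark APIs. It assumes plain top-level column names (no dots, no duplicates), and when a column exists on both sides with different types it simply keeps each side's own type and leaves the result to union().

```python
def plan_union_columns(left_cols, right_cols):
    """Merged column order: all left columns, then the right-only ones."""
    merged = list(left_cols)
    merged += [c for c in right_cols if c not in left_cols]
    return merged


def union_by_name(left, right):
    """Hypothetical stand-in for unionByName(..., allowMissingColumns=True).

    Uses only select(), lit(), cast() and union(), which exist in Spark 2.2.
    """
    # Imported lazily so plan_union_columns can be tried without Spark installed.
    from pyspark.sql.functions import col, lit

    merged = plan_union_columns(left.columns, right.columns)

    # Column name -> data type, used to give the null filler columns a type.
    # The left side wins if a name appears on both sides.
    types = {f.name: f.dataType for f in right.schema.fields}
    types.update({f.name: f.dataType for f in left.schema.fields})

    def align(df):
        present = set(df.columns)
        return df.select([
            col(c) if c in present else lit(None).cast(types[c]).alias(c)
            for c in merged
        ])

    # Both sides now have the same columns in the same order,
    # so the positional union() lines them up correctly.
    return align(left).union(align(right))
```

With that in place, the line from the question would become df_new = union_by_name(old, old2).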