read and join several parquet files pyspark


I have several parquet files that I would like to read and join (consolidating them into a single file), but I am using a classic solution which I think is not the best one.

Every file has two id variables used for the join and one variable whose name differs in every parquet file, so the goal is to have all those variables in the same parquet.

  • input: several tables (n tables == n variables)
  • final output: 1 table (id + n variables)


Solution I am using

file_path_list = ["file1.parquet", "file2.parquet", "file3.parquet"]

# read the first file
df = spark.read.format('parquet').load(file_path_list[0])

for file_path in file_path_list[1:]:
    # read every remaining file
    df_temp = spark.read.format('parquet').load(file_path)

    # join it to the accumulated dataframe on the two id columns
    df = df.join(df_temp,
                 on=["id_foo", "id_bar"],
                 how="left")
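A slightly more compact equivalent of this loop (a sketch, assuming the same file list, join keys, and an existing spark session) chains the joins with functools.reduce:

from functools import reduce

file_path_list = ["file1.parquet", "file2.parquet", "file3.parquet"]

# read every file, then left-join them pairwise on the two id columns
dfs = [spark.read.parquet(p) for p in file_path_list]
df = reduce(lambda left, right: left.join(right, on=["id_foo", "id_bar"], how="left"),
            dfs)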

What I tried, unsuccessfully, as it gives only the last dataframe in the list (as if only "file3.parquet" had been read):

file_path_list = ["file1.parquet",
                  "file2.parquet",
                  "file3.parquet"]

df = spark.read.parquet(*file_path_list)

CodePudding user response:

For your unsuccessful attempt, you need the mergeSchema option to read multiple parquet files with different schemas.

file_path_list = ["file1.parquet",
                  "file2.parquet",
                  "file3.parquet"]

df = spark.read.parquet(*file_path_list, mergeSchema=True)

This will give you a result like

 ------ ------ ----- ----- ...
|id_foo|id_bar|var_1|var_2|...
 ------ ------ ----- ----- ...
|     1|     1|  100| null|...
|     2|     2|  200| null|...
|     3|     3|  300| null|...
|     4|     4|  400| null|...
|     1|     1| null|  100|...
|     2|     2| null|  200|...
...
 ------ ------ ----- ----- ...

To suppress these nulls, you can do

from pyspark.sql import functions as F

(df.groupby(['id_foo', 'id_bar'])
 .agg(*[F.first(x, ignorenulls=True).alias(x)
        for x in df.columns if x.startswith('var')]))

Although if you can change the schema, I would suggest having the same schema for all parquet files, which is easier to handle IMO. With the same schema, you can still use pivot to achieve your desired output, as sketched below.
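A minimal sketch of that pivot approach, assuming each file were instead written in a long format with hypothetical columns id_foo, id_bar, var_name and value:

from functools import reduce
from pyspark.sql import functions as F

# hypothetical long-format files: every file has columns id_foo, id_bar, var_name, value
dfs = [spark.read.parquet(p) for p in file_path_list]
df_long = reduce(lambda a, b: a.unionByName(b), dfs)

# one row per (id_foo, id_bar), one column per distinct var_name
df_wide = (df_long
           .groupBy('id_foo', 'id_bar')
           .pivot('var_name')
           .agg(F.first('value', ignorenulls=True)))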
