How can I read multiple CSV files and merge them into a single DataFrame in PySpark?


I have 4 CSV files with different columns; some of the files share column names. The details of the CSVs are:

capstone_customers.csv: [customer_id, customer_type, repeat_customer]

capstone_invoices.csv: [invoice_id,product_id, customer_id, days_until_shipped, product_line, total]

capstone_recent_customers.csv: [customer_id, customer_type]

capstone_recent_invoices.csv: [invoice_id,product_id, customer_id, days_until_shipped, product_line, total]

My code is:

df1 = spark.read.options(inferSchema='True', header='True', delimiter=',').csv("capstone_customers.csv")
df2 = spark.read.options(inferSchema='True', header='True', delimiter=',').csv("capstone_invoices.csv")
df3 = spark.read.options(inferSchema='True', header='True', delimiter=',').csv("capstone_recent_customers.csv")
df4 = spark.read.options(inferSchema='True', header='True', delimiter=',').csv("capstone_recent_invoices.csv")


from functools import reduce

def unite_dfs(df1, df2):
    return df2.union(df1)

list_of_dfs = [df1, df2, df3, df4]
united_df = reduce(unite_dfs, list_of_dfs)

but I get the error:

Union can only be performed on tables with the same number of columns, but the first table has 6 columns and the second table has 3 columns;;
'Union
:- Relation[invoice_id#234,product_id#235,customer_id#236,days_until_shipped#237,product_line#238,total#239] csv
+- Relation[customer_id#218,customer_type#219,repeat_customer#220] csv

How can I merge these into a single DataFrame, keeping only one copy of the shared column names, using PySpark?

CodePudding user response:

You can provide a list of files, or a path to the files, instead of reading them one by one. And don't forget the mergeSchema option:

files = [
   "capstone_customers.csv",
   "capstone_invoices.csv",
   "capstone_recent_customers.csv",
   "capstone_recent_invoices.csv"
]
df = spark.read.options(inferSchema='True', header='True', delimiter=',', mergeSchema='True').csv(files)

# or point the reader at a directory containing the files
df = spark.read.options(inferSchema='True', header='True', delimiter=',', mergeSchema='True').csv('/path/to/files/')
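Note that mergeSchema is documented for the Parquet and ORC sources rather than for CSV, so it may simply be ignored here. Since two of your files are customer tables (3 columns) and two are invoice tables (6 columns), another option, if you keep the four separate DataFrames, is unionByName with allowMissingColumns=True (available from Spark 3.1), which aligns columns by name and fills the missing ones with nulls. A minimal sketch, reusing df1-df4 from the question:

from functools import reduce

# Align columns by name across DataFrames with different schemas;
# columns absent from one side come back as nulls (requires Spark 3.1+).
united_df = reduce(
    lambda left, right: left.unionByName(right, allowMissingColumns=True),
    [df1, df2, df3, df4]
)
united_df.printSchema()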

CodePudding user response:

To read multiple files in Spark you can build a list of all the files you want and read them at once; you don't have to read them in order.

Here is an example of code you can use:

path = ['file1.csv', 'file2.csv']
 
df = spark.read.options(header=True).csv(path)
df.show()
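
If the files follow a naming pattern, you can also pass a glob instead of listing them explicitly. For example, assuming the question's files sit in the working directory:

# Read every file matching the pattern in one pass
# (the path pattern here is an assumption, not from the question)
df = spark.read.options(header=True, inferSchema=True).csv("capstone_*.csv")
df.show()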