I have 5 CSV files, and only the first file contains the header. I want to read them all into a single dataframe using Spark. My code below works, but I lose 4 rows of data because header is set to true in the final read, so the first line of each of the other 4 files is dropped as if it were a header. If I set header to false, I get the 4 rows back, but the actual header from the first file also shows up as a row in my data.
Is there a more efficient way to do this so that the header doesn't show up as a row in my dataset?
header = spark.read \
.format("csv") \
.option("header", "true") \
.option("inferSchema", "true") \
.load("path/file-1")
schema = header.schema
df = spark.read \
.format("csv") \
.option("header", "true") \
.schema(schema) \
.load("path")
CodePudding user response:
Unfortunately, I don't think there is an easy way to do what you want. There is a workaround that resembles what you already did, though: read the first file to get the schema, read all the files except the first one with option("header", "false"), and then union the first file with the rest.
In Python, it would look like this:
first_file = "path/file-1"
header = spark.read.option("header", "true") \
.option("inferSchema", "true").csv(first_file)
schema = header.schema
# binaryFiles is used here only to list the files in the folder.
# Note that the files are not read.
# Any other means of listing the files in a directory would work as well.
all_files = spark.sparkContext.binaryFiles("path")\
.map(lambda x : x[0]).collect()
all_files_but_first = [f for f in all_files if not f.endswith(first_file)]
df = spark.read.option("header", "false") \
.schema(schema).csv(all_files_but_first)\
.union(header)
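As the comment in the code says, any other way of listing the files would do. Here is a minimal sketch using only the standard library, assuming the files sit on a local filesystem that the driver can see (it will not work as-is for HDFS or object storage). The helper name list_files_except is hypothetical, not part of Spark:

```python
import glob
import os


def list_files_except(directory, excluded_file):
    """Return the sorted paths of regular files in `directory`, skipping `excluded_file`."""
    excluded = os.path.abspath(excluded_file)
    # glob's "*" pattern does not match hidden files (names starting with a dot)
    paths = sorted(glob.glob(os.path.join(directory, "*")))
    return [
        p for p in paths
        if os.path.isfile(p) and os.path.abspath(p) != excluded
    ]
```

You could then replace the binaryFiles step with all_files_but_first = list_files_except("path", first_file) and pass that list to .csv() as before. Comparing absolute paths avoids relying on endswith matching the URI that binaryFiles returns.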