I am trying to use pyspark to load a text file, sort it, and then write the sorted data to a different text file.
I have a simple script that correctly reads in the data, sorts it by key, then writes it to an output file.
#!/usr/bin/python3
import sys
from pyspark.sql import SparkSession

if len(sys.argv) != 5:
    print("Usage: ./sorter.py -i [input filename] -o [output filename]")
    sys.exit(1)

input_filename = sys.argv[2]
output_filename = sys.argv[4]

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("sorter") \
    .getOrCreate()

input_rdd = spark.sparkContext.textFile(input_filename)
print("# partitions: {}".format(input_rdd.getNumPartitions()))

# key each record on its first 10 bytes, keep the full record as the value
sorted_list = input_rdd.map(lambda x: (x[:10], x[:])) \
    .sortByKey() \
    .collect()

with open(output_filename, "w") as ofile:
    for line in sorted_list:
        ofile.write(line[1] + '\n')
The output file appears to be sorted correctly. However, the input is generated with gensort and the output validated with valsort, and when I run ./valsort output_file my machine outputs

sump pump fatal error: pfunc_get_rec: partial record of 90 bytes found at end of input

I manually created a correct output file and compared it to the pyspark-generated one: vimdiff finds no difference between the files, diff thinks they are completely different, and cmp outputs

output/spark_output.txt output/correct_output.txt differ: byte 99, line 1
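For what it's worth, here is a quick byte-level check (a hypothetical snippet using the same file names as above) that dumps the bytes around the end of the first record in binary mode, so the line terminators are visible; gensort records are 100 bytes each:

# hypothetical check: compare the bytes around the end of the first 100-byte record
with open("output/spark_output.txt", "rb") as f:
    spark_head = f.read(110)
with open("output/correct_output.txt", "rb") as f:
    correct_head = f.read(110)
print(spark_head[90:110])
print(correct_head[90:110])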
CodePudding user response:
Gensort appends a carriage return (a.k.a. \r or ^M) before the newline at the end of every record, and these are stripped when the input file is read in, because textFile treats \r\n as a line terminator and does not include it in the record. The problem is solved by writing the carriage return back out:

ofile.write(line[1] + '\r\n')
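Put back into the question's write loop, the fix looks like this (same variable names as above):

with open(output_filename, "w") as ofile:
    for line in sorted_list:
        # restore the \r\n record terminator that was stripped on read
        ofile.write(line[1] + '\r\n')

If this ever runs on a platform that translates newlines (e.g. Windows), opening the file with open(output_filename, "w", newline='') or in binary mode avoids '\n' being expanded into '\r\n' a second time.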