bytes missing after pyspark writes to file


I am trying to use PySpark to load a text file, sort it, and then write the sorted data to a different text file.

I have a simple script that correctly reads in the data, sorts it by key, and then writes it to an output file:

#!/usr/bin/python3
import sys
from pyspark.sql import SparkSession

if len(sys.argv) != 5:
    print("Usage: ./sorter.py -i [input filename] -o [output filename]")
    sys.exit(1)

# argv layout: ./sorter.py -i <input> -o <output>
input_filename = sys.argv[2]
output_filename = sys.argv[4]

spark = SparkSession.builder \
                    .master("local[*]") \
                    .appName("sorter") \
                    .getOrCreate()

input_rdd = spark.sparkContext.textFile(input_filename)

print("# partitions: {}".format(input_rdd.getNumPartitions()))

# key each record on its first 10 bytes; keep the whole record as the value
sorted_list = input_rdd.map(lambda x: (x[:10], x)) \
                       .sortByKey() \
                       .collect()

with open(output_filename, "w") as ofile:
    for line in sorted_list:
        ofile.write(line[1] + '\n')

The output file appears to be sorted correctly. However, the input was generated with gensort and the output is validated with valsort, and when I run ./valsort output_file my machine outputs

sump pump fatal error: pfunc_get_rec: partial record of 90 bytes found at end of input
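
A quick way to see the missing bytes directly is to measure one record of the written file. This is a minimal sketch, assuming gensort's standard fixed-length records (100 bytes each, terminator included) and the output path used below:

# read the first record of the written file, terminator included
with open("output/spark_output.txt", "rb") as f:
    first_record = f.readline()

# a gensort record should be 100 bytes; any shortfall means bytes were dropped
print(len(first_record))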

I manually created a correct output file and compared it to the pyspark-generated output file: vimdiff finds no difference, diff thinks the files are completely different, and cmp outputs

output/spark_output.txt output/correct_output.txt differ: byte 99, line 1
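
To pin down exactly which byte disagrees, a short byte-level comparison can stand in for cmp. A sketch, with the file names taken from the cmp output above:

# report the first position (1-based, like cmp) where the two files diverge
with open("output/spark_output.txt", "rb") as f1, \
     open("output/correct_output.txt", "rb") as f2:
    a, b = f1.read(), f2.read()

for i, (x, y) in enumerate(zip(a, b), start=1):
    if x != y:
        print("differ at byte {}: {!r} vs {!r}".format(i, bytes([x]), bytes([y])))
        break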

CodePudding user response:

Gensort terminates every record with a carriage return plus line feed (\r\n, the \r showing up as ^M in editors), and textFile strips the \r together with the \n when it splits the input into lines. The problem is solved by writing the terminator back out in full:

ofile.write(line[1] + '\r\n')
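
An equivalent fix, sketched under the same assumptions, is to open the output file in binary mode so that no platform newline translation can interfere, and to append the CRLF terminator explicitly:

# binary mode writes bytes to disk exactly as given, with no newline translation
with open(output_filename, "wb") as ofile:
    for line in sorted_list:
        # restore the \r\n terminator that textFile stripped on read
        ofile.write((line[1] + '\r\n').encode('ascii'))

On Linux the text-mode version above behaves the same; binary mode just makes the byte count explicit and keeps the script portable to platforms where text mode translates '\n'.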