Home > Blockchain >  Why did Spark interchange values of two columns?
Why did Spark interchange values of two columns?

Time:07-06

Please can someone explain why spark interchanges the values of two columns when querying a DataFrame?

The values of ProposedAction are returned for SimpleMatchRate vise versa.

Here is the code sample:

import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType as ST, StructField as SF, StringType as STR

spark = (SparkSession.builder
    .master("local")
    .appName("Fuzzy")
    .config("spark.jars", "../jars/mysql-connector-java-8.0.29.jar")
    .config("spark.driver.extraClassPath", "../jars/mysql-connector-java-8.0.29.jar")
    .getOrCreate())

customschema = ST([
  SF("Matched", STR()),
  SF("MatchRate", STR()),
  SF("ProposedAction", STR()), # e.g. is_new
  SF("SimpleMatchRate", STR()), # e.g. 76.99800
  SF("Status", STR())])

files = [file for file in glob.glob('../source_files/*fuzzy*')]
df = spark.read.csv(files, sep="\t", header="true", encoding="UTF-8", schema=customschema)
df.printSchema()
root
 |-- Matched: string (nullable = true)
 |-- MatchRate: string (nullable = true)
 |-- ProposedAction: string (nullable = true)
 |-- SimpleMatchRate: string (nullable = true)
 |-- Status: string (nullable = true)

Now if I try to query the df as a table:

df.createOrReplaceTempView("tmp_table")

spark.sql("""SELECT MatchRate, ProposedAction, SimpleMatchRate
          FROM tmp_table  LIMIT 5""").show()

I get:

 ----------- ---------------- ----------------- 
| MatchRate | ProposedAction | SimpleMatchRate |
 ----------- ---------------- ----------------- 
|  0.043169 |       0.000000 |          is_new |
|  88.67153 |       98.96907 |       is_linked |
|  89.50349 |       98.94736 |       is_linked |
|  99.44025 |      100.00000 |         is_dupe |
|  90.78082 |       98.92473 |       is_linked |
 ----------- ---------------- ----------------- 

CodePudding user response:

I found what I was doing wrong. My schema definition did not correctly follow the order of columns in the input file. ProposedAction comes after SimpleMatchRate like below:

. . .Matched MatchRate   SimpleMatchRate ProposedAction  status

I modified the definition to the below and the issue is fixed:

customschema = ST([
  SF("Matched", STR()),
  SF("MatchRate", STR()),
  SF("SimpleMatchRate", STR()),
  SF("ProposedAction", STR()), # Now in the correct position as in the input file
  SF("Status", STR())])
  • Related