Please can someone explain why spark interchanges the values of two columns when querying a DataFrame?
The values of ProposedAction
are returned for SimpleMatchRate
vise versa.
Here is the code sample:
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType as ST, StructField as SF, StringType as STR
spark = (SparkSession.builder
.master("local")
.appName("Fuzzy")
.config("spark.jars", "../jars/mysql-connector-java-8.0.29.jar")
.config("spark.driver.extraClassPath", "../jars/mysql-connector-java-8.0.29.jar")
.getOrCreate())
customschema = ST([
SF("Matched", STR()),
SF("MatchRate", STR()),
SF("ProposedAction", STR()), # e.g. is_new
SF("SimpleMatchRate", STR()), # e.g. 76.99800
SF("Status", STR())])
files = [file for file in glob.glob('../source_files/*fuzzy*')]
df = spark.read.csv(files, sep="\t", header="true", encoding="UTF-8", schema=customschema)
df.printSchema()
root
|-- Matched: string (nullable = true)
|-- MatchRate: string (nullable = true)
|-- ProposedAction: string (nullable = true)
|-- SimpleMatchRate: string (nullable = true)
|-- Status: string (nullable = true)
Now if I try to query the df as a table:
df.createOrReplaceTempView("tmp_table")
spark.sql("""SELECT MatchRate, ProposedAction, SimpleMatchRate
FROM tmp_table LIMIT 5""").show()
I get:
----------- ---------------- -----------------
| MatchRate | ProposedAction | SimpleMatchRate |
----------- ---------------- -----------------
| 0.043169 | 0.000000 | is_new |
| 88.67153 | 98.96907 | is_linked |
| 89.50349 | 98.94736 | is_linked |
| 99.44025 | 100.00000 | is_dupe |
| 90.78082 | 98.92473 | is_linked |
----------- ---------------- -----------------
CodePudding user response:
I found what I was doing wrong. My schema definition did not correctly follow the order of columns in the input file. ProposedAction
comes after SimpleMatchRate
like below:
. . .Matched MatchRate SimpleMatchRate ProposedAction status
I modified the definition to the below and the issue is fixed:
customschema = ST([
SF("Matched", STR()),
SF("MatchRate", STR()),
SF("SimpleMatchRate", STR()),
SF("ProposedAction", STR()), # Now in the correct position as in the input file
SF("Status", STR())])