Spark Shell: SQL Query doesn't return any results when data is integer/double


I am using the MongoDB Spark Connector to import data from MongoDB and then run some SQL queries on it. Since this is the first time I am using these tools, I will describe the whole process before getting to the actual problem, in case I have done something wrong along the way.

I initialize spark-shell with the specific Collection, including the connector package:

spark-shell --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/myDb.myBigCollection" \
  --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1

I enable the connector:

import com.mongodb.spark._

I make an RDD of the collection I want:

val rdd = MongoSpark.load(sc)

I make a Dataframe out of it so I can use SQL on it:

val df = rdd.toDF()

I create a temporary view of the dataset:

df.createOrReplaceTempView("sales")

The database has 1 million entries in this format:

_id: 61a6540c3838fe02b81e5339
Region: "Sub-Saharan Africa"
Country: "Ghana"
Item Type: "Office Supplies"
Sales Channel: "Online"
Order Priority: "L"
Order Date: 2017-03-22T22:00:00.000+00:00
Order ID: 601245963
Ship Date: 2017-04-14T21:00:00.000+00:00
Units Sold: 896
Unit Price: 651.21
Unit Cost: 524.96
Total Revenue: 583484.16
Total Cost: 470364.16
Total Profit: 113120

The datatypes are string/date/number/double.
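To double-check how the connector inferred these types, the schema of the DataFrame can be printed (I am not reproducing the output here, since it depends on what the connector infers from the collection):

// Inspect the schema Spark inferred from the MongoDB collection,
// e.g. to confirm that "Units Sold" came through as a numeric type.
df.printSchema()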

A simple query like this works just fine:

spark.sql("SELECT * FROM sales").show()

Queries on strings work fine too:

spark.sql("SELECT Country FROM sales WHERE Region = 'Sub-Saharan Africa'").show()

But when I do a query on numerical values I get back an empty table:

spark.sql("SELECT * FROM sales WHERE 'Units Sold' > 4000").show()

I have tried assigning the query to a val and then calling show() on it, but the result is the same.
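Roughly what I tried (the val name is just illustrative):

// Assign the query result to a val, then call show() on it.
val result = spark.sql("SELECT * FROM sales WHERE 'Units Sold' > 4000")
result.show()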

CodePudding user response:

If it is not a typo/copy-paste error in your SELECT, the WHERE clause compares the string "Units Sold" to the numeric value 4000, which is never true. The proper way to escape column names in Spark SQL is with backticks (`), not apostrophes (').

So use the following query instead:

spark.sql("SELECT * FROM sales WHERE `Units Sold` > 4000").show()
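If you prefer to avoid quoting the column name at all, the same filter can also be written with the DataFrame API (assuming "Units Sold" was inferred as a numeric type):

import org.apache.spark.sql.functions.col

// Filter on the column directly; spaces in the column name are fine here.
df.filter(col("Units Sold") > 4000).show()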