Can anyone help with the following?
I have a CSV file with data about the passengers on the Titanic. I'll post the code I have so far, then describe the problem I'm running into.
# Initialise the SparkSession and print it to confirm Spark is running
# You should see output similar to: <pyspark.sql.session.SparkSession object at 0x7f1d98711460>
from pyspark.sql import SparkSession
spark_ex=SparkSession.builder.getOrCreate()
print(spark_ex)
# Print the tables in the catalog
# spark_ex.catalog.listDatabases()
spark_ex.catalog.listTables()
# Import statement for pandas as pd
# import pyspark.sql.functions as sqlfunc
# from pyspark.sql.types import *
# from pyspark.sql import *
# import numpy as np
import pandas as pd
# Read in the train.csv using pd.read_csv and assign it to a variable called 'df1'
df1=pd.read_csv('train.csv')
# Create a second dataframe 'df2' by slicing 'df1' with .iloc on an array of [:, 0:4] for the first 4 columns
df2=pd.DataFrame(df1.iloc[:, 0:4])
# Pass in the df2 frame into a variable called 'spark_df' using your spark_ex and .createDataFrame
spark_df=spark_ex.createDataFrame(df2)
# Using your now created spark_df dataframe, name the table 'sample_titanic' using .registerTempTable
# spark_df.registerTempTable("sample_titanic")
spark_df.createOrReplaceTempView("sample_titanic") # used this because the documentation says .registerTempTable is deprecated in favour of it
# Print the spark_df using .show
spark_df.show()
I think everything is fine up to this point, but then I am asked to do this:
1# Create a SQL Query that selects a limit of 10 from your Sample Titanic table
2# Assign a variable that ingests the limit 10 query using spark_ex.sql
For 1, the only answer I can find that works is spark_df.limit(10).collect(), but as I understand it that is not a SQL query, and none of the actual SQL queries I have tried have worked.
For 2, I am completely stumped about what I am being asked to do and how to do it.
Any help would be much appreciated.
CodePudding user response:
There are four ways (or four APIs) to write Spark SQL code: SQL, Python, Scala, and R. I will only cover SQL and Python here (Scala and R are much the same as Python).
To use the Python "interface", you can refer to the API documentation at http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html. An example of this is
df.select('A', 'B').groupBy('A').sum('B')
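For example, against your Titanic data (column names assumed from the standard train.csv, whose first four columns are PassengerId, Survived, Pclass, and Name):
# count passengers per survival status, using only the Python API
spark_df.groupBy('Survived').count().show()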
To use the SQL "interface", you first have to create a temporary view, as you did with spark_df.createOrReplaceTempView("sample_titanic"). From then on, you can write queries like this
spark.sql('select A, B from sample_titanic')
Each of these returns a DataFrame that you can keep working with through the Python "interface", like this
df2 = df.select('A', 'B')
df3 = spark.sql('...')
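Applied to your assignment, a minimal sketch (assuming the spark_ex session and sample_titanic view from your code) would be:
# 1: a SQL query that selects a limit of 10 from the sample_titanic table
query = 'SELECT * FROM sample_titanic LIMIT 10'
# 2: a variable that ingests the limit-10 query using spark_ex.sql
limit10 = spark_ex.sql(query)
limit10.show()  # limit10 is an ordinary Spark DataFrame holding the 10 rows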
Good luck with your assignments :)
CodePudding user response:
This seems to be working. I would love to know what people think:
# Create a SQL Query that selects a limit of 10 from your Sample Titanic table
spark_df.limit(10).collect()  # note: .collect() returns a list of Row objects, not a SQL query result
# Assign a variable that ingests the limit 10 query using spark_ex.sql
limit10 = spark_ex.createDataFrame(spark_df.limit(10).collect())  # rebuilds a DataFrame from the collected rows, but never actually calls spark_ex.sql
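To sanity-check this against a plain spark_ex.sql query (reusing the sample_titanic view from earlier):
via_sql = spark_ex.sql('SELECT * FROM sample_titanic LIMIT 10')
print(limit10.count() == via_sql.count())  # both should report 10 rows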