Create an SQL Query that selects a limit of 10 from table in pyspark


Can anyone help with the following?

I have a CSV file with data about the passengers on the Titanic. I'll put the code that I have so far, and then the problem I'm having after it.

# Initialise a SparkSession and print it to check that Spark is running
# You should get a printout similar to: <pyspark.sql.session.SparkSession object at 0x7f1d98711460>
from pyspark.sql import SparkSession

spark_ex = SparkSession.builder.getOrCreate()
print(spark_ex)


# Print the tables in the catalog
# spark_ex.catalog.listDatabases()
spark_ex.catalog.listTables()


# Import statement for pandas as pd (the other imports are left commented out; they aren't needed here)
# import pyspark.sql.functions as sqlfunc
# from pyspark.sql.types import *
# from pyspark.sql import *
# import numpy as np
import pandas as pd

# Read in train.csv using pd.read_csv and assign it to a variable called 'df1'
df1 = pd.read_csv('train.csv')

# Create a second dataframe 'df2' by slicing 'df1' with .iloc[:, 0:4] for the first 4 columns
df2 = pd.DataFrame(df1.iloc[:, 0:4])

# Convert df2 into a Spark DataFrame called 'spark_df' using spark_ex.createDataFrame
spark_df = spark_ex.createDataFrame(df2)

# Using your now created spark_df dataframe, register the table 'sample_titanic'
# spark_df.registerTempTable("sample_titanic")
spark_df.createOrReplaceTempView("sample_titanic")  # used this instead, as the docs say registerTempTable is deprecated

# Print the spark_df using .show
spark_df.show()

I think that up to this point everything is ok.

But then I am asked to do this:

1# Create a SQL Query that selects a limit of 10 from your Sample Titanic table

2# Assign a variable that ingests the limit 10 query using spark_ex.sql

For 1, the only answer I can find that works is spark_df.limit(10).collect(), but as far as I understand that is not an SQL query, and none of the actual SQL queries I have tried have worked.

Then for 2, I am totally stumped as to what I'm being asked to do and how to do it.

Any help would be much appreciated.

CodePudding user response:

There are four ways (or four APIs) to write Spark SQL code: SQL, Python, Scala, and R. I will only talk about SQL and Python here (Scala and R are fairly similar to Python).

To use the Python "interface", you can refer to the API documentation at http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html. An example of this is:

df.select('A', 'B').groupBy('A').sum('B').count()
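
If it helps, here is a self-contained sketch of that chain (the column names A and B and the toy rows are placeholders, not from your data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame with two columns, A and B
df = spark.createDataFrame([('x', 1), ('x', 2), ('y', 3)], ['A', 'B'])

# select two columns, group by A, sum B within each group,
# then count the group rows (2 here, one per distinct value of A)
print(df.select('A', 'B').groupBy('A').sum('B').count())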

To use the SQL "interface", you first have to create a temporary view, as you did with spark_df.createOrReplaceTempView("sample_titanic"). From then on, you can write queries like this:

spark.sql('select A, B from sample_titanic')

Each of these returns a DataFrame that you can keep working with through the Python "interface", like this:

df2 = df.select('A', 'B')
df3 = spark.sql('...')
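
Applied to your case, a minimal sketch of mixing the two (this assumes your session variable spark_ex and the sample_titanic view from your code, and that Name and Survived are among the first four columns of your train.csv, as in the usual Kaggle Titanic file):

# SQL interface: the query returns a DataFrame
survivors = spark_ex.sql('SELECT Name, Survived FROM sample_titanic')

# Python interface: keep transforming the result
survivors.groupBy('Survived').count().show()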

Good luck with your assignments :)

CodePudding user response:

This seems to be working. I would love to know what people think:

# Create a SQL Query that selects a limit of 10 from your Sample Titanic table
spark_df.limit(10).collect()

# Assign a variable that ingests the limit 10 query using spark_ex.sql
limit10 = spark_ex.createDataFrame(spark_df.limit(10).collect())
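
That said, this version round-trips the data: .collect() pulls the ten rows into the driver as Python objects, and .createDataFrame sends them straight back to Spark. Since the task says "using spark_ex.sql", I suspect the intended answer is to run the query directly against the temp view (a sketch, assuming the sample_titanic view created above):

# 1. The SQL query that selects a limit of 10 from the Sample Titanic table
# 2. Assigned to a variable by ingesting that query through spark_ex.sql
limit10 = spark_ex.sql('SELECT * FROM sample_titanic LIMIT 10')
limit10.show()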