Can anyone help with the following?
I have a CSV file with data about the passengers on the Titanic. I'll post the code I have so far, then describe the problem I'm running into.
# Initialise the SparkSession and print it to confirm Spark is running
# You should see output similar to: <pyspark.sql.session.SparkSession object at 0x7f1d98711460>
from pyspark.sql import SparkSession
spark_ex=SparkSession.builder.getOrCreate()
print(spark_ex)
# Print the tables in the catalog
# spark_ex.catalog.listDatabases()
spark_ex.catalog.listTables()
# Import statement for pandas as pd
# import pyspark.sql.functions as sqlfunc
# from pyspark.sql.types import *
# from pyspark.sql import *
# import numpy as np
import pandas as pd
# Read in the train.csv using pd.read_csv and assign it to a variable called 'df1'
df1=pd.read_csv('train.csv')
# Create a second dataframe 'df2' by slicing 'df1' with .iloc on an array of [:, 0:4] for the first 4 columns
df2=pd.DataFrame(df1.iloc[:, 0:4])
# Pass in the df2 frame into a variable called 'spark_df' using your spark_ex and .createDataFrame
spark_df=spark_ex.createDataFrame(df2)
# Using your now created spark_df dataframe, name the table 'sample_titanic' using .registerTempTable
# spark_df.registerTempTable("sample_titanic")
spark_df.createOrReplaceTempView("sample_titanic") # used this because the documentation says .registerTempTable is deprecated in favour of it
# Print the spark_df using .show
spark_df.show()
I think everything is fine up to this point, but then I am asked to do this:
1# Create a SQL Query that selects a limit of 10 from your Sample Titanic table
2# Assign a variable that ingests the limit 10 query using spark_ex.sql
For 1, the only answer I can find that works is spark_df.limit(10).collect(), but as I understand it that is not a SQL query, and none of the actual SQL queries I have tried have worked.
For 2, I am completely stumped about what I am being asked to do and how to do it.
Any help would be much appreciated.
CodePudding user response:
There are four ways (or four APIs) to write Spark SQL code: SQL, Python, Scala, and R. I will only cover SQL and Python here (Scala and R are much the same as Python).
To use the Python "interface", you can refer to the API documentation at http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html. An example of this is
df.select('A', 'B').groupBy('A').sum('B')
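For example, against your Titanic data (column names assumed from the standard train.csv, whose first four columns are PassengerId, Survived, Pclass, and Name):
# count passengers per survival status, using only the Python API
spark_df.groupBy('Survived').count().show()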
To use the SQL "interface", you first have to create a temporary view, as you did with spark_df.createOrReplaceTempView("sample_titanic"). From then on, you can write queries like this
spark.sql('select A, B from sample_titanic')
Each of these returns a DataFrame that you can keep working with through the Python "interface", like this
df2 = df.select('A', 'B')
df3 = spark.sql('...')
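Applied to your assignment, a minimal sketch (assuming the spark_ex session and sample_titanic view from your code) would be:
# 1: a SQL query that selects a limit of 10 from the sample_titanic table
query = 'SELECT * FROM sample_titanic LIMIT 10'
# 2: a variable that ingests the limit-10 query using spark_ex.sql
limit10 = spark_ex.sql(query)
limit10.show()  # limit10 is an ordinary Spark DataFrame holding the 10 rows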
Good luck with your assignments :)
CodePudding user response:
This seems to be working. I would love to know what people think:
# Create a SQL Query that selects a limit of 10 from your Sample Titanic table
spark_df.limit(10).collect()  # note: .collect() returns a list of Row objects, not a SQL query result
# Assign a variable that ingests the limit 10 query using spark_ex.sql
limit10 = spark_ex.createDataFrame(spark_df.limit(10).collect())  # rebuilds a DataFrame from the collected rows, but never actually calls spark_ex.sql
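To sanity-check this against a plain spark_ex.sql query (reusing the sample_titanic view from earlier):
via_sql = spark_ex.sql('SELECT * FROM sample_titanic LIMIT 10')
print(limit10.count() == via_sql.count())  # both should report 10 rows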