What are the trade-offs of using Python syntax instead of Spark SQL?


For example, I have two syntaxes that accomplish the same thing on a finance DataFrame:

Spark SQL

df.filter("Close < 500").show()

PySpark

df.filter(df["Close"] < 500).show()

Is one of them better for any reason like performance, readability or something else I'm not thinking about?

I'm asking because I'm about to start implementing PySpark in my company, and whatever route I choose will probably become canon there.

Thanks!

CodePudding user response:

It really depends on your use case. I highly suggest you read up on these topics so you have a better idea of what each of them does; I think this covers pretty much everything you need to know for the decision:

  1. What is PySpark
  2. The difference between Spark and PySpark
  3. What happens when you run PySpark
  4. Spark vs PySpark

Good luck!

CodePudding user response:

I guess it depends on your coworkers: if they mostly use SQL, Spark SQL will have a big selling point (not that this should be the main reason to decide).

For readability and, more importantly, refactoring possibilities, I would go with plain DataFrames. And if you are concerned about performance, you can always call df.explain() on both options and compare the plans.
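
As a minimal sketch of that comparison (the SparkSession setup and the tiny sample DataFrame are made up here just so the plans can be printed; only the Close column comes from your question):

from pyspark.sql import SparkSession

# Hypothetical session and data, just to have something to compare plans on.
spark = SparkSession.builder.appName("filter-compare").getOrCreate()
df = spark.createDataFrame([(490.0,), (510.0,)], ["Close"])

# Both forms go through the Catalyst optimizer, so the physical plans should match.
df.filter("Close < 500").explain()       # SQL expression string
df.filter(df["Close"] < 500).explain()   # column expression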

All of this mainly applies when spark.sql() contains complex queries; for the simple examples above I do not think it really matters.
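
For completeness, a rough sketch of the spark.sql() route, continuing from the snippet above (the view name "prices" is made up for the example):

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("prices")

# The same filter expressed as a standalone SQL statement.
spark.sql("SELECT * FROM prices WHERE Close < 500").show()

# Its plan can be compared with the DataFrame version in the same way.
spark.sql("SELECT * FROM prices WHERE Close < 500").explain()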
