Starting to use PySpark on Databricks, and I see I can import pyspark.pandas alongside pandas. What is the difference? I assume it's not like Koalas, right?
CodePudding user response:
PySpark is an interface for Apache Spark in Python. It allows you to write Spark applications using Python and provides the PySpark shell to analyze data in a distributed environment.
pyspark.pandas is an API that allows you to use pandas functions and operations on Spark DataFrames.
Koalas is an earlier library developed by Databricks for running pandas-like operations on Spark data; it was merged into Apache Spark itself in version 3.2 as pyspark.pandas, so the two are essentially the same project.
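To make the difference concrete, here is a minimal sketch (assuming Spark 3.2+, where DataFrame.pandas_api() is available, and using made-up example data) of the same filter written against a plain Spark DataFrame and against its pandas-on-Spark view:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# plain PySpark: Spark's own DataFrame API
sdf = spark.createDataFrame([(1, 3), (2, 4)], ["a", "b"])
sdf.filter(sdf.a > 1).show()

# pandas API on Spark: same data, pandas-style syntax,
# still executed as distributed Spark jobs
psdf = sdf.pandas_api()
print(psdf[psdf["a"] > 1])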
This blog post shows some differences between pyspark.pandas and plain PySpark: https://towardsdatascience.com/run-pandas-as-fast-as-spark-f5eefe780c45
The pyspark.pandas documentation is, of course, the reference: https://spark.apache.org/docs/3.3.0/api/python/reference/pyspark.pandas/index.html
CodePudding user response:
pyspark.pandas is an alternative to pandas with the same API as pandas. This means you can work with pyspark.pandas in much the same way you work with pandas. For example, you create a DataFrame with .DataFrame, just as in pandas, and can use methods like .iloc or .drop_duplicates:
import pyspark.pandas as ps

# create a pandas-on-Spark DataFrame with the familiar pandas syntax
df = ps.DataFrame({'a': [1, 2], 'b': [3, 4]})
df = df.sort_values('b')  # sort_values returns a new, sorted DataFrame

# read a CSV directly into a pandas-on-Spark DataFrame
df1 = ps.read_csv('data.csv')
df1 = df1.sort_values(by="date")
Also, you can convert the pandas-on-Spark DataFrame to a plain pandas DataFrame:
df.to_pandas()  # returns a pandas DataFrame, collecting all the data onto the driver
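The reverse direction works too; a small sketch (assuming an active Spark session and hypothetical data) of a round trip between pandas and pyspark.pandas:
import pandas as pd
import pyspark.pandas as ps

pdf = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})  # plain pandas, single machine
psdf = ps.from_pandas(pdf)  # distribute it across the Spark cluster
pdf2 = psdf.to_pandas()     # collect it back to the driver as plain pandas
Keep in mind that .to_pandas() pulls all the data onto the driver, so it only makes sense when the result fits in a single machine's memory.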