Starting to use PySpark on Databricks, and I see I can import pyspark.pandas alongside pandas. What is the difference? I assume it's not like Koalas, right?
CodePudding user response:
PySpark is an interface for Apache Spark in Python. It allows you to write Spark applications using Python and provides the PySpark shell to analyze data in a distributed environment.
pyspark.pandas is an API that allows you to use pandas functions and operations on Spark DataFrames.
Koalas is an earlier library developed by Databricks for running pandas-like operations on Spark data; it was merged into Apache Spark itself in version 3.2 as pyspark.pandas, so the two are essentially the same project.
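To make the difference concrete, here is a minimal sketch (assuming Spark 3.2+, where DataFrame.pandas_api() is available, and using made-up example data) of the same filter written against a plain Spark DataFrame and against its pandas-on-Spark view:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# plain PySpark: Spark's own DataFrame API
sdf = spark.createDataFrame([(1, 3), (2, 4)], ["a", "b"])
sdf.filter(sdf.a > 1).show()

# pandas API on Spark: same data, pandas-style syntax,
# still executed as distributed Spark jobs
psdf = sdf.pandas_api()
print(psdf[psdf["a"] > 1])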
This blog post shows some differences between pyspark.pandas and plain PySpark: https://towardsdatascience.com/run-pandas-as-fast-as-spark-f5eefe780c45
The pyspark.pandas documentation is, of course, the reference: https://spark.apache.org/docs/3.3.0/api/python/reference/pyspark.pandas/index.html
CodePudding user response:
pyspark.pandas is an alternative to pandas with the same API as pandas. This means you can work with pyspark.pandas in much the same way you work with pandas. For example, you create a DataFrame with .DataFrame, just as in pandas, and can use methods like .iloc or .drop_duplicates:
import pyspark.pandas as ps

# create a pandas-on-Spark DataFrame with the familiar pandas syntax
df = ps.DataFrame({'a': [1, 2], 'b': [3, 4]})
df = df.sort_values('b')  # sort_values returns a new, sorted DataFrame

# read a CSV directly into a pandas-on-Spark DataFrame
df1 = ps.read_csv('data.csv')
df1 = df1.sort_values(by="date")
Also, you can convert the pandas-on-Spark DataFrame to a plain pandas DataFrame:
df.to_pandas()  # returns a pandas DataFrame, collecting all the data onto the driver
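The reverse direction works too; a small sketch (assuming an active Spark session and hypothetical data) of a round trip between pandas and pyspark.pandas:
import pandas as pd
import pyspark.pandas as ps

pdf = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})  # plain pandas, single machine
psdf = ps.from_pandas(pdf)  # distribute it across the Spark cluster
pdf2 = psdf.to_pandas()     # collect it back to the driver as plain pandas
Keep in mind that .to_pandas() pulls all the data onto the driver, so it only makes sense when the result fits in a single machine's memory.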