I have been using PySpark (which I have set up locally on my machine) to query a large dataset, which has worked well so far. However, I would like to create pandas DataFrames for some of the data.
SummaryStatList = []
for i in dataColumns:
    SummaryStatList.append(df.groupby('Status').agg(func.min(df[i]).alias(i + ' Min'),
                                                    func.max(df[i]).alias(i + ' Max'),
                                                    func.mean(df[i]).alias(i + ' Mean'),
                                                    func.stddev(df[i]).alias(i + ' Std Dev'),
                                                    func.percentile_approx(i, 0.5).alias(i + ' Median')))
dfSummaryStat = pd.DataFrame(SummaryStatList)
But upon creating the pandas DataFrame I get this error:
Traceback (most recent call last):
File "f:\Homework\UniCS\CS Year 3\Big Data\.env_BigData\BigDataAssingment.py", line 6, in <module>
from pyspark.sql.pandas._typing import PandasDataFrame
ImportError: cannot import name 'PandasDataFrame' from 'pyspark.sql.pandas._typing' (unknown location)
These are the imported libraries:
import os
import findspark
import pandas as pd
import matplotlib
from pandas.core.frame import DataFrame
from pyspark.sql.pandas._typing import PandasDataFrame
import seaborn
import numpy as np
import sklearn
from pyspark.ml.feature import Imputer
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext("local", "Big Data Assignment")
sql = SQLContext(sc)
import pyspark.sql.functions as func
import pyspark
from pyspark.sql import SparkSession
Now I'm getting this error:
f:\Homework\UniCS\CS Year 3\Big Data\.env_BigData\lib\site-packages\pyspark\sql\context.py:77: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
warnings.warn(
Traceback (most recent call last):
File "f:\Homework\UniCS\CS Year 3\Big Data\.env_BigData\lib\site-packages\findspark.py", line 143, in init
py4j = glob(os.path.join(spark_python, "lib", "py4j-*.zip"))[0]
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "f:\Homework\UniCS\CS Year 3\Big Data\.env_BigData\BigDataAssingment.py", line 20, in <module>
findspark.init()
File "f:\Homework\UniCS\CS Year 3\Big Data\.env_BigData\lib\site-packages\findspark.py", line 145, in init
raise Exception(
Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly
I have no idea why, since it worked before.
CodePudding user response:
Try downloading Spark:
wget -q https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
or download it manually, then unzip it and add the following lines to your Python file:
import os
os.environ["SPARK_HOME"] = "/path/to/file/spark-3.0.1-bin-hadoop3.2"
CodePudding user response:
Each item you append inside the loop is itself a Spark DataFrame, not a row of values, so pd.DataFrame(SummaryStatList) won't build the frame you expect. Just use .toPandas() on each Spark DataFrame:
pandas_df = spark_df.toPandas()
You can also drop the from pyspark.sql.pandas._typing import PandasDataFrame line entirely; _typing is an internal typing module, and you don't need it to get a pandas frame.
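Once each aggregation has been converted with .toPandas(), the per-column summaries can be merged on the grouping key. A minimal sketch of that combining step, using plain pandas frames with hypothetical values to stand in for the .toPandas() results:

```python
import pandas as pd

# Stand-ins (hypothetical values) for what two
# df.groupby('Status').agg(...).toPandas() calls might return
age_stats = pd.DataFrame({"Status": ["A", "B"],
                          "Age Min": [18, 21],
                          "Age Max": [65, 70]})
score_stats = pd.DataFrame({"Status": ["A", "B"],
                            "Score Min": [1.0, 2.0],
                            "Score Max": [9.5, 8.0]})

# Merge the per-column summaries on the grouping key, giving one
# summary frame with a row per Status value
dfSummaryStat = age_stats.merge(score_stats, on="Status")
```

In the original loop you would call .toPandas() on each appended Spark DataFrame and fold the results together the same way.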