I've been trying to use the df.collect() method to see the content of my cells in PySpark 3.1.2, but it keeps returning an empty list:

etp.collect()
[]

even though etp.show() gives me results.
The code I'm using:
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName('Read_csv')\
    .getOrCreate()

etp = spark.read.options(header=True)\
    .options(delimiter=';')\
    .options(inferSchema='true')\
    .csv(r"mypath\etp.csv")
etp.collect()
I've tried changing the delimiter; same problem.
My goal is to iterate over the content of the cells based on the row number, but if I can't access the content it's no use. Any ideas of things I could try or change?
Thanks in advance
Edit: I'm using Jupyter Notebook.

Edit 2: I've tried other operations such as withColumn... and they seem to work. select().show() also works. It feels like .collect() has been changed, but I can't find the info.
CodePudding user response:
Using the following DataFrame as an example, which has a distinct ROW_ID column:

+------+----+---+
|ROW_ID|NAME|AGE|
+------+----+---+
|     1|John| 50|
|     2|Anna| 32|
|     3|Josh| 41|
|     4|Paul| 98|
+------+----+---+
You can access the NAME cell of the 3rd row with the following:
df.where(df["ROW_ID"] == 3).collect()[0]["NAME"]
Feel free to recreate this example with the following code:
from pyspark.sql import SparkSession, types

# Create (or reuse) a session so the snippet runs standalone.
spark = SparkSession.builder.getOrCreate()

data = [
    [1, "John", 50],
    [2, "Anna", 32],
    [3, "Josh", 41],
    [4, "Paul", 98],
]

arr_schema = types.StructType([
    types.StructField('ROW_ID', types.IntegerType()),
    types.StructField('NAME', types.StringType()),
    types.StructField('AGE', types.IntegerType()),
])

df = spark.createDataFrame(data, schema=arr_schema)
df.where(df["ROW_ID"] == 3).collect()[0]["NAME"]
CodePudding user response:
I suspected a bad installation, so I uninstalled Anaconda and then created a virtual env where I installed only the packages I needed, and it worked.