I've been trying to use the df.collect() method to see the content of my cells in PySpark 3.1.2, but it keeps returning an empty list:

etp.collect()
[]

even though etp.show() gives me results.
The code I'm using:
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName('Read_csv')\
    .getOrCreate()

etp = spark.read.options(header=True)\
    .options(delimiter=';')\
    .options(inferSchema='true')\
    .csv(r"mypath\etp.csv")
etp.collect()
I've tried changing the delimiter; same problem.
My goal is to iterate over the content of the cells based on the row number, but if I can't access the content it's no use. Any ideas of things I could try or change?
Thanks in advance
Edit: I'm using Jupyter Notebook.

Edit 2: I've tried other operations such as withColumn... and they seem to work. select().show() also works. It feels like .collect() has been changed, but I can't find the info.
CodePudding user response:
Using the following DataFrame as an example, which has a distinct ROW_ID column:

+------+----+---+
|ROW_ID|NAME|AGE|
+------+----+---+
|     1|John| 50|
|     2|Anna| 32|
|     3|Josh| 41|
|     4|Paul| 98|
+------+----+---+
You can access the NAME cell of the 3rd row with the following:
df.where(df["ROW_ID"] == 3).collect()[0]["NAME"]
Feel free to recreate this example with the following code:
from pyspark.sql import SparkSession, types

# Create (or reuse) a session so the snippet runs standalone.
spark = SparkSession.builder.getOrCreate()

data = [
    [1, "John", 50],
    [2, "Anna", 32],
    [3, "Josh", 41],
    [4, "Paul", 98],
]

arr_schema = types.StructType([
    types.StructField('ROW_ID', types.IntegerType()),
    types.StructField('NAME', types.StringType()),
    types.StructField('AGE', types.IntegerType()),
])

df = spark.createDataFrame(data, schema=arr_schema)
df.where(df["ROW_ID"] == 3).collect()[0]["NAME"]
CodePudding user response:
I suspected a bad installation, so I uninstalled Anaconda and then created a virtual env where I installed only the packages I needed, and it worked.