Pyspark dropped column not gone


I have a Spark DataFrame. I attempt to drop a column, but in some situations the column still appears to be there.

my_range = spark.range(1000).toDF("number")
new_range = my_range.withColumn('num2', my_range.number*2).drop('number')

# can still sort by "number" column
new_range.sort('number')

Is this a bug? Or am I missing something?

Spark version is v3.3.1, Python 3. I'm on a MacBook Pro M1 (2021).

CodePudding user response:

I called .explain(True) on your sample dataset to see what the query planner does with it.
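For reference, that call looks like this (using the new_range from your snippet):

new_range.sort('number').explain(True)  # prints the parsed, analyzed, optimized and physical plans

Let's take a look at the output, stage by stage: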

== Parsed Logical Plan ==
'Sort ['number ASC NULLS FIRST], true
+- Project [num2#61L]
   +- Project [number#59L, (number#59L * cast(2 as bigint)) AS num2#61L]
      +- Project [id#57L AS number#59L]
         +- Range (0, 1000, step=1, splits=Some(8))

The Parsed Logical Plan is the first, "raw" version of the query plan. Here you can see Project [num2#61L] directly below the Sort (plans read bottom-up, so it runs before the sort) - this is your drop.

But at the next stage, the Analyzed Logical Plan, it's different:

== Analyzed Logical Plan ==
num2: bigint
Project [num2#61L]
+- Sort [number#59L ASC NULLS FIRST], true
   +- Project [num2#61L, number#59L]
      +- Project [number#59L, (number#59L * cast(2 as bigint)) AS num2#61L]
         +- Project [id#57L AS number#59L]
            +- Range (0, 1000, step=1, splits=Some(8))

The analyzer was smart enough to figure out that the sort needs this column, so the Project below the Sort now includes it again. To stay compliant with your code, a new Project [num2#61L] is added above the Sort, dropping the column from the final output.
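Note that the drop did take effect at the DataFrame level - the column only survives inside the analyzed plan. A quick check (again using the new_range from the question):

new_range.columns         # ['num2'] - 'number' is gone from the schema
new_range.printSchema()   # likewise shows only num2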

Now the last stage, the Optimized Logical Plan:

== Optimized Logical Plan ==
Project [num2#61L]
+- Sort [number#59L ASC NULLS FIRST], true
   +- Project [(id#57L * 2) AS num2#61L, id#57L AS number#59L]
      +- Range (0, 1000, step=1, splits=Some(8))

In my opinion it's not a bug but Spark's design. Keep in mind that the whole chain of transformations is resolved within the same query plan, so, due to Spark's lazy nature, it is free to adjust/optimize your code during planning.
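To illustrate the boundary of this design (my example, not from the original code): the analyzer only resurrects missing columns for certain operators, such as Sort - a plain select cannot see the dropped column and fails right away:

new_range.sort('number').show()  # works: the analyzer pulls 'number' back in below the Sort
new_range.select('number')       # raises AnalysisException: Column 'number' does not exist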

CodePudding user response:

The first answer is obviously correct, and whether or not the Spark approach is a good implementation is open to debate - I think it is.

As an embellishment: a checkpoint, if used, turns this into an error:

spark.sparkContext.setCheckpointDir("/foo2/bar")
new_range = new_range.checkpoint()  # truncates the lineage: the plan collapses into a LogicalRDD
new_range.sort('number').show()

returns:

AnalysisException: Column 'number' does not exist. Did you mean one of the following? [num2];
'Sort ['number ASC NULLS FIRST], true
+- LogicalRDD [num2#69L], false
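As a side note (my addition): if you don't want to configure a checkpoint directory, DataFrame.localCheckpoint() truncates the lineage the same way, keeping the checkpointed data on the executors instead of reliable storage:

new_range = new_range.localCheckpoint()  # no setCheckpointDir needed
new_range.sort('number').show()          # fails with the same AnalysisException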