Good morning,
I have the following variables:
self.filters = 'px_variation > 0.15'
df
If I do df.collect() I get:
Row(px_variation=nan, subject_code='1010', list_tr_id=['X0', 'X1'], list_quantity=[3000.0, 1.0], list_cash_qty=[16500.0, 5.5])
I try to apply the following filter:
df.filter(self.filters)
And its result is:
Row(px_variation=nan, subject_code='1010', list_tr_id=['X0', 'X1'], list_quantity=[3000.0, 1.0], list_cash_qty=[16500.0, 5.5])
As you can see, px_variation in my DataFrame is a numpy.nan, but after applying the filter function the row is not filtered out. Why isn't Spark SQL ignoring NaN, or at least using it in the comparison so the row gets filtered?
If I do the same comparison in plain Python, the result is as expected.
df.collect()[0].px_variation > 0.15 -> Result: False
Any idea? Thank you.
CodePudding user response:
By Spark's NaN semantics, the special value NaN is treated as larger than any other numeric value, even "larger" than infinity. So for the NaN row the predicate px_variation > 0.15 evaluates to true and the row is kept rather than filtered out.
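As a minimal sketch of that behaviour (using a throwaway local SparkSession and a toy DataFrame, not your actual data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').getOrCreate()

# Toy data: one NaN row and two ordinary rows.
df = spark.createDataFrame([(float('nan'),), (0.1,), (0.2,)], ['px_variation'])

# Because NaN orders above every number, NaN > 0.15 is true and the NaN row survives.
df.filter('px_variation > 0.15').show()
# Expected output: the NaN row and the 0.2 row.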
One option is to change the filter to exclude NaN explicitly:
filters = 'px_variation > 0.15 and not isnan(px_variation)'
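Applied to a toy DataFrame like the one in the sketch above, that predicate keeps only the rows you actually want:

df.filter('px_variation > 0.15 and not isnan(px_variation)').show()
# Expected output: only the 0.2 row; the NaN row is now excluded explicitly.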
Another option to handle the NaN values is to replace them with None/null:
df.replace(float('nan'), None).filter('px_variation > 0.15')
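Once the NaN values have been replaced, the comparison null > 0.15 evaluates to null rather than true, and filter only keeps rows where the predicate is true, so the row is dropped. Checking against the toy DataFrame from the sketch above:

df.replace(float('nan'), None).filter('px_variation > 0.15').show()
# Expected output: only the 0.2 row remains.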