Good morning,
I have the following variables:
self.filters = 'px_variation > 0.15'
df
If I do df.collect() I get:
Row(px_variation=nan, subject_code='1010', list_tr_id=['X0', 'X1'], list_quantity=[3000.0, 1.0], list_cash_qty=[16500.0, 5.5])
I try to apply the following filter:
df.filter(self.filters)
And its result is:
Row(px_variation=nan, subject_code='1010', list_tr_id=['X0', 'X1'], list_quantity=[3000.0, 1.0], list_cash_qty=[16500.0, 5.5])
As you can see, px_variation in my DataFrame is a numpy.nan, but after applying the filter function the row is not filtered out. Why isn't Spark SQL ignoring NaN, or at least using it in the comparison so the row gets filtered?
If I do the same comparison in plain Python, the result is as expected.
df.collect()[0].px_variation > 0.15 -> Result: False
Any idea? Thank you.
CodePudding user response:
By Spark's NaN semantics, the special value NaN is treated as larger than any other numeric value, even "larger" than infinity. So for the NaN row the predicate px_variation > 0.15 evaluates to true and the row is kept rather than filtered out.
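As a minimal sketch of that behaviour (using a throwaway local SparkSession and a toy DataFrame, not your actual data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').getOrCreate()

# Toy data: one NaN row and two ordinary rows.
df = spark.createDataFrame([(float('nan'),), (0.1,), (0.2,)], ['px_variation'])

# Because NaN orders above every number, NaN > 0.15 is true and the NaN row survives.
df.filter('px_variation > 0.15').show()
# Expected output: the NaN row and the 0.2 row.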
One option is to change the filter to exclude NaN explicitly:
filters = 'px_variation > 0.15 and not isnan(px_variation)'
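Applied to a toy DataFrame like the one in the sketch above, that predicate keeps only the rows you actually want:

df.filter('px_variation > 0.15 and not isnan(px_variation)').show()
# Expected output: only the 0.2 row; the NaN row is now excluded explicitly.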
Another option to handle the NaN values is to replace them with None/null:
df.replace(float('nan'), None).filter('px_variation > 0.15')
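Once the NaN values have been replaced, the comparison null > 0.15 evaluates to null rather than true, and filter only keeps rows where the predicate is true, so the row is dropped. Checking against the toy DataFrame from the sketch above:

df.replace(float('nan'), None).filter('px_variation > 0.15').show()
# Expected output: only the 0.2 row remains.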