I had this line of code in Python:
d = float(round(100.00 - (null_count / total) * 100, 2))
I wanted to convert it into PySpark code, so I wrote this:
d = round((100.00-(null_count/total)*100).cast("float"), 2)
but this gives the error:
'float' object has no attribute 'cast'
CodePudding user response:
In programming, you must know your data types (classes).
You wanted to use this cast method:
Column.cast(dataType: Union[pyspark.sql.types.DataType, str]) → pyspark.sql.column.Column
You must know your data types (classes):
A.cast(B) → C
A: the parent class of the method. It's the pyspark.sql.column.Column class (a.k.a. pyspark.sql.Column).
B: the input to the method. According to the documentation line above, it can be either a pyspark.sql.types.DataType or a str.
C: the output class. According to the documentation line above, it's pyspark.sql.column.Column.
In your case, your A has the wrong data type to be chained with cast; in other words, the class of A doesn't have a cast method. Since your A = number1 - number2 / number3 * number4 is a float object, the error tells you precisely that: "'float' object has no attribute 'cast'".
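To make the distinction concrete, here is a minimal sketch (assuming an active SparkSession, created as shown further below; the column name 'x' is made up for illustration). A Column object has a cast method, while a plain Python float does not:
from pyspark.sql import functions as F

c = F.col('x').cast('float')   # F.col('x') is a Column, so .cast works
print(type(c))                 # <class 'pyspark.sql.column.Column'>

a = 100.00 - (2 / 50) * 100    # plain Python arithmetic -> a float
# a.cast('float')              # would raise: 'float' object has no attribute 'cast'
Note that cast also accepts a DataType instance instead of a string, e.g. FloatType() imported from pyspark.sql.types, since the signature allows either form.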
Regarding the translation of your Python code to PySpark: as written, it doesn't really make sense, because you do the calculation on plain variables, just two of them. pyspark.sql.Column objects are called columns because they hold many different values. So to make the translation meaningful, you must create a dataframe (columns alone are not enough for actual calculations) and put some values in its columns.
I'll just show you how it could work if you had just one row.
Creating a Spark session (not needed if you run the code in the PySpark shell):
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
Creating and printing the dataframe:
df = spark.createDataFrame([(2, 50)], ['null_count', 'total'])
df.show()
# +----------+-----+
# |null_count|total|
# +----------+-----+
# |         2|   50|
# +----------+-----+
Adding a column using your logic, but working with Spark columns instead of Python variables.
df = df.withColumn('d', F.round(100 - F.col('null_count') / F.col('total') * 100, 2).cast('float'))
df.show()
# +----------+-----+----+
# |null_count|total|   d|
# +----------+-----+----+
# |         2|   50|96.0|
# +----------+-----+----+
Python's round was also replaced with PySpark's F.round, because the argument to the function is now a Spark column expression (i.e. a column) as opposed to a single value or variable.
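If you then need d back as a plain Python number, you can extract it from the dataframe. A minimal sketch, assuming the df from above:
d = df.first()['d']   # first() returns a Row; index it by column name
print(d)              # 96.0
Conversely, if your null_count and total are already plain Python numbers (for example, counts you computed earlier), your original pure-Python line works as-is and no cast is needed.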