I have a column of arrays of numbers, e.g. [0, 80, 160, 220], and would like to create a column of arrays of the differences between adjacent terms, e.g. [80, 80, 60].
Does anyone have an idea how to do this in Python/PySpark? My code is
df = df.withColumn('col_array_diffs', [df.col_array.getItem(i) - df.col_array.getItem(i - 1) if i else None for i in range(1, F.size(df.col_array))])
but I am really struggling with the ArrayType. This produces AssertionError: col should be Column... Thanks!
CodePudding user response:
You can use a UDF to do this.
import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T

def subtract_el(x):
    # absolute difference between each pair of adjacent elements
    return [abs(i - j) for i, j in zip(x, x[1:])]

# single-column DataFrame; pandas names the column "0"
df = spark.createDataFrame(pd.DataFrame([[[0, 80, 160, 220]]]))
df.select(F.udf(subtract_el, T.ArrayType(T.IntegerType()))("0").alias("diff")).show()
Results in:
+------------+
|        diff|
+------------+
|[80, 80, 60]|
+------------+
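If you are on Spark 2.4 or later, you can also avoid the UDF entirely with the built-in higher-order functions. Here is a minimal sketch, assuming your DataFrame has an array column named col_array as in the question:

import pyspark.sql.functions as F

# Drop the first element with slice(), then subtract the previous element by index:
# for each x at position i of the sliced array, compute x - col_array[i].
df = df.withColumn(
    "col_array_diffs",
    F.expr("transform(slice(col_array, 2, size(col_array) - 1), (x, i) -> x - col_array[i])")
)

This yields [80, 80, 60] for [0, 80, 160, 220]. You may want to guard against arrays with fewer than two elements before applying it.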