Median of an array column in Spark or pandas, all rows simultaneously

Strangely enough, I can't find anywhere on the internet whether this is possible.

I have a DataFrame with an array column.

arr_col
[1,3,4]
[4,3,5]

I want this result:

Result
3
4

I want the median for each row.

I managed to do it with a pandas UDF, but it iterates over the column and applies np.median to each row.

I don't want that because it's slow and processes one row at a time. I want it to act on all rows at the same time.

Either in pandas or PySpark.

CodePudding user response：

Use NumPy (this requires every array to have the same length):

import numpy as np
df['Result'] = np.median(np.vstack(df['arr_col']), axis=1)

Or explode and groupby.median:

df['Result'] = (df['arr_col'].explode()
                 .astype(float)  # explode() yields object dtype; cast to numeric first
                 .groupby(level=0).median()
                )

Output:

     arr_col  Result
0  [1, 3, 4]     3.0
1  [4, 3, 5]     4.0

Used input:

import pandas as pd
df = pd.DataFrame({'arr_col': [[1,3,4], [4,3,5]]})
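
For reference, here is a self-contained sketch of both approaches from this answer, run side by side. The column names `median_np` and `median_explode` and the `ragged` series are made up for illustration. The point it tries to show is the trade-off: stacking into a 2-D matrix needs equal-length arrays, while explode plus groupby also copes with arrays of different lengths.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"arr_col": [[1, 3, 4], [4, 3, 5]]})

# Approach 1: stack the lists into an (n_rows, n_items) matrix and take
# the median along each row. Only valid when all arrays have equal length.
matrix = np.vstack(df["arr_col"].tolist())
df["median_np"] = np.median(matrix, axis=1)

# Approach 2: one row per element, then group by the original row index.
df["median_explode"] = (
    df["arr_col"].explode()
    .astype(float)          # explode() yields object dtype
    .groupby(level=0)
    .median()
)

# Approach 2 also handles arrays of different lengths (illustrative data).
ragged = pd.Series([[1, 2], [5, 1, 3]])
ragged_median = ragged.explode().astype(float).groupby(level=0).median()

print(df)
print(ragged_median.tolist())
```

Both calls operate on the whole column at once rather than calling np.median once per row.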

CodePudding user response：

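You can use a UDF in PySpark (it returns a double, since the median of an even-length array need not be an integer):

import numpy as np
from pyspark.sql import functions as F, types as T
m = F.udf(lambda x: float(np.median(x)), T.DoubleType())
df.withColumn('Result', m(F.col('arr_col'))).show()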

+---+---------+------+
| Id|  arr_col|Result|
+---+---------+------+
|  1|[1, 3, 4]|   3.0|
|  1|[4, 3, 6]|   4.0|
+---+---------+------+