Strangely enough, I can't find anywhere on the internet whether this is possible.
I have a dataframe with an array column.
arr_col
[1,3,4]
[4,3,5]
I want this result:
Result
3
4
I want the median of each row's array.
I managed to do it with a pandas udf, but it iterates over the column and applies np.median to each row.
I don't want that because it's slow and works one row at a time. I want it to act on all rows at the same time.
Either pandas or pyspark is fine.
CodePudding user response:
Use numpy (this needs all arrays to have the same length):
import numpy as np
df['Result'] = np.median(np.vstack(df['arr_col']), axis=1)
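As a self-contained sketch of the stacking idea above: the helper name row_medians is made up for illustration, and it assumes every row holds an array of the same length, since np.vstack cannot build a 2-D array otherwise.

```python
import numpy as np
import pandas as pd


def row_medians(arrays: pd.Series) -> np.ndarray:
    """Median of each row's array (hypothetical helper; assumes equal lengths)."""
    # Stack the per-row lists into one 2-D array of shape (n_rows, n_items).
    stacked = np.vstack(arrays.to_list())
    # One vectorized call computes the median along each row.
    return np.median(stacked, axis=1)


df = pd.DataFrame({"arr_col": [[1, 3, 4], [4, 3, 5]]})
df["Result"] = row_medians(df["arr_col"])
print(df)
```

If the arrays differ in length, np.vstack raises a ValueError, so this version only fits rectangular data.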
Or use explode and groupby.median:
df['Result'] = (df['arr_col'].explode()
                .groupby(level=0).median()
                )
Output:
     arr_col  Result
0  [1, 3, 4]     3.0
1  [4, 3, 5]     4.0
Used input:
import pandas as pd
df = pd.DataFrame({'arr_col': [[1,3,4], [4,3,5]]})
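The explode route also copes with rows of different lengths, which stacking cannot. A minimal sketch with made-up sample data; the cast to float is a defensive assumption on my part, because explode returns an object-dtype column:

```python
import pandas as pd

# Rows of different lengths, which cannot be stacked into one 2-D array.
df = pd.DataFrame({"arr_col": [[1, 3, 4], [4, 3, 5, 6], [7]]})

# explode() gives one row per element and repeats the original index,
# so grouping on index level 0 reassembles each original row.
flat = df["arr_col"].explode().astype(float)  # explode() yields object dtype
df["Result"] = flat.groupby(level=0).median()
print(df)
```

The result is aligned back to df by index, so this works even when the index is not 0..n-1, as long as the index values are unique.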
CodePudding user response:
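You can use a udf in pyspark (udf and col come from pyspark.sql.functions, DoubleType from pyspark.sql.types, and numpy is imported as np). Returning a float keeps the .5 medians of even-length arrays instead of truncating them:
m = udf(lambda x: float(np.median(x)), DoubleType())
df.withColumn('Result', m(col('arr_col'))).show()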
+---+---------+------+
| Id|  arr_col|Result|
+---+---------+------+
|  1|[1, 3, 4]|   3.0|
|  1|[4, 3, 6]|   4.0|
+---+---------+------+