I have a data frame that looks like this:
+----+----+------------------+
|user|item|      ls_rec_items|
+----+----+------------------+
| 321|   3|[4, 3, 2, 6, 1, 5]|
| 123|   2|[5, 6, 3, 1, 2, 4]|
| 123|   7|[5, 6, 3, 1, 2, 4]|
+----+----+------------------+
I want to know at which position the "item" value appears in the "ls_rec_items" array.
I know the function array_position, but I don't know how to pass the "item" column as the value argument.
I know this:
df.select(F.array_position(df.ls_rec_items, 3)).collect()
But I want this:
df.select(F.array_position(df.ls_rec_items, df.item)).collect()
The output should look like this:
+----+----+------------------+---+
|user|item|      ls_rec_items|pos|
+----+----+------------------+---+
| 321|   3|[4, 3, 2, 6, 1, 5]|  2|
| 123|   2|[5, 6, 3, 1, 2, 4]|  5|
| 123|   7|[5, 6, 3, 1, 2, 4]|  0|
+----+----+------------------+---+
CodePudding user response:
You could use expr with array_position, like this:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    data = [
        {"user": 321, "item": 3, "ls_rec_items": [4, 3, 2, 6, 1, 5]},
        {"user": 123, "item": 2, "ls_rec_items": [5, 6, 3, 1, 2, 4]},
        {"user": 123, "item": 7, "ls_rec_items": [5, 6, 3, 1, 2, 4]},
    ]
    df = spark.createDataFrame(data)
    # expr lets array_position reference another column of the same row
    df = df.withColumn("pos", F.expr("array_position(ls_rec_items, item)"))
    df.show()
Result:

+----+------------------+----+---+
|item|      ls_rec_items|user|pos|
+----+------------------+----+---+
|   3|[4, 3, 2, 6, 1, 5]| 321|  2|
|   2|[5, 6, 3, 1, 2, 4]| 123|  5|
|   7|[5, 6, 3, 1, 2, 4]| 123|  0|
+----+------------------+----+---+
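Note that array_position uses 1-based indexing and returns 0 when the value is absent, which is why the (123, 7) row gets pos 0. A minimal pure-Python sketch of those semantics (the helper name array_position_py is mine, for illustration only):

```python
def array_position_py(arr, value):
    """Mimic Spark SQL's array_position: 1-based index of the first
    occurrence of value in arr, or 0 if value is not present."""
    for i, x in enumerate(arr, start=1):
        if x == value:
            return i
    return 0

rows = [
    (321, 3, [4, 3, 2, 6, 1, 5]),
    (123, 2, [5, 6, 3, 1, 2, 4]),
    (123, 7, [5, 6, 3, 1, 2, 4]),
]
for user, item, ls in rows:
    print(user, item, array_position_py(ls, item))
# 321 3 2
# 123 2 5
# 123 7 0
```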