I have a spark dataframe as below:
----------- -------------------- -----
| Index| lst|value|
----------- -------------------- -----
| 1|[3,5,6,7] | 1|
| 2|[2,6,8,1,2,3,4,5,7] | 5|
| 3|[5,6,7,5,4,3,2,3,1] | 2|
| 4|[8,7,6,4,3,2,3] | 6|
----------- -------------------- -----
I need to get the following:
----------- -------------------- ----- -----------
| Index| lst|value|index_value|
----------- -------------------- ----- -----------
| 1|[3,5,6,7] | 1| 5|
| 2|[2,6,8,1,2,3,4,5,7] | 5| 3|
| 3|[5,6,7,5,4,3,2,3,1] | 2| 7|
| 4|[8,7,6,4,3,2,3] | 6| 3|
----------- -------------------- ----- -----------
I have used a udf to try and get the values, but was not able to successfully implement this. I know it is a very basic question, but I was able to do the required using pandas, but need to get the task done using pyspark. This example is a sample of the data that I have.
CodePudding user response:
If you are essentially trying to get the value from lst
based on the index value specified in value
, you can achieve this by getItem which is pyspark native to indexing ArrayType
Data Preparation
s = StringIO("""
Index|lst|value
1|3,5,6,7|1
2|2,6,8,1,2,3,4,5,7|5
3|5,6,7,5,4,3,2,3,1|2
4|8,7,6,4,3,2,3|6
""")
df = pd.read_csv(s,delimiter='|')
sparkDF = sql.createDataFrame(df)
sparkDF = sparkDF.withColumn("lst", F.split(F.col("lst"), ",").cast("array<int>"))
sparkDF.show()
----- -------------------- -----
|Index| lst|value|
----- -------------------- -----
| 1| [3, 5, 6, 7]| 1|
| 2|[2, 6, 8, 1, 2, 3...| 5|
| 3|[5, 6, 7, 5, 4, 3...| 2|
| 4|[8, 7, 6, 4, 3, 2...| 6|
----- -------------------- -----
sparkDF.printSchema()
root
|-- Index: long (nullable = true)
|-- lst: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- value: long (nullable = true)
GetItem
sparkDF = sparkDF.withColumn("index_value", F.col("lst").getItem(F.col('value')))
sparkDF.show(truncate=False)
----- --------------------------- ----- -----------
|Index|lst |value|index_value|
----- --------------------------- ----- -----------
|1 |[3, 5, 6, 7] |1 |5 |
|2 |[2, 6, 8, 1, 2, 3, 4, 5, 7]|5 |3 |
|3 |[5, 6, 7, 5, 4, 3, 2, 3, 1]|2 |7 |
|4 |[8, 7, 6, 4, 3, 2, 3] |6 |3 |
----- --------------------------- ----- -----------