Home > OS >  I am new to pyspark. how do i do the following using pyspark?
I am new to pyspark. how do i do the following using pyspark?

Time:05-11

I have a spark dataframe as below:

 ----------- -------------------- ----- 
|      Index|                 lst|value|
 ----------- -------------------- ----- 
|          1|[3,5,6,7]           |    1|
|          2|[2,6,8,1,2,3,4,5,7] |    5|
|          3|[5,6,7,5,4,3,2,3,1] |    2|
|          4|[8,7,6,4,3,2,3]     |    6|
 ----------- -------------------- -----  

I need to get the following:

 ----------- -------------------- ----- ----------- 
|      Index|                 lst|value|index_value|
 ----------- -------------------- ----- ----------- 
|          1|[3,5,6,7]           |    1|          5|
|          2|[2,6,8,1,2,3,4,5,7] |    5|          3|
|          3|[5,6,7,5,4,3,2,3,1] |    2|          7|
|          4|[8,7,6,4,3,2,3]     |    6|          3|
 ----------- -------------------- ----- ----------- 

I have used a udf to try and get the values, but was not able to successfully implement this. I know it is a very basic question, but I was able to do the required using pandas, but need to get the task done using pyspark. This example is a sample of the data that I have.

CodePudding user response:

If you are essentially trying to get the value from lst based on the index value specified in value , you can achieve this by getItem which is pyspark native to indexing ArrayType

Data Preparation

s = StringIO("""
Index|lst|value
1|3,5,6,7|1
2|2,6,8,1,2,3,4,5,7|5
3|5,6,7,5,4,3,2,3,1|2
4|8,7,6,4,3,2,3|6
""")

df = pd.read_csv(s,delimiter='|')

sparkDF = sql.createDataFrame(df)

sparkDF = sparkDF.withColumn("lst", F.split(F.col("lst"), ",").cast("array<int>"))

sparkDF.show()

 ----- -------------------- ----- 
|Index|                 lst|value|
 ----- -------------------- ----- 
|    1|        [3, 5, 6, 7]|    1|
|    2|[2, 6, 8, 1, 2, 3...|    5|
|    3|[5, 6, 7, 5, 4, 3...|    2|
|    4|[8, 7, 6, 4, 3, 2...|    6|
 ----- -------------------- ----- 

sparkDF.printSchema()

root
 |-- Index: long (nullable = true)
 |-- lst: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- value: long (nullable = true)

GetItem

sparkDF = sparkDF.withColumn("index_value", F.col("lst").getItem(F.col('value')))

sparkDF.show(truncate=False)

 ----- --------------------------- ----- ----------- 
|Index|lst                        |value|index_value|
 ----- --------------------------- ----- ----------- 
|1    |[3, 5, 6, 7]               |1    |5          |
|2    |[2, 6, 8, 1, 2, 3, 4, 5, 7]|5    |3          |
|3    |[5, 6, 7, 5, 4, 3, 2, 3, 1]|2    |7          |
|4    |[8, 7, 6, 4, 3, 2, 3]      |6    |3          |
 ----- --------------------------- ----- ----------- 
  • Related