How to extract string type dictionary using python pyspark?-CodePudding

This is my dataframe

Score   Features
74.5    {description={termFrequency=4.0, similarityScore=37.8539953, uniqueTokenMatches=4.0}, 
         code={termFrequency=4.0, similarityScore=36.7476063, uniqueTokenMatches=4.0}}
77.64   {description={termFrequency=3.0, similarityScore=36.080687, uniqueTokenMatches=3.0}, 
         code={termFrequency=3.0, similarityScore=34.2332495, uniqueTokenMatches=3.0}}

In features column I want to extract only description dictionary no need to extract code dictionary but the type of the features column is string and I don’t want to use substr() for extract that. How I can do this by using python pyspark.

I want output dataframe like

Score   termFrequency       similarityScore     uniqueTokenMatches
74.5    4.0                 37.8539953          4.0
77.64   3.0                 36.080687           3.0

CodePudding user response：

May be this can be further optimised , however the general idea of this answer is extract the part of the string representation of the dictionary you need, then split by the delimiters and do some cleaning to create an array of structs, explode and pivot them to create new columns.

Imports:

from pyspark.sql import functions as F

Code:

out = (df.withColumn("Features",
              F.split(
              F.regexp_replace(
              F.regexp_extract("Features","(?:\{description=)(\{. }),",1)
              ,"\{|\}|\s ","")
              ,",")
             )
.withColumn("Features",F.expr("""transform(
           transform(Features,x-> split(x,'='))
           ,y->struct(y[0],cast(y[1] as float)))"""))
.selectExpr("Score","inline(Features)")
.groupBy("Score").pivot("col1").agg({"col2":'first'})
)

out.show()

 ----- --------------- ------------- ------------------ 
|Score|similarityScore|termFrequency|uniqueTokenMatches|
 ----- --------------- ------------- ------------------ 
| 74.5|      37.853996|          4.0|               4.0|
|77.64|       36.08069|          3.0|               3.0|
 ----- --------------- ------------- ------------------