This is my dataframe
Score Features
74.5 {description={termFrequency=4.0, similarityScore=37.8539953, uniqueTokenMatches=4.0},
code={termFrequency=4.0, similarityScore=36.7476063, uniqueTokenMatches=4.0}}
77.64 {description={termFrequency=3.0, similarityScore=36.080687, uniqueTokenMatches=3.0},
code={termFrequency=3.0, similarityScore=34.2332495, uniqueTokenMatches=3.0}}
In features column I want to extract only description dictionary no need to extract code dictionary but the type of the features column is string and I don’t want to use substr() for extract that. How I can do this by using python pyspark.
I want output dataframe like
Score termFrequency similarityScore uniqueTokenMatches
74.5 4.0 37.8539953 4.0
77.64 3.0 36.080687 3.0
CodePudding user response:
May be this can be further optimised , however the general idea of this answer is extract the part of the string representation of the dictionary you need, then split by the delimiters and do some cleaning to create an array of structs, explode and pivot them to create new columns.
Imports:
from pyspark.sql import functions as F
Code:
out = (df.withColumn("Features",
F.split(
F.regexp_replace(
F.regexp_extract("Features","(?:\{description=)(\{. }),",1)
,"\{|\}|\s ","")
,",")
)
.withColumn("Features",F.expr("""transform(
transform(Features,x-> split(x,'='))
,y->struct(y[0],cast(y[1] as float)))"""))
.selectExpr("Score","inline(Features)")
.groupBy("Score").pivot("col1").agg({"col2":'first'})
)
out.show()
----- --------------- ------------- ------------------
|Score|similarityScore|termFrequency|uniqueTokenMatches|
----- --------------- ------------- ------------------
| 74.5| 37.853996| 4.0| 4.0|
|77.64| 36.08069| 3.0| 3.0|
----- --------------- ------------- ------------------