My dataframe returns the below result as String
.
QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":0}], signature={"cbcnt":"number"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'} |
QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":"2021-07-30T00:00:00-04:00"}], signature={"cbcnt":"String"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'}
I just want
"cbcnt":0 <-- Numeric part of this
Expected Output
col
----
0
2021-07-30
Tried:
.withColumn("CbRes",regexp_extract($"Col", """"cbcnt":(\S*\d )""", 1))
Output
col
----
0
"2021-07-30 00:00:00 --<--additional " is coming
CodePudding user response:
Extract via regex:
val value = "QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{\"cbcnt\":0}], signature={\"cbcnt\":\"number\"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'} |"
val regex = """"cbcnt":(\d )""".r.unanchored
val s"${regex(result)}" = value
println(result)
Output:
0
CodePudding user response:
Using the Pyspark function regexp_extract:
from pyspark.sql import functions as F
df = <dataframe with a column "text" that contains the input data">
df.withColumn("col", F.regexp_extract("text", """"cbcnt":(\d )""", 1)).show()