Home > Enterprise >  Extracting Specific Field from String in Scala
Extracting Specific Field from String in Scala

Time:09-17

My dataframe returns the below result as String.

  QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":0}], signature={"cbcnt":"number"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'}   |

  QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":"2021-07-30T00:00:00-04:00"}], signature={"cbcnt":"String"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'}

I just want

"cbcnt":0  <-- Numeric part of this

Expected Output

col
----
0
2021-07-30

Tried:

.withColumn("CbRes",regexp_extract($"Col", """"cbcnt":(\S*\d )""", 1)) 

Output

 col
    ----
    0
    "2021-07-30 00:00:00   --<--additional " is coming

CodePudding user response:

Extract via regex:

val value = "QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{\"cbcnt\":0}], signature={\"cbcnt\":\"number\"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'}   |"
val regex = """"cbcnt":(\d )""".r.unanchored
val s"${regex(result)}" = value

println(result)

Output:

0

CodePudding user response:

Using the Pyspark function regexp_extract:

from pyspark.sql import functions as F

df = <dataframe with a column "text" that contains the input data">
df.withColumn("col", F.regexp_extract("text", """"cbcnt":(\d )""", 1)).show()
  • Related