Extract a JSON string from a text file using PySpark

I have a text file with four fields. The third field is a JSON string, which I want to extract into a separate column of a DataFrame.

pk,line,json,date
DBG,CDL,{"line":"CDL","stn":"DBG","latitude":"12.298915","longitude":"143.846263","isInterchange":true,"isIncidentStn":false,"stnKpis":[{"code":"PCD_PCT","value":0.1,"valueCreatedTs":1667361600000,"confidence":"50.0",}]},20221102

Spark version: 2.4, Python version: 3.6

CodePudding user response：

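You can read the file into a DataFrame with PySpark. The JSON field itself contains commas, so read each line as plain text; the CSV reader would split the JSON across several columns.

df = spark.read.text("/tmp/resources/zipcodes.csv")

Then cut the third field out of a data row and parse it. A Spark DataFrame has no iloc (that is a pandas accessor), and json.loads is strict, so the trailing comma before } in the sample row has to be removed first.

import json
line = df.collect()[1].value  # assumes row 0 is the header line
json_string = json.loads(line.split(",", 2)[2].rsplit(",", 1)[0])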
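Because the JSON field contains commas, one way to keep it intact is to split each line on the first two commas and on the last one only. The sketch below shows that idea in plain Python on the sample row from the question. The helper names split_record and parse_lenient are hypothetical, and the trailing-comma cleanup is an assumption that only holds if no string value contains ",}" or ",]".

```python
import json
import re

# Sample row copied from the question (note the trailing comma before "}").
SAMPLE = (
    'DBG,CDL,{"line":"CDL","stn":"DBG","latitude":"12.298915",'
    '"longitude":"143.846263","isInterchange":true,"isIncidentStn":false,'
    '"stnKpis":[{"code":"PCD_PCT","value":0.1,'
    '"valueCreatedTs":1667361600000,"confidence":"50.0",}]},20221102'
)


def split_record(line):
    """Split 'pk,line,json,date' on the first two commas and the last one."""
    pk, line_code, rest = line.split(",", 2)
    json_text, date = rest.rsplit(",", 1)
    return pk, line_code, json_text, date


def parse_lenient(json_text):
    """Remove a comma that directly precedes } or ], then parse.

    Assumption: no string value inside the JSON contains ',}' or ',]'.
    """
    cleaned = re.sub(r",\s*([}\]])", r"\1", json_text)
    return json.loads(cleaned)


pk, line_code, json_text, date = split_record(SAMPLE)
record = parse_lenient(json_text)
print(pk, line_code, date)
print(record["stnKpis"][0]["code"])
```

With Spark, the same function could be applied per line, for example via spark.read.text(path).rdd.map(lambda r: split_record(r.value)), after filtering out the header line.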

CodePudding user response：

Data

df =spark.createDataFrame([('DBG','CDL',{"line":"CDL","stn":"DBG","latitude":"12.298915","longitude":"143.846263","isInterchange":'true',"isIncidentStn":'false',"stnKpis":[{"code":"PCD_PCT","value":0.1,"valueCreatedTs":1667361600000,"confidence":"50.0",}]},'20221102')],
                         ('pk','line','json','date'))

from pyspark.sql.functions import col

# Every df has an underlying rdd; select the column and map it to an rdd of its values
rdd=df.select(col("json").alias("jsoncol")).rdd.map(lambda x: x.jsoncol)

# Read the rdd as JSON and let Spark infer the schema
spark.read.json(rdd).show()