I have a text file with 4 fields. The 3rd field is a JSON string, which I want to extract into a separate column of a DataFrame.
pk,line,json,date
DBG,CDL,{"line":"CDL","stn":"DBG","latitude":"12.298915","longitude":"143.846263","isInterchange":true,"isIncidentStn":false,"stnKpis":[{"code":"PCD_PCT","value":0.1,"valueCreatedTs":1667361600000,"confidence":"50.0",}]},20221102
Spark version: 2.4, Python version: 3.6
CodePudding user response:
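You can read the file into a DataFrame using PySpark. A plain CSV read would split on the commas inside the JSON field, so read each line as text instead.
import json
df = spark.read.text("/tmp/resources/zipcodes.csv")
Then take a data line, cut out everything between the second comma and the last one, and parse it. Note that df.iloc is pandas and does not exist on a Spark DataFrame, and that json.loads is strict, so the trailing comma in the sample row has to go first.
line = df.collect()[1]["value"]  # row 0 is the header line
json_string = json.loads(line.split(",", 2)[2].rsplit(",", 1)[0].replace(",}", "}"))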
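The per-line splitting can be sketched in plain Python, without Spark. This is only an illustration under the assumption that every line has the shape pk,line,<json>,date and that pk, line and date never contain commas. The helper name parse_line is made up for this example.

```python
import json
import re


def parse_line(text):
    """Split 'pk,line,<json>,date' and decode the JSON part into a dict."""
    # The first two commas and the last comma separate the fields;
    # every comma in between belongs to the JSON payload.
    pk, line, rest = text.split(",", 2)
    json_part, date = rest.rsplit(",", 1)
    # json.loads is strict, so drop trailing commas such as '"50.0",}'.
    # Caveat: this would also alter a string value that contains ',}' or ',]'.
    json_part = re.sub(r",\s*([}\]])", r"\1", json_part)
    return pk, line, json.loads(json_part), date


# The data row from the question, trailing comma included.
sample = ('DBG,CDL,{"line":"CDL","stn":"DBG","latitude":"12.298915",'
          '"longitude":"143.846263","isInterchange":true,"isIncidentStn":false,'
          '"stnKpis":[{"code":"PCD_PCT","value":0.1,'
          '"valueCreatedTs":1667361600000,"confidence":"50.0",}]},20221102')

pk, line, payload, date = parse_line(sample)
print(pk, line, date, payload["stnKpis"][0]["code"])
```

The same function could be applied per line with rdd.map if the file is too large to collect.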
CodePudding user response:
Data
from pyspark.sql.functions import col
# The json field is kept as a string, as in the question's file (trailing comma dropped so it is valid JSON)
df = spark.createDataFrame([('DBG','CDL','{"line":"CDL","stn":"DBG","latitude":"12.298915","longitude":"143.846263","isInterchange":true,"isIncidentStn":false,"stnKpis":[{"code":"PCD_PCT","value":0.1,"valueCreatedTs":1667361600000,"confidence":"50.0"}]}','20221102')],
('pk','line','json','date'))
# Every df has an underlying rdd; select the JSON column and map it to an rdd of strings
rdd = df.select(col("json").alias("jsoncol")).rdd.map(lambda x: x.jsoncol)
# Read the rdd as JSON; Spark infers the schema from the strings
spark.read.json(rdd).show()