pyspark read key value pair


I have a space-delimited CSV file containing key-value pairs, like below:

a=1 b="2021-02-21 00:00:12" c=5

I have to read it with PySpark.

I used the below command:

a = spark.read.option("quote", '"').option("delimiter", " ").csv("/path/abc.csv")

When I execute the above, it creates 4 columns instead of 3, like below:

a=1| b="2021-02-21| 00:00:12"| c=5

Can anyone help me get the whole value of b into a single column?

NOTE: this is just a sample record. Actual records contain multiple key-value pairs, and the ordering of the columns is not the same in every file we receive.

CodePudding user response:

You can use concat_ws to concatenate the two columns, and df.columns[n] to access the columns by index:

from pyspark.sql.functions import concat_ws

# Rejoin the split timestamp with the space the delimiter consumed; other columns by index
df = df.select(df.columns[0], concat_ws(" ", df.columns[1], df.columns[2]).alias("b"), df.columns[3])
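
As an end-to-end sketch (assuming the sample file sits at /path/abc.csv with no header row, so the reader assigns the default _c0 to _c3 column names):

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws

spark = SparkSession.builder.getOrCreate()

# The space delimiter splits the quoted timestamp across two columns:
# _c0='a=1', _c1='b="2021-02-21', _c2='00:00:12"', _c3='c=5'
df = spark.read.option("delimiter", " ").csv("/path/abc.csv")

fixed = df.select(
    df.columns[0],
    concat_ws(" ", df.columns[1], df.columns[2]).alias("b"),
    df.columns[3],
)
fixed.show(truncate=False)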

CodePudding user response:

Since you cannot use a regular expression as the delimiter to separate the columns based on the key-value pairs, you could read the whole line as text and use the F.split function together with some pivot magic:

from pyspark.sql import functions as F

(spark.read.text("test.csv")
 # number the lines so each row can be reassembled after the explode
 .withColumn("line_id", F.monotonically_increasing_id())
 # split on the "key=" markers; the first array element is an empty string
 .withColumn("array", F.split("value", r"( )?[a-z]="))
 .withColumn("keys", F.array([F.lit(x) for x in list("abc")]))
 # pair each key with its value, skipping the leading empty split result
 .select(F.col("line_id"),
         F.arrays_zip(F.slice(F.col("keys"), 1, 3),
                      F.slice(F.col("array"), 2, 4)).alias("value"))
 .select(F.col("line_id"), F.explode("value").alias("exploded"))
 .select(F.col("line_id"),
         F.col("exploded").getItem("0").alias("key"),
         F.col("exploded").getItem("1").alias("val"))
 # back to one row per line, with one column per key
 .groupBy("line_id").pivot("key").agg(F.max("val"))
 .show(10, False))

That should result in something like:

+-------+---+---------------------+---+
|line_id|a  |b                    |c  |
+-------+---+---------------------+---+
|0      |1  |"2021-02-21 00:00:12"|5  |
|1      |3  |"2021-02-21 00:00:52"|4  |
+-------+---+---------------------+---+
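
Since the note in the question says the real files contain varying numbers of key-value pairs in varying order, the hard-coded list("abc") will not hold up in general. A variant that derives the keys from each line instead is sketched below; it assumes Spark >= 3.1 (for regexp_extract_all), word-character keys that each appear at most once per line, and no '=' inside quoted values:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

result = (
    spark.read.text("test.csv")
    .withColumn("line_id", F.monotonically_increasing_id())
    # pull out every key and every (possibly quoted) value
    .withColumn("keys", F.expr(r"regexp_extract_all(value, '(\\w+)=', 1)"))
    .withColumn("vals", F.expr(r"regexp_extract_all(value, '=(\"[^\"]*\"|\\S+)', 1)"))
    # pair them up and go long: one row per key-value pair
    .select("line_id", F.explode(F.map_from_arrays("keys", "vals")).alias("key", "val"))
    # pivot back to one row per line; keys absent from a line become null
    .groupBy("line_id").pivot("key").agg(F.first("val"))
)
result.show(truncate=False)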