I have a space-delimited CSV file containing key-value pairs like below:
a=1 b="2021-02-21 00:00:12" c=5
I have to read it through PySpark. I used the below command:

a = spark.read.option("quote", '"').option("delimiter", " ").csv("/path/abc.csv")
When I execute the above, it creates 4 columns instead of 3, like below:
a=1| b="2021-02-21| 00:00:12"| c=5
Can anyone help me get the whole value of b into a single column?
NOTE: this is just a sample record. Actual records contain multiple key-value pairs, and the ordering of the columns is not the same in every file we receive.
CodePudding user response:
You can use concat_ws to concatenate the two columns and df.columns[n] to access the columns by index:
from pyspark.sql.functions import concat_ws

# without a header the reader names the columns _c0.._c3, so address them all
# by index; rejoin the split date and time with the space that was lost
df = df.select(df.columns[0], concat_ws(' ', df.columns[1], df.columns[2]).alias("b"), df.columns[3])
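If you also want to strip the key= prefixes from the values, a minimal follow-up sketch (assuming keys are runs of lowercase letters, as in the sample) could use regexp_replace:

from pyspark.sql.functions import regexp_replace

# strip the leading 'key=' marker from every column (cleanup sketch)
for c in df.columns:
    df = df.withColumn(c, regexp_replace(c, r"^[a-z]+=", ""))

Note that addressing columns by index assumes the fixed a, b, c layout of the sample line; it will mis-assign values if the key order changes between files.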
CodePudding user response:
Since you cannot use a regular expression as the CSV delimiter to separate the columns based on the key-value pairs, you could read the whole line and use the F.split function together with some F.pivot magic:
from pyspark.sql import functions as F

(spark.read.text("test.csv")
    # give every input line a stable id to pivot on
    .withColumn("line_id", F.monotonically_increasing_id())
    # split the line on the 'x=' key markers, leaving only the values
    .withColumn("array", F.split("value", r"( )?[a-z]="))
    # the keys themselves, hard-coded in the order they appear
    .withColumn("keys", F.array([F.lit(x) for x in list("abc")]))
    # pair each key with its value (the first split element is empty)
    .select(F.col("line_id"),
            F.arrays_zip(F.slice(F.col("keys"), 1, 3),
                         F.slice(F.col("array"), 2, 4)).alias("value"))
    .select(F.col("line_id"), F.explode("value").alias("exploded"))
    .select(F.col("line_id"),
            F.col("exploded").getItem("0").alias("key"),
            F.col("exploded").getItem("1").alias("val"))
    # one row per line again, with one column per key
    .groupBy("line_id").pivot("key").agg(F.max("val"))
    .show(10, False))
That should result in something like:
+-------+---+---------------------+---+
|line_id|a  |b                    |c  |
+-------+---+---------------------+---+
|0      |1  |"2021-02-21 00:00:12"|5  |
|1      |3  |"2021-02-21 00:00:52"|4  |
+-------+---+---------------------+---+
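Since the NOTE in the question says that the keys and their ordering vary between files, here is a variant sketch that extracts the keys from each line instead of hard-coding list("abc"). It assumes Spark >= 3.1 (where regexp_extract_all is available through F.expr), that values are either bare tokens or double-quoted, and that quoted values contain no escaped quotes:

from pyspark.sql import functions as F

(spark.read.text("test.csv")
    .withColumn("line_id", F.monotonically_increasing_id())
    # every key: the lowercase run just before an '='
    .withColumn("keys", F.expr("regexp_extract_all(value, '([a-z]+)=', 1)"))
    # every value: a double-quoted string or a bare token right after an '='
    .withColumn("vals", F.expr("regexp_extract_all(value, '=(\"[^\"]*\"|\\\\S+)', 1)"))
    .select("line_id", F.explode(F.arrays_zip("keys", "vals")).alias("kv"))
    .select("line_id",
            F.col("kv.keys").alias("key"),
            F.col("kv.vals").alias("val"))
    .groupBy("line_id").pivot("key").agg(F.first("val"))
    .show(10, False))

Because pivot is called here without an explicit value list, Spark makes an extra pass over the data to collect the distinct keys; pass the known keys to pivot if you have them.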