Using PySpark, how to convert a plain text file to a CSV file

Time:11-29

I am creating a Hive table from data that looks like this.

Data file:

<__name__>abc
<__code__>1
<__value__>1234
<__name__>abcdef
<__code__>2
<__value__>12345
<__name__>abcdef
<__code__>2
<__value__>12345
1234156321
<__name__>abcdef
<__code__>2
<__value__>12345
...

Can I create the table directly from this file without converting it first? It is a plain text file in which the same three fields repeat.

If not, how can I convert it to a DataFrame or a CSV file?

I want

| name   | code | value |
| abc    | 1    | 1234  |
| abcdef | 2    | 12345 |
...

or

abc,1,1234
abcdef,2,12345
...
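
For a file small enough to process on one machine, the conversion can be sketched without Spark. This is only a minimal sketch under stated assumptions: each record starts at a `<__name__>` line, and an untagged line (such as `1234156321` in the sample) continues the previous field. The helper names `parse_records` and `to_csv` are hypothetical and not part of any library.

```python
import csv
import io
import re

# Matches lines such as "<__name__>abc" and captures the tag and the text.
TAG_LINE = re.compile(r"^<__([A-Za-z0-9]+)__>(.*)$")


def parse_records(lines, first_tag="name"):
    """Group tagged lines into dicts; a new record starts at `first_tag`."""
    records = []
    last_key = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        match = TAG_LINE.match(line)
        if match:
            key, text = match.groups()
            if key == first_tag or not records:
                records.append({})
            records[-1][key] = text
            last_key = key
        elif records and last_key is not None:
            # Assumption: an untagged line continues the previous field.
            records[-1][last_key] += "\n" + line
    return records


def to_csv(records, columns=("name", "code", "value")):
    """Render the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for record in records:
        writer.writerow([record.get(col, "") for col in columns])
    return buffer.getvalue()


sample = [
    "<__name__>abc",
    "<__code__>1",
    "<__value__>1234",
    "<__name__>abcdef",
    "<__code__>2",
    "<__value__>12345",
]
print(to_csv(parse_records(sample)))
```

Starting a record at the `name` tag rather than at every third line keeps records aligned when a value spans more than one line. The `csv` module quotes any field that contains a newline.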

CodePudding user response:

I solved my problem like this.

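from pyspark.sql import Row
from pyspark.sql.functions import first, regexp_replace, split
from pyspark.sql.types import IntegerType, StringType, StructType

# `spark` is an existing SparkSession; `path` points to the plain text file.
data = spark.read.text(path)

# Number the lines, then give every group of three lines the same record_pos.
rows = data.rdd.zipWithIndex().map(lambda x: Row(x[0].value, x[1] // 3))

schema = StructType() \
      .add("col1", StringType(), False) \
      .add("record_pos", IntegerType(), False)

df = spark.createDataFrame(rows, schema)

# Split "<__key__>value" into a key column and a value column.
df1 = df.withColumn("key", regexp_replace(split(df["col1"], '__>')[0], '<|__', '')) \
        .withColumn("value", regexp_replace(regexp_replace(split(df["col1"], '__>')[1], '\n', '<NL>'), '\t', '<TAB>'))

# One row per record_pos, one column per key.
dataframe = df1.groupBy("record_pos").pivot("key").agg(first("value")).drop("record_pos")

dataframe.show()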
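
The grouping step above can be illustrated in plain Python. This is a rough sketch of the same idea, not the Spark implementation: `enumerate` stands in for `zipWithIndex`, integer division by 3 produces `record_pos`, and a dict per record stands in for the pivot. The function name `pivot_by_position` is hypothetical.

```python
def pivot_by_position(lines, size=3):
    """Group every `size` consecutive lines into one record keyed by tag."""
    records = {}
    for index, line in enumerate(lines):
        record_pos = index // size          # lines 0-2 -> 0, lines 3-5 -> 1, ...
        tag, _, value = line.partition("__>")
        key = tag.replace("<", "").replace("__", "")
        # setdefault keeps the first value seen for a key, like first("value").
        records.setdefault(record_pos, {}).setdefault(key, value)
    return [records[pos] for pos in sorted(records)]


sample = [
    "<__name__>abc",
    "<__code__>1",
    "<__value__>1234",
    "<__name__>abcdef",
    "<__code__>2",
    "<__value__>12345",
]
print(pivot_by_position(sample))
```

Because the record number depends only on line position, this approach assumes exactly three lines per record. An extra untagged line, such as `1234156321` in the sample data, shifts every record after it.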

CodePudding user response:

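import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{first, regexp_replace, split}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// `sc` and `spark` are the SparkContext and SparkSession provided by spark-shell.
val path = "file:///C:/stackqustions/data/stackq5.csv"
val data = sc.textFile(path)

import spark.implicits._

// Number the lines, then give every group of three lines the same record_pos.
val rdd = data.zipWithIndex.map {
    case (records, index) => Row(records, index / 3)
}

val schema = new StructType().add("col1", StringType, false).add("record_pos", LongType, false)
val df = spark.createDataFrame(rdd, schema)
// Split "<__key__>value" into a key column and a value column.
val df1 = df
    .withColumn("key", regexp_replace(split($"col1", ">")(0), "<|__", ""))
    .withColumn("value", split($"col1", ">")(1)).drop("col1")

// One row per record_pos, one column per key.
df1.groupBy("record_pos").pivot("key").agg(first($"value")).drop("record_pos").show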

result:

+----+------+-----+
|code|  name|value|
+----+------+-----+
|   1|   abc| 1234|
|   2|abcdef|12345|
|   2|abcdef|12345|
|   2|abcdef|12345|
+----+------+-----+
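
One detail of the Scala answer is that it splits on every ">" and keeps element 1, so a value that itself contains ">" would be cut short. The plain-Python sketch below illustrates the difference; the input `<__value__>a>b` is a made-up example and does not come from the sample data.

```python
line = "<__value__>a>b"  # hypothetical value that contains ">"

# Mirrors split(col, ">")(1): keeps only the text between the first two ">".
naive = line.split(">")[1]

# Splitting once, on the first ">", keeps the rest of the line intact.
tag, _, rest = line.partition(">")

print(naive)
print(rest)
```

In Spark 3.0 and later, `split` accepts an optional limit argument, which should allow the same one-time split inside the DataFrame code.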