I'm reading in data (as shown below) into a list of lists, and I want to convert it into a dataframe with seven columns. The error I get is: requirement failed: number of columns doesn't match. Old column names (1): value, new column names (7): <list of columns>
What am I doing incorrectly and how can I fix it?
Data:
Column1, Column2, Column3, Column4, Column5, Column6, Column7
a,b,c,d,e,f,g
a2,b2,c2,d2,e2,f2,g2
Code:
val spark = SparkSession.builder.appName("er").master("local").getOrCreate()
import spark.implicits._
val erResponse = response.body.toString.split("\\\n")
val header = erResponse(0)
val body = erResponse.drop(1).map(x => x.split(",").toList).toList
val erDf = body.toDF()
erDf.show()
CodePudding user response:
You get this "number of columns doesn't match" error because your erDf dataframe contains only one column, which holds an array:
+----------------------------+
|value                       |
+----------------------------+
|[a, b, c, d, e, f, g]       |
|[a2, b2, c2, d2, e2, f2, g2]|
+----------------------------+
You can't rename this single column to the seven column names contained in your header.
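One quick way to confirm this, as a minimal sketch: inspect the dataframe that toDF() produces before renaming anything. The hard-coded rows below are a stand-in for the parsed response body (an assumption for illustration), and the session settings are copied from the question.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("er").master("local").getOrCreate()
import spark.implicits._

// Hard-coded stand-in for the parsed response body (assumption for illustration)
val body = List(
  List("a", "b", "c", "d", "e", "f", "g"),
  List("a2", "b2", "c2", "d2", "e2", "f2", "g2")
)

val raw = body.toDF()
raw.printSchema()             // should list a single array-typed column named "value"
println(raw.columns.length)   // 1: each inner list became one array cell, not seven columns
```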
The solution here is, given this erDf dataframe, to iterate over your list of header columns and build the columns one by one from the array elements. Your complete code thus becomes:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("er").master("local").getOrCreate()
import spark.implicits._
val erResponse = response.body.toString.split("\\\n")
val header = erResponse(0).split(", ") // build header columns list
val body = erResponse.drop(1).map(x => x.split(",").toList).toList
// add one column per header name, taken from the matching array index, then drop the array column
val erDf = header
  .zipWithIndex
  .foldLeft(body.toDF())((acc, elem) => acc.withColumn(elem._1, col("value")(elem._2)))
  .drop("value")
erDf.show()
That will give you the following erDf dataframe:
+-------+-------+-------+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|Column5|Column6|Column7|
+-------+-------+-------+-------+-------+-------+-------+
|      a|      b|      c|      d|      e|      f|      g|
|     a2|     b2|     c2|     d2|     e2|     f2|     g2|
+-------+-------+-------+-------+-------+-------+-------+
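As a possible variation on the same idea, the columns can also be built in a single select instead of a foldLeft over withColumn. This is only a sketch: the hard-coded rawText string stands in for response.body.toString, and the names rawText, lines and columns are made up for illustration. The header is split on "," and trimmed so that it tolerates headers with or without a space after the comma.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("er").master("local").getOrCreate()
import spark.implicits._

// Hard-coded stand-in for response.body.toString (assumption for illustration)
val rawText = "Column1, Column2, Column3, Column4, Column5, Column6, Column7\na,b,c,d,e,f,g\na2,b2,c2,d2,e2,f2,g2"

val lines = rawText.split("\n")
val header = lines(0).split(",").map(_.trim)              // header names without surrounding spaces
val body = lines.drop(1).map(_.split(",").toList).toList  // one List[String] per data row

// One Column expression per header name: element i of "value", aliased to that name
val columns = header.zipWithIndex.map { case (name, i) => col("value")(i).as(name) }

val erDf = body.toDF().select(columns: _*)
erDf.show()
```

Selecting all the expressions at once produces one projection rather than one per column, which tends to matter more as the header grows wider. Rows with fewer values than the header should yield null in the missing positions under Spark's default (non-ANSI) settings.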