Home > Software engineering >  Column names appearing as record data in Pyspark databricks
Column names appearing as record data in Pyspark databricks

Time:10-21

I'm working on Pyspark python. I downloaded a sample csv file from Kaggle (Covid Live.csv) and the data from the table is as follows when opened in visual code (Raw CSV data only partial data)

enter image description here

#,"Country,
Other","Total
Cases","Total
Deaths","New
Deaths","Total
Recovered","Active
Cases","Serious,
Critical","Tot Cases/
1M pop","Deaths/
1M pop","Total
Tests","Tests/
1M pop",Population
1,USA,"98,166,904","1,084,282",,"94,962,112","2,120,510","2,970","293,206","3,239","1,118,158,870","3,339,729","334,805,269"
2,India,"44,587,307","528,629",,"44,019,095","39,583",698,"31,698",376,"894,416,853","635,857","1,406,631,776"........

The problem i'm facing here, the column names are also being displayed as records in pyspark databricks console when executed with below code

from pyspark.sql.types import *

df1 = spark.read.format("csv") \
    .option("inferschema", "true") \
    .option("header", "true") \
    .load("dbfs:/FileStore/shared_uploads/[email protected]/Covid_Live.csv") \
    .select("*")
Spark Jobs --> 
df1:pyspark.sql.dataframe.DataFrame
#:string
Country,:string

As can be observed above , spark is detecting only two columns # and Country but not aware that 'Total Cases', 'Total Deaths' . . are also columns

enter image description here

How do i tackle this malformation ?

CodePudding user response:

Few ways to go about this.

  1. Fix the header in the csv before reading (should be on a single line). Also pay attention to quoting and escape settings.
  2. Read in PySpark with manually provided schema and filter out the bad lines.
  3. Read using pandas, skip the first 12 lines. Add proper column names, convert to PySpark dataframe.

CodePudding user response:

So , the solution is pretty simple and does not require you to 'edit' the data manually or anything of those sorts. I just had to add .option("multiLine","true") \ and the data is displaying as desired!

  • Related