How does PySpark decide the data type of a column automatically when inferSchema is set to true?

Time:06-14

According to documentation,

inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default

Alright, I understand that Spark will read the CSV to determine each column's data type and assign it accordingly.

I am curious to know what is happening in the background.

  1. Does Spark scan the whole CSV?
  2. If it scans only a sample of the data, how many rows does it scan?
  3. How does Spark conclude that a given column is of a particular data type, and assign it, when inferSchema = true?

Can someone help me understand this better or share some links?

Thank you.

CodePudding user response:

Answering some of your questions:

  1. By default, yes, but newer versions introduced a `samplingRatio` option that lets you define the fraction of rows to scan when inferring the schema (the default is 1.0, i.e. all rows).
  2. By default all rows, which is why the documentation says it needs one extra pass over the data.
  3. It tries to parse each value as integer, long, double, and boolean in turn, and finally falls back to string (or raises an exception) if parsing fails, then combines the per-value results into the final schema; an early version of this logic is in Spark's `CSVInferSchema` class.
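To make the two mechanisms above concrete, here is a toy pure-Python sketch of both the sampling step (mirroring the `samplingRatio` option) and the per-value parse cascade. This is an illustration under simplified assumptions, not Spark's actual implementation — the real code also handles longs, decimals, timestamps, and nulls, and its type-widening rules are more involved:

```python
import random

def sample_rows(rows, sampling_ratio=1.0, seed=42):
    """Toy model of the samplingRatio option: keep each row with
    probability sampling_ratio; the default 1.0 scans every row."""
    if sampling_ratio >= 1.0:
        return list(rows)  # default: one full extra pass over the data
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < sampling_ratio]

def infer_type(value: str) -> str:
    """Try narrower types first; fall back to string when nothing parses."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        pass
    if value.lower() in ("true", "false"):
        return "boolean"
    return "string"

def infer_column(values):
    """A column gets the widest type observed among the sampled values."""
    order = ["int", "double", "boolean", "string"]
    return max((infer_type(v) for v in values), key=order.index)

rows = [["1", "2.5", "true", "a"],
        ["2", "3.0", "false", "b"]]
sampled = sample_rows(rows)  # ratio 1.0 -> both rows survive
schema = [infer_column(col) for col in zip(*sampled)]
print(schema)  # ['int', 'double', 'boolean', 'string']
```

Note how a column mixing "1" and "2.5" widens to double, and any unparseable value pushes the column all the way to string — the same fallback behaviour described in point 3.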