According to the documentation:
inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default
Alright, I understand that Spark will read the CSV to determine each column's data type and assign it accordingly.
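For context, this is roughly the read call in question (a minimal Scala sketch; the file path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("infer-schema-demo")
  .master("local[*]")
  .getOrCreate()

// Without inferSchema, every column comes back as string;
// with it, Spark makes an extra pass over the data to guess the types.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/data.csv") // hypothetical path

df.printSchema()
```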
I am curious to know what happens in the background:
- Does Spark scan the whole CSV?
- If it scans only a sample of the data, how many rows does it scan?
- How does Spark conclude that a given column is of a particular data type and assign it when inferSchema = true?
Can someone help me understand this better or share some links?
Thank you.
CodePudding user response:
Answering some of your questions:
- By default, yes, but a samplingRatio option was introduced in newer versions that lets you define the fraction of rows to be scanned for schema inference (the default is 1.0, i.e. all rows); see the sketch after this list.
- By default, all rows, which is why the documentation mentions that it needs one extra pass over the data.
- It tries to parse each value as an integer, then a long, a double, and a boolean, and finally falls back to string (or throws an exception if parsing fails outright), then merges the per-value results into the final schema; a simplified sketch follows below. You can read an early version of the code here.
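For the first point, here is a minimal sketch using the samplingRatio read option (the path is a placeholder):

```scala
// Infer the schema from roughly 10% of the rows instead of all of them.
// Cheaper on large files, but a rare value deep in the file may be missed,
// so the inferred type can come out too narrow.
val sampled = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("samplingRatio", "0.1")
  .csv("/path/to/data.csv")
```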
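And for the last point, a simplified illustration of the fallback chain. This is not Spark's actual implementation (the real logic lives in CSVInferSchema and also handles decimals, timestamps, and merging types across rows), just a sketch of the idea:

```scala
import scala.util.Try

// Try the narrowest type first and widen until something parses;
// string is the catch-all at the end. Per column, Spark keeps the
// widest type required by any of the column's values.
def inferFieldType(value: String): String =
  if (Try(value.toInt).isSuccess) "integer"
  else if (Try(value.toLong).isSuccess) "long"
  else if (Try(value.toDouble).isSuccess) "double"
  else if (Try(value.toBoolean).isSuccess) "boolean"
  else "string"

println(inferFieldType("42"))          // integer
println(inferFieldType("9876543210")) // long (too big for Int)
println(inferFieldType("3.14"))       // double
println(inferFieldType("true"))       // boolean
println(inferFieldType("hello"))      // string
```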