I have data from a CSV file and use it in a Jupyter notebook with PySpark. The file has many columns, and all of them are read as strings. I know how to change the data types manually, but is there a way to do it automatically?
CodePudding user response:
You can use the inferSchema option when you load your CSV file to let Spark try to infer the schema. With the following example CSV file, you get two different schemas depending on whether inferSchema is set to true or not:
seq,date
1,13/10/1942
2,12/02/2013
3,01/02/1959
4,06/04/1939
5,23/10/2053
6,13/03/2059
7,10/12/1983
8,28/10/1952
9,07/04/2033
10,29/11/2035
Example code:
# path is the location of the CSV file shown above
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "false")  # default option
      .load(path))
df.printSchema()

df2 = (spark.read
       .format("csv")
       .option("header", "true")
       .option("inferSchema", "true")
       .load(path))
df2.printSchema()
Output:
root
|-- seq: string (nullable = true)
|-- date: string (nullable = true)
root
|-- seq: integer (nullable = true)
|-- date: string (nullable = true)
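Note that the date column stays a string even with inferSchema enabled, presumably because values such as 13/10/1942 do not match a format the reader recognizes by default. One way to handle this for many columns at once is to generate the cast expressions from a column-to-type mapping and pass them to selectExpr. The following is only a sketch: the helper name build_cast_exprs and the mapping are hypothetical, and the dd/MM/yyyy pattern is an assumption based on the sample data.

```python
# Hypothetical helper: build Spark SQL expressions that convert string columns.
# Column names, target types and the date pattern are assumptions taken from
# the sample CSV above.

def build_cast_exprs(type_map, date_formats=None):
    """Return one SQL expression per column, in the order given by type_map.

    type_map     -- dict of column name -> Spark SQL type name (e.g. "int")
    date_formats -- optional dict of column name -> date pattern; columns
                    listed here are parsed with to_date instead of CAST
    """
    date_formats = date_formats or {}
    exprs = []
    for name, sql_type in type_map.items():
        if name in date_formats:
            # to_date(expr, fmt) parses a string using an explicit pattern
            exprs.append(f"to_date(`{name}`, '{date_formats[name]}') AS `{name}`")
        else:
            exprs.append(f"CAST(`{name}` AS {sql_type}) AS `{name}`")
    return exprs


exprs = build_cast_exprs(
    {"seq": "int", "date": "date"},
    date_formats={"date": "dd/MM/yyyy"},
)
print(exprs)

# With a DataFrame df whose columns were all loaded as strings, the
# expressions would be applied like this (needs a running Spark session):
#   typed_df = df.selectExpr(*exprs)
#   typed_df.printSchema()
```

Keeping the mapping in one dict means a new column only needs one new entry, rather than another withColumn call.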
CodePudding user response:
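Alternatively, you can define the schema explicitly instead of relying on inference. The example below builds a DataFrame from in-memory data with createDataFrame; when reading a file, the same schema object can be passed to the reader with spark.read.schema(schema):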
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Sample rows: (firstname, middlename, lastname, id, gender, salary)
data2 = [
    ("James", "", "Smith", "36636", "M", 3000),
    ("Michael", "Rose", "", "40288", "M", 4000),
    ("Robert", "", "Williams", "42114", "M", 4000),
    ("Maria", "Anne", "Jones", "39192", "F", 4000),
    ("Jen", "Mary", "Brown", "", "F", -1),
]

# Explicit schema: every column is a string except salary
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
])

df = spark.createDataFrame(data=data2, schema=schema)
df.show()
df.printSchema()
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|   id|gender|salary|
+---------+----------+--------+-----+------+------+
|    James|          |   Smith|36636|     M|  3000|
|  Michael|      Rose|        |40288|     M|  4000|
|   Robert|          |Williams|42114|     M|  4000|
|    Maria|      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+
root
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- id: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: integer (nullable = true)
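For a CSV file with many columns, writing a StructField per column gets verbose. The reader's schema method also accepts a DDL-formatted string, so the schema can be generated from a plain mapping. This is a rough sketch rather than a definitive approach: the helper name to_ddl is hypothetical, and the column names, types and dateFormat value are assumptions based on the seq,date sample from the other answer.

```python
# Hypothetical helper: turn a column -> type mapping into a DDL schema string
# of the form "`col0` INT, `col1` DATE", which DataFrameReader.schema accepts.

def to_ddl(type_map):
    """Join column names and Spark SQL type names into one DDL string."""
    return ", ".join(
        f"`{name}` {sql_type.upper()}" for name, sql_type in type_map.items()
    )


# Assumed mapping for the two-column sample file (seq,date)
ddl = to_ddl({"seq": "int", "date": "date"})
print(ddl)

# Reading the file with this schema would look like this (needs a running
# Spark session and a valid path):
#   df = (spark.read
#         .option("header", "true")
#         .option("dateFormat", "dd/MM/yyyy")  # pattern used to parse DATE columns
#         .schema(ddl)
#         .csv(path))
#   df.printSchema()
```

Supplying the schema up front also avoids the extra pass over the data that inferSchema needs.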