I have a requirement like this:
In Databricks, we are reading a CSV file. This file has multiple columns such as emp_name, emp_salary, joining_date, etc. When we read this file into a DataFrame, all the columns come back as string.
We have an API that gives us the schema of the columns: emp_name is string(50), emp_salary is decimal(7,4), joining_date is timestamp, etc.
I have to create a Parquet file with the schema that comes from the API.
How can we do this in Databricks using PySpark?
CodePudding user response:
You can always pass in the schema when reading:
schema = 'emp_name string, emp_salary decimal(7,4), joining_date timestamp'
df = spark.read.csv('input.csv', schema=schema)
df.printSchema()
df.show()
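Alternatively, if the file has already been read with every column as string (as described in the question), the columns can be cast to the target types afterwards. This is only a sketch, not part of the original answer; the api_types dictionary here is a hypothetical stand-in for whatever the schema API actually returns:
from pyspark.sql import functions as F

# Hypothetical mapping assembled from the API response (column name -> Spark type name)
api_types = {
    'emp_name': 'string',
    'emp_salary': 'decimal(7,4)',
    'joining_date': 'timestamp',
}

# Cast each string column to its target type, keeping the original column names
typed_df = df.select([F.col(c).cast(t).alias(c) for c, t in api_types.items()])
typed_df.printSchema()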
The only thing to be careful about is that some type strings coming from the API cannot be used directly; e.g., "string(50)" needs to be converted to "string".
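One way to handle that conversion is a small helper that normalizes the API's type strings before building the schema string. This is a sketch under the assumption that the API types look like "string(50)"; adjust the rule to whatever the API actually returns:
def to_spark_type(api_type: str) -> str:
    # Spark has no length-limited string type, so drop the length from e.g. "string(50)"
    if api_type.startswith('string'):
        return 'string'
    # Types such as decimal(7,4) and timestamp can be passed through unchanged
    return api_type

# Hypothetical API output: list of (column name, API type) pairs
api_schema = [('emp_name', 'string(50)'), ('emp_salary', 'decimal(7,4)'), ('joining_date', 'timestamp')]
schema = ', '.join(f'{name} {to_spark_type(t)}' for name, t in api_schema)
# -> 'emp_name string, emp_salary decimal(7,4), joining_date timestamp'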
input.csv:
"name","123.1234","2022-01-01 10:10:00"