I have a requirement like this:
In Databricks, we are reading a CSV file. This file has multiple columns such as emp_name, emp_salary, joining_date, etc. When we read this file into a DataFrame, all the columns come back as string.
We have an API that gives us the schema of the columns: emp_name is string(50), emp_salary is decimal(7,4), joining_date is timestamp, etc.
I have to create a Parquet file with the schema that comes from the API.
How can we do this in Databricks using PySpark?
CodePudding user response:
You can always pass in the schema when reading:
schema = 'emp_name string, emp_salary decimal(7,4), joining_date timestamp'
df = spark.read.csv('input.csv', schema=schema)
df.printSchema()
df.show()
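Alternatively, if the file has already been read with every column as string (as described in the question), the columns can be cast to the target types afterwards. This is only a sketch, not part of the original answer; the api_types dictionary here is a hypothetical stand-in for whatever the schema API actually returns:
from pyspark.sql import functions as F

# Hypothetical mapping assembled from the API response (column name -> Spark type name)
api_types = {
    'emp_name': 'string',
    'emp_salary': 'decimal(7,4)',
    'joining_date': 'timestamp',
}

# Cast each string column to its target type, keeping the original column names
typed_df = df.select([F.col(c).cast(t).alias(c) for c, t in api_types.items()])
typed_df.printSchema()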
The only thing to be careful about is that some type strings coming from the API cannot be used directly; e.g., "string(50)" needs to be converted to "string".
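One way to handle that conversion is a small helper that normalizes the API's type strings before building the schema string. This is a sketch under the assumption that the API types look like "string(50)"; adjust the rule to whatever the API actually returns:
def to_spark_type(api_type: str) -> str:
    # Spark has no length-limited string type, so drop the length from e.g. "string(50)"
    if api_type.startswith('string'):
        return 'string'
    # Types such as decimal(7,4) and timestamp can be passed through unchanged
    return api_type

# Hypothetical API output: list of (column name, API type) pairs
api_schema = [('emp_name', 'string(50)'), ('emp_salary', 'decimal(7,4)'), ('joining_date', 'timestamp')]
schema = ', '.join(f'{name} {to_spark_type(t)}' for name, t in api_schema)
# -> 'emp_name string, emp_salary decimal(7,4), joining_date timestamp'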
input.csv:
"name","123.1234","2022-01-01 10:10:00"