Create a parquet file with custom schema


I have a requirement like this:

In Databricks, we are reading a CSV file. This file has multiple columns such as emp_name, emp_salary, joining_date, etc. When we read this file into a dataframe, all the columns come back as string.

We have an API that gives us the schema of the columns: emp_name is string(50), emp_salary is decimal(7,4), joining_date is timestamp, etc.

I have to create a parquet file with the schema that is coming from the API.

How can we do this in Databricks using PySpark?

CodePudding user response:

You can always pass in the schema when reading:

# DDL-style schema string built from the types the API provides
schema = 'emp_name string, emp_salary decimal(7,4), joining_date timestamp'
df = spark.read.csv('input.csv', schema=schema)
df.printSchema()
df.show()

The only thing to be careful about is that some type strings from the API cannot be used directly, e.g., "string(50)" needs to be converted to "string".
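For example, a small mapping step can normalize the API types before building the DDL schema string. A minimal sketch, assuming the API hands back (column, type) pairs in a hypothetical format:

import re

# Hypothetical API output: (column name, type string) pairs
api_schema = [('emp_name', 'string(50)'),
              ('emp_salary', 'decimal(7,4)'),
              ('joining_date', 'timestamp')]

def normalize(type_str):
    # Spark has no length-limited string type, so drop the "(50)" suffix
    return re.sub(r'^string\(\d+\)$', 'string', type_str)

schema = ', '.join(f'{name} {normalize(t)}' for name, t in api_schema)
# 'emp_name string, emp_salary decimal(7,4), joining_date timestamp'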

input.csv:

"name","123.1234","2022-01-01 10:10:00"