Read spark data with column that clashes with partition name


I have the following file paths that we read with partitions on S3:

prefix/company=abcd/service=xyz/date=2021-01-01/file_01.json
prefix/company=abcd/service=xyz/date=2021-01-01/file_02.json
prefix/company=abcd/service=xyz/date=2021-01-01/file_03.json

When I read these with pyspark:

self.spark \
    .read \
    .option("basePath", 'prefix') \
    .schema(self.schema) \
    .json(['company=abcd/service=xyz/date=2021-01-01/'])

All the files have the same schema and get loaded in the table as rows. A file could be something like this:

{"id": "foo", "color": "blue", "date": "2021-12-12"}

The issue is that sometimes the files contain a field that clashes with my partition column name, like date. So I want to know if it is possible to load the files without the partition columns, rename the JSON date column, and then add the partition columns back.

Final table would be:

| id  | color | file_date  | company | service | date       |
|-----|-------|------------|---------|---------|------------|
| foo | blue  | 2021-12-12 | abcd    | xyz     | 2021-01-01 |
| bar | red   | 2021-10-10 | abcd    | xyz     | 2021-01-01 |
| baz | green | 2021-08-08 | abcd    | xyz     | 2021-01-01 |

CodePudding user response:

The easiest solution would be to simply change the partition column name. You can then read in the data and rename the columns as you wish, and you won't lose the benefits of partitioning either.
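
A minimal sketch of such a one-time rewrite, assuming a hypothetical s3://bucket/prefix location, an existing spark session, and partition_date as an illustrative new partition name:

from pyspark.sql import functions as F

# Read the leaf directory directly (no basePath), so Spark does not try to
# infer the clashing "date" partition column and only the JSON fields load.
df = spark.read.json("s3://bucket/prefix/company=abcd/service=xyz/date=2021-01-01/")

# Write the same data back under a partition name that cannot clash with
# the JSON "date" field; the partition values are known from the path here.
(
    df.withColumn("company", F.lit("abcd"))
      .withColumn("service", F.lit("xyz"))
      .withColumn("partition_date", F.lit("2021-01-01"))
      .write.partitionBy("company", "service", "partition_date")
      .mode("append")
      .json("s3://bucket/prefix_renamed/")
)

Data written this way can be read with basePath as before, and the JSON date field no longer collides with a partition column.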

If that is not an option, you could read in the JSON files using a wildcard for the partitions, rename the date column to 'file_date', and then add the partition date by extracting it from the file name. You can get the file name with input_file_name() from pyspark.sql.functions.

Edit: I missed that you have other partition columns before the date; you'd have to extract them from the file name as well, which makes this less ideal.
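
A minimal sketch of this approach, assuming a hypothetical s3://bucket/prefix location and a glob depth matching the company/service/date layout from the question; the regex patterns are illustrative:

from pyspark.sql import functions as F

# Glob straight to the leaf files so Spark skips partition discovery and
# only the JSON columns (including the file's own "date") are loaded.
df = spark.read.json("s3://bucket/prefix/*/*/*/*.json")

df = (
    df.withColumnRenamed("date", "file_date")
      .withColumn("path", F.input_file_name())
      # Recover each partition value from the file path.
      .withColumn("company", F.regexp_extract("path", r"company=([^/]+)", 1))
      .withColumn("service", F.regexp_extract("path", r"service=([^/]+)", 1))
      .withColumn("date", F.regexp_extract("path", r"date=([^/]+)", 1))
      .drop("path")
)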

CodePudding user response:

Yes, you can read all the JSON files without the partition columns. Point the reader directly at the parent folder with a wildcard and it will load the data from all partitions into the data frame.

After reading the data frame, you can use the withColumnRenamed() function to rename the date field.

Something like below might work:

df = spark.read.json("s3://bucket/table/**/*.json")

renamedDF = df.withColumnRenamed("date", "file_date")
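
To then reach the final table from the question, the partition columns can be added back by parsing each row's file path, reusing the input_file_name() idea from the first answer; finalDF and the regex patterns below are illustrative:

from pyspark.sql import functions as F

# Column expression for the path of the file each row was read from.
path = F.input_file_name()

# Re-add the partition columns extracted from the path components.
finalDF = (
    renamedDF
    .withColumn("company", F.regexp_extract(path, r"company=([^/]+)", 1))
    .withColumn("service", F.regexp_extract(path, r"service=([^/]+)", 1))
    .withColumn("date", F.regexp_extract(path, r"date=([^/]+)", 1))
)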
