How to filter data using spark.read in place?


I am trying to read data in Delta format from ADLS, and I want to read only a portion of that data by filtering it in place. The same approach worked for me when reading the JDBC format:

query = f"""
    select * from {table_name} 
    where 
       createdate < to_date('{createdate}','YYYY-MM-DD HH24:MI:SS') or
       modifieddate < to_date('{modifieddate}','YYYY-MM-DD HH24:MI:SS')
    """

return spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("query", query) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()

So I tried to read the Delta data in a similar way using a query, but it reads the whole table.

return spark.read \
    .format("delta") \
    .option("query", query) \
    .load(path)

How could I solve this issue without reading the full DataFrame and filtering it afterwards?

Thanks in advance!

CodePudding user response:

Spark uses a technique called predicate pushdown to optimize queries. In the first case, the filters can be passed on to the Oracle database, which evaluates them before returning any rows to Spark.
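For example, with a dbtable-based JDBC read you can confirm that a DataFrame filter is pushed to Oracle by looking for "PushedFilters" in the physical plan. This is a minimal sketch reusing the connection options from your question; the date literal is a placeholder:

jdbc_df = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()

# The filter shows up under "PushedFilters" in the plan, meaning it is
# evaluated by Oracle rather than by Spark after loading.
jdbc_df.filter("createdate < date '2023-01-01'").explain()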

Delta does not work that way. There is no query option for the Delta source, so it is silently ignored and the whole table is returned. There can be optimisations through data skipping and Z-ordering, but since you are essentially querying Parquet files, the filtering is done by Spark itself: you express the filter on the DataFrame, Delta's file-level statistics prune files that cannot match, and the remaining files are read and filtered after loading.
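A minimal sketch of the Delta equivalent, assuming the column names from your JDBC query and that createdate/modifieddate are strings in 'yyyy-MM-dd HH:mm:ss' format; Spark evaluates the .where() lazily, so the full table is not materialised first:

from pyspark.sql import functions as F

# The filter is applied to the DataFrame instead of a query option.
# Data skipping (and Z-ordering, if applied) can prune files whose
# min/max statistics rule them out; the rest are read and filtered by Spark.
df = spark.read \
    .format("delta") \
    .load(path) \
    .where(
        (F.col("createdate") < F.to_timestamp(F.lit(createdate), "yyyy-MM-dd HH:mm:ss")) |
        (F.col("modifieddate") < F.to_timestamp(F.lit(modifieddate), "yyyy-MM-dd HH:mm:ss"))
    )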
