I am looking for guidance/best practices on how to approach a task. I want to use Azure Databricks and PySpark.
Task: Load and prepare the data so that it can be analyzed efficiently/quickly in the future. The analysis will involve summary statistics, exploratory data analysis and maybe simple ML (regression). The analysis part is not clearly defined yet, so my solution needs flexibility in this area.
Data: session-level data (12 TB) stored in 100,000 single-line JSON files. The JSON schema is nested and includes arrays. The schema is not uniform: new fields are added over time, since the data is a time series.
Overall, the task is to build the infrastructure so that the data can be processed efficiently in the future. There will be no new data coming in.
My initial plan was to:
Load data into blob storage
Process data using PySpark
- flatten the nested JSON by reading it into a DataFrame (rough sketch after this list)
- save as Parquet (alternatives?)
Store in a DB so the data can be quickly queried and analyzed
- I am not sure which Azure solution (DB) would work here
- Can I skip this step when the data is stored in an efficient format (e.g. Parquet)?
Analyze the data using PySpark by querying it from the DB (or directly from blob storage when it's stored as Parquet)
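Roughly, the flatten/save step I have in mind would look something like this (paths, field names and the partition column are just placeholders, since I don't know the exact schema handling yet):

```python
from pyspark.sql import functions as F

# Single-line JSON, so the default (non-multiline) JSON reader should work.
# `spark` is the Databricks notebook's SparkSession.
raw = spark.read.json("wasbs://<container>@<account>.blob.core.windows.net/sessions/")

# Flatten: explode arrays and promote nested struct fields
flat = (raw
        .withColumn("event", F.explode_outer("events"))    # assuming an `events` array
        .select("session_id", "timestamp", "event.*")      # assuming these fields exist
        .withColumn("event_date", F.to_date("timestamp")))

# Save in a columnar format, partitioned for faster time-range queries
(flat.write
     .mode("overwrite")
     .partitionBy("event_date")
     .parquet("wasbs://<container>@<account>.blob.core.windows.net/sessions_parquet/"))
```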
Does this sound reasonable? Does anyone have materials/tutorials that follow a similar process, so I could use them as blueprints for my pipeline?
CodePudding user response:
Yes, it sounds reasonable, and in fact it's a quite standard architecture (often referred to as a lakehouse). The usual implementation approach is the following:
JSON data loaded into blob storage is consumed using Databricks Auto Loader, which provides an efficient way of ingesting only new data (since the previous run). You can trigger the pipeline regularly, for example nightly, or run it continuously if data is arriving all the time. Auto Loader also handles schema evolution of the input data.
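For illustration, a minimal sketch of what such an Auto Loader job could look like, assuming mounted paths under /mnt/ and a target table called sessions_bronze (all placeholders); since you expect no new data, an availableNow trigger turns it into a one-off batch load:

```python
# Incrementally ingest the raw JSON files with Auto Loader and land them in a Delta table.
raw = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "json")
       .option("cloudFiles.schemaLocation", "/mnt/checkpoints/sessions_schema")
       .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # new fields get added to the schema
       .load("/mnt/raw/sessions/"))

(raw.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/sessions_bronze")
    .trigger(availableNow=True)   # process everything that is already there, then stop
    .toTable("sessions_bronze"))
```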
It's better to store the processed data as Delta Lake tables, which provide better performance than "plain" Parquet thanks to the additional information in the transaction log, making it possible to efficiently access only the necessary data. (Delta Lake is built on top of Parquet, but has more capabilities.)
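A rough sketch of the flattening step that writes an analysis-ready Delta table, continuing from the Auto Loader sketch above (table, partition and Z-order columns are assumptions, not something prescribed):

```python
from pyspark.sql import functions as F

# Flatten the ingested data and store it as a Delta table
bronze = spark.table("sessions_bronze")

silver = (bronze
          .withColumn("event", F.explode_outer("events"))   # assuming an `events` array
          .select("session_id", "timestamp", "event.*")
          .withColumn("event_date", F.to_date("timestamp")))

(silver.write
       .format("delta")
       .mode("overwrite")
       .partitionBy("event_date")
       .saveAsTable("sessions_silver"))

# Optional: compact files and Z-order by a frequently filtered column to improve data skipping
spark.sql("OPTIMIZE sessions_silver ZORDER BY (session_id)")
```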
The processed data can then be accessed via Spark code, or via Databricks SQL (which can be more efficient for reporting, etc., as it's heavily optimized for BI workloads). Given the large amount of data, storing it in some "traditional" database may not be very efficient, or may be very costly.
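For example, EDA can then run directly against the Delta table with plain PySpark (column names below are placeholders); the same table is also queryable from a Databricks SQL warehouse:

```python
from pyspark.sql import functions as F

df = spark.table("sessions_silver")

# Summary statistics / exploratory analysis
df.select("session_duration", "page_views").describe().show()   # assuming numeric columns

(df.groupBy("event_date")
   .agg(F.avg("session_duration").alias("avg_duration"),
        F.count("*").alias("sessions"))
   .orderBy("event_date")
   .show())

# Equivalent in Databricks SQL:
# SELECT event_date, AVG(session_duration), COUNT(*) FROM sessions_silver GROUP BY event_date;
```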
P.S. I would recommend looking into implementing this with Delta Live Tables, which may simplify the development of your pipelines.
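A hypothetical Delta Live Tables version of the same two steps could look like this (table names, paths and fields are placeholders; DLT manages the checkpoints and schema tracking for you):

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw session JSON ingested with Auto Loader")
def sessions_bronze():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/raw/sessions/"))

@dlt.table(comment="Flattened sessions, ready for analysis")
def sessions_silver():
    return (dlt.read_stream("sessions_bronze")
            .withColumn("event", F.explode_outer("events"))   # assuming an `events` array
            .select("session_id", "timestamp", "event.*"))
```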
Also, you may have access to Databricks Academy, which has introductory courses about the lakehouse architecture and data engineering patterns. If you don't have access to it, you can at least look at the Databricks courses published on GitHub.