I am new to Azure Databricks and a little confused by everything. How does Databricks use Azure filesystem events? What are Azure filesystem events? How can Databricks check for missed events? Thank you for your help.
CodePudding user response:
Databricks has a feature called Auto Loader that efficiently loads data from files on ADLS, Azure Blob Storage, or other cloud storage systems. Open-source Spark can also load files from cloud storage, but it discovers new files only by listing the directory contents, which can be very slow when a directory (or set of directories) contains many files. Auto Loader also supports discovering data by listing files, and its listing mode is more optimized than standard Spark's.
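A minimal sketch of Auto Loader in its default directory-listing mode (this runs only on Databricks Runtime, where the `cloudFiles` source is available; the storage paths and table name below are placeholders you would replace with your own):

```python
# Auto Loader in directory-listing mode (the default).
# All abfss:// paths here are hypothetical examples.
df = (spark.readStream
      .format("cloudFiles")                      # Auto Loader streaming source
      .option("cloudFiles.format", "json")       # format of the incoming files
      .option("cloudFiles.schemaLocation",
              "abfss://container@account.dfs.core.windows.net/_schemas/events")
      .load("abfss://container@account.dfs.core.windows.net/raw/events"))

(df.writeStream
   .option("checkpointLocation",
           "abfss://container@account.dfs.core.windows.net/_checkpoints/events")
   .trigger(availableNow=True)                   # process all pending files, then stop
   .toTable("bronze.events"))                    # hypothetical target table
```

The checkpoint location is what lets Auto Loader track which files it has already ingested, so restarting the stream picks up only new files.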
But Auto Loader also has a much more powerful file-discovery mode that uses file notifications at the cloud-storage level. In this mode, the ingestion process receives the names of new files directly (via storage events delivered through a queue), without needing to list the directory, so it is much faster and more efficient. Regarding missed events: Auto Loader has an asynchronous backfill mechanism that periodically lists the directory to pick up any files whose notification events were lost.
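A sketch of the same stream switched to file-notification mode, with a periodic backfill to catch files whose events were missed (again Databricks Runtime only; paths are placeholder examples, and notification mode additionally needs permissions to create the Event Grid subscription and queue):

```python
# Auto Loader in file-notification mode with an async backfill.
# All abfss:// paths here are hypothetical examples.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.useNotifications", "true")   # consume storage events instead of listing
      .option("cloudFiles.backfillInterval", "1 day")  # periodic listing to catch missed events
      .option("cloudFiles.schemaLocation",
              "abfss://container@account.dfs.core.windows.net/_schemas/events")
      .load("abfss://container@account.dfs.core.windows.net/raw/events"))
```

With `cloudFiles.useNotifications` set, Auto Loader sets up the Azure Event Grid subscription and storage queue for you; `cloudFiles.backfillInterval` is what guarantees eventual completeness even if individual notifications are dropped.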