I'm new to Azure and Databricks. I've been watching training videos and do have cloud experience with AWS. However, I'm on a time crunch, so help would be appreciated. I have multiple data sources from which I need to ingest live data (via API calls/database connections) into Azure and run transformations/ML in Databricks. I will likely need to output the cleaned dataframe(s) into a data warehouse or SQL database that will have a BI connection. If someone with experience in Azure Databricks could recommend which products I need, that would be terrific. Note this is not 'big data' (only ~100,000 rows max), but it will need enough compute capacity to run ML (NLP) quickly.
1. ELT/ETL - should I go Data Factory -> Databricks, or maybe Kafka -> Blob Storage -> Databricks?
2. What worker type/size is recommended for live data processing / an NLP application?
CodePudding user response:
As it is the beginning of the project, the simplest way is to write notebooks in Databricks that connect to the source and load the data into DBFS storage, then process that data again in Databricks (ML etc.). A rough sketch of such a notebook is below.
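A minimal sketch of what that notebook could look like, assuming a JDBC-reachable source database; the URL, table name, secret scope, and paths are placeholders, not real endpoints:

```python
# Databricks notebook sketch: pull a table from a source database and land it on DBFS as Delta.
# The JDBC URL, table, and secret names below are placeholders.

jdbc_url = "jdbc:sqlserver://<your-server>.database.windows.net:1433;database=<your-db>"

raw_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.source_table")                          # hypothetical source table
    .option("user", dbutils.secrets.get("scope", "db-user"))        # keep credentials in a secret scope
    .option("password", dbutils.secrets.get("scope", "db-pass"))
    .load()
)

# Land the raw data on DBFS (Delta format) so later ML/NLP notebooks can re-read it.
raw_df.write.format("delta").mode("overwrite").save("dbfs:/mnt/raw/source_table")

# Re-read and start transformations / feature prep for the NLP step.
clean_df = spark.read.format("delta").load("dbfs:/mnt/raw/source_table").dropna()
```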
If it is a small dataset, just take the smallest cluster for that: a single small worker plus driver, or even a single-node (driver-only) cluster.
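As a sketch of what a minimal single-node cluster looks like, here is a hedged example using the Databricks Clusters API; the workspace URL, token, node type, and runtime version are placeholders/assumptions you would adjust to your workspace:

```python
import requests

# Sketch: create a minimal single-node cluster via the Databricks Clusters API.
workspace = "https://<your-workspace>.azuredatabricks.net"   # placeholder workspace URL
token = "<personal-access-token>"                            # placeholder token

cluster_spec = {
    "cluster_name": "nlp-dev",
    "spark_version": "13.3.x-cpu-ml-scala2.12",   # an ML runtime; pick a current LTS from your workspace
    "node_type_id": "Standard_DS3_v2",            # small Azure VM; plenty for ~100k rows
    "num_workers": 0,                             # driver-only (single-node) cluster
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
    "autotermination_minutes": 30,                # shut down when idle to save cost
}

resp = requests.post(
    f"{workspace}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```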
In the future you can always upgrade the worker type and set the notebooks to run as jobs, triggered via Data Factory.
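For the job side, a hedged sketch of registering the ingest notebook as a Databricks job (Jobs API 2.1), which Data Factory's Databricks Notebook activity or the built-in scheduler can then trigger; the notebook path, workspace URL, and cluster sizing are placeholders:

```python
import requests

workspace = "https://<your-workspace>.azuredatabricks.net"   # placeholder
token = "<personal-access-token>"                            # placeholder

job_spec = {
    "name": "ingest-and-prepare",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/<user>/ingest_notebook"},  # placeholder path
            "new_cluster": {
                "spark_version": "13.3.x-cpu-ml-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 1,   # bump later if the NLP workload needs more parallelism
            },
        }
    ],
}

resp = requests.post(
    f"{workspace}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json()["job_id"])
```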
CodePudding user response:
An ELT approach is generally recommended over ETL in modern data handling: data should be landed in a data lake before being transformed for the business use case. This is the recommended pattern for any tool built on distributed systems.
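A minimal sketch of that landing pattern in a Databricks notebook, assuming an ADLS Gen2 lake reachable via abfss://; the storage account, containers, connection details, and column names are placeholders:

```python
# ELT sketch: land raw data in the lake first, transform it in a separate step.

raw_path = "abfss://raw@<storageaccount>.dfs.core.windows.net/source_table/"
curated_path = "abfss://curated@<storageaccount>.dfs.core.windows.net/source_table/"

# 1. Extract + Load: persist the source data as-is, with no business logic applied yet.
raw_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<server>;database=<db>")   # placeholder connection
    .option("dbtable", "dbo.source_table")
    .load()
)
raw_df.write.format("delta").mode("append").save(raw_path)

# 2. Transform: a later notebook/job reads the landed data and shapes it for the business use case.
curated_df = (
    spark.read.format("delta").load(raw_path)
    .dropDuplicates()
    .withColumnRenamed("cust_nm", "customer_name")   # hypothetical column cleanup
)
curated_df.write.format("delta").mode("overwrite").save(curated_path)
```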