First of all, I am quite a beginner in stream analytics and processing.
I am exploring Azure Stream Analytics and have come across various designs; most of the time the flow is like this:
Stream of data (Azure Event Hub) -> configured as input to ASA -> some query -> output to a destination like SQL Server, Power BI, or Event Hub.
Now let's say there are 10,000 events coming in per minute, and they are processed by ASA with the help of some window function like a tumbling window, then output to an Azure SQL Database. If we look at this, there may be many inserts into SQL Server, and it will accumulate a lot of data as events keep coming in.
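For example, I mean a query roughly like this (the input/output aliases and field names are just placeholders I made up):

```sql
-- Hypothetical aliases and fields, just to illustrate the flow:
-- count events per device every minute and write the result to Azure SQL.
SELECT
    DeviceId,
    COUNT(*) AS EventCount,
    System.Timestamp() AS WindowEndUtc
INTO [sql-output]
FROM [eventhub-input] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY DeviceId, TumblingWindow(minute, 1)
```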
Is using SQL Server a good solution here? If there are too many events during the day, don't you think it will become a performance bottleneck?
Any suggestion, guidance, or example will be greatly appreciated.
CodePudding user response:
Yes, Azure SQL DB will have no problem ingesting that much data (in fact it can go as high as millions of rows per second). Here's a Learn path that you may find interesting: https://docs.microsoft.com/en-us/learn/modules/deploy-iot-solution-azure-sql-database/2-analyze-solution-architecture
CodePudding user response:
There are multiple aspects to consider here.
ASA can be used for event-by-event processing (for ETL-like ingestion, but in real time), but where it really shines is with aggregations and time windowing (for real-time analytics and reporting). These operations can significantly reduce the amount of data that is inserted in the database.
Using ASA to reduce the output volume
If your use case is along those lines, you can go from 10,000 events per minute (all devices in your fleet) to 5 records per hour (an average of the metrics per hour over 5 factories). You can always add a second output that generates an alert if a specific metric stays above a threshold for more than 5 seconds (1 alert per 30 seconds expected).
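To make that concrete, here is a rough sketch of what such a job could look like (the input/output names, fields, and threshold are assumptions, not your actual schema):

```sql
-- Hourly average per factory: turns ~10,000 events/minute into a handful
-- of rows per hour written to the database output.
SELECT
    FactoryId,
    AVG(Temperature) AS AvgTemperature,
    System.Timestamp() AS WindowEndUtc
INTO [sql-output]
FROM [eventhub-input] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY FactoryId, TumblingWindow(hour, 1)

-- Second output: raise an alert when a metric stays above a threshold
-- for an entire 5-second window.
SELECT
    DeviceId,
    MIN(Temperature) AS MinTemperatureOverWindow,
    System.Timestamp() AS WindowEndUtc
INTO [alert-output]
FROM [eventhub-input] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY DeviceId, TumblingWindow(second, 5)
HAVING MIN(Temperature) > 90
```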
So any scale discussion needs to be made in the context of the business case. For the above you may need a medium-sized job to process data at 10K events per minute, but any entry-level SKU will work for the database.
SQL Outputs do scale
Now let's say that your use case requires 10,000 records per minute to be inserted into a SQL database.
This article discusses the performance implications of targeting Azure SQL. There you'll see that if the destination table is partitioned in alignment with the input and the query, and you stay in the sweet spot for the batch size of the bulk insert, it should handle the load fine. Here it's less an issue of capability and more one of performance/price. In most cases you will be able to scale up the database until it handles the load, but it may become too expensive to operate.
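For illustration, a pass-through query whose output stays aligned with the Event Hub partitions could look roughly like this (aliases and columns are assumptions; on older compatibility levels the explicit PARTITION BY is needed, on newer ones partitioning is largely inherited automatically):

```sql
-- Partition-aligned pass-through: each Event Hub partition produces its own
-- stream of bulk inserts into the (similarly partitioned) destination table.
SELECT
    DeviceId,
    Temperature,
    EventEnqueuedUtcTime
INTO [sql-output]
FROM [eventhub-input] PARTITION BY PartitionId
```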
An alternative is to target a Synapse SQL Pool (formerly SQL DWH), which scales easily up to 200MB/sec. It's a distributed system, like ASA, so it will scale up more elegantly and with a better performance/price ratio (but the starting point is higher, and you need to think about distribution).
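If you do go down that route, the table design is where the "think about distribution" part shows up. A hypothetical example (table and column names are mine, not a recommendation for your schema):

```sql
-- Dedicated SQL pool (Synapse) table: hash-distributed so inserts and
-- aggregations spread across the pool's 60 distributions, with a clustered
-- columnstore index for compression and scan performance.
CREATE TABLE dbo.DeviceTelemetry
(
    DeviceId     INT       NOT NULL,
    FactoryId    INT       NOT NULL,
    Temperature  FLOAT     NOT NULL,
    EventTimeUtc DATETIME2 NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (DeviceId),
    CLUSTERED COLUMNSTORE INDEX
);
```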
Other outputs
As a commenter pointed out above, if you don't need the relational aspect of a SQL database, you may want to explore switching to a Cosmos DB output. Cosmos DB is a distributed database (it scales very well with ASA) that stores documents instead of rows. The main benefit here is that you won't have data type issues writing into that database, since types are not enforced, contrary to a SQL database.
But this comes with a suite of other aspects you will need to learn about, so if you're already comfortable with a SQL database, you may well want to stick to it at first.