I would like some advice on the best option, as I am seeing slow data retrieval (over 30 seconds) on my web-based API.
I have multiple IoT sensors (over 100, and growing) that transmit updated sensor data to my IoT Hub, which then gets saved to a database or storage.
Previously I saved all my sensor data to a single SQL table, but as the data grew it became very slow, so I switched to Azure Table Storage. Each sensor has its own table; the partition key is the month and year (e.g. 202012), and the row key is a timestamp (e.g. 0002518033824243332546).
This proved much faster, since each sensor's table holds far less data, but as the table grows for any particular sensor and I need to retrieve data across a longer period (1 month), it becomes very slow again. Each sensor transmits an update every minute, so each day produces 1,440 records and a month has about 44,640 records.
Is there any better solution for my requirement?
Would having an individual SQL table for each sensor be a good idea? How many tables can there be in a SQL database?
Thank You
CodePudding user response:
You can consider using Azure SQL Database (premium tiers P6 or greater) to store IoT data coming from a cold path (storage accounts) or from hot paths that need to be saved directly to the database.
First, consider optimizing data storage by using clustered columnstore indexes, as mentioned here. With that you can compress data up to 20x and bring more data into memory quickly.
CREATE TABLE Telemetry (
    _id bigint IDENTITY,              -- surrogate key
    sensorData nvarchar(max),         -- raw JSON payload from the sensor
    INDEX cci CLUSTERED COLUMNSTORE   -- columnstore compression for fast scans and aggregations
)
Second, you can consider using In-Memory OLTP and the JSON functions in Transact-SQL. With those you can treat data formatted as JSON like any other SQL data type and extract values from the JSON text to use in the SELECT list or in a search predicate. Since columnstore-based tables are optimized for scans and aggregations rather than key-lookup queries, you can also create computed columns based on JSON functions that expose specific attributes within the original JSON column as regular relational columns, simplifying query design and development. You can further optimize data retrieval by creating regular (row-based) nonclustered indexes on those computed columns to support critical queries and access paths. While these slightly increase overall storage needs, they help the query processor filter rows on key lookups and range scans, and can also help with other operations such as aggregations. Note that you can add computed columns and the related indexes at any time.
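For instance, a minimal sketch of that pattern (the attribute names deviceId, ts and temperature are assumptions about the JSON payload, and non-persisted computed columns on a columnstore table require SQL Server 2017+ or Azure SQL Database):
-- Expose hypothetical JSON attributes as computed columns
ALTER TABLE Telemetry
    ADD deviceId    AS CAST(JSON_VALUE(sensorData, '$.deviceId') AS varchar(50)),
        readingTs   AS CAST(JSON_VALUE(sensorData, '$.ts') AS datetime2),
        temperature AS CAST(JSON_VALUE(sensorData, '$.temperature') AS float);

-- Row-based nonclustered index on the computed columns to support key lookups and range scans
CREATE NONCLUSTERED INDEX ix_Telemetry_device_ts
    ON Telemetry (deviceId, readingTs);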
In case your JSON structure is stable and known upfront, the best option would be to design your relational schema to accommodate the most relevant attributes from the JSON data, and leverage the OPENJSON function to transform those attributes into row fields when inserting new data. These will be fully relational columns (with optimized SQL data types) that can be used for all sorts of retrieval and analytical purposes, from complex filtering to aggregations, and then you just need to create proper indexes to support the various access paths. However, you can still decide to keep the entire JSON fragment and store it in an nvarchar(max) field in the same table if further processing may be needed.
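A minimal sketch of that approach, using hypothetical table, column and attribute names (Readings, deviceId, ts, temperature) that stand in for your actual schema:
-- Relational target table with typed columns, plus the original JSON fragment if needed
CREATE TABLE Readings (
    deviceId    varchar(50)   NOT NULL,
    readingTs   datetime2     NOT NULL,
    temperature float         NULL,
    rawJson     nvarchar(max) NULL
);

-- Shred an incoming JSON batch into rows at insert time with OPENJSON
DECLARE @payload nvarchar(max) = N'[
  {"deviceId":"sensor-001","ts":"2020-12-01T10:00:00","temperature":21.5},
  {"deviceId":"sensor-002","ts":"2020-12-01T10:00:00","temperature":19.8}
]';

INSERT INTO Readings (deviceId, readingTs, temperature, rawJson)
SELECT deviceId, readingTs, temperature, rawJson
FROM OPENJSON(@payload)
     WITH (
         deviceId    varchar(50)   '$.deviceId',
         readingTs   datetime2     '$.ts',
         temperature float         '$.temperature',
         rawJson     nvarchar(max) '$' AS JSON
     );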
If you can save your IoT data as row fields with SQL data types, also consider time-based table partitioning: keep all your data in one table, but partition it by date range as explained here.
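A minimal sketch of monthly partitioning, reusing the hypothetical Readings columns above (the boundary dates and the PRIMARY filegroup are assumptions):
-- Monthly RANGE RIGHT partition function and scheme
CREATE PARTITION FUNCTION pfMonthly (datetime2)
    AS RANGE RIGHT FOR VALUES ('2020-11-01', '2020-12-01', '2021-01-01');

CREATE PARTITION SCHEME psMonthly
    AS PARTITION pfMonthly ALL TO ([PRIMARY]);

-- Readings table partitioned on the reading timestamp
CREATE TABLE ReadingsPartitioned (
    deviceId    varchar(50) NOT NULL,
    readingTs   datetime2   NOT NULL,
    temperature float       NULL
) ON psMonthly (readingTs);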
CodePudding user response:
Generally speaking, in terms of performance it is better to aggregate during writes than during reads. Build up multiple tables for multiple views of the data: for example, a table with the raw data of the last 24 hours, one with per-hour aggregates for the last month, and one with per-day aggregates for the last year.
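A minimal sketch of the per-hour view, assuming a hypothetical raw table Readings(deviceId, readingTs, temperature); the names and the scheduling are assumptions:
-- Pre-aggregated hourly table maintained alongside the raw data
CREATE TABLE HourlyAverages (
    deviceId    varchar(50) NOT NULL,
    hourStart   datetime2   NOT NULL,
    avgValue    float       NOT NULL,
    sampleCount int         NOT NULL,
    PRIMARY KEY (deviceId, hourStart)
);

-- Run periodically (e.g. once per hour) to roll up the previous hour of raw readings
INSERT INTO HourlyAverages (deviceId, hourStart, avgValue, sampleCount)
SELECT deviceId,
       DATEADD(hour, DATEDIFF(hour, 0, readingTs), 0),   -- truncate to the hour
       AVG(temperature),
       COUNT(*)
FROM Readings
WHERE readingTs >= DATEADD(hour, -1, SYSUTCDATETIME())   -- last hour only
GROUP BY deviceId, DATEADD(hour, DATEDIFF(hour, 0, readingTs), 0);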
As per your comment:
Yes, the samples are required for averaging purposes, but first I need to retrieve the data. Once we get the raw data, we average it across an hour. This is done per sensor.
It doesn't make sense to store millions and millions of raw data points only to read them all into memory and greatly reduce the number of data points by aggregating in-memory. It is a waste of storage and processing power.
Is there any better solution for my requirement?
What requirement? I don't see any. Anyway, based on the little info you did provide: instead of using an RDBMS like SQL Server, I suggest looking for storage that is purpose-built for this kind of data, like Azure Time Series Insights. It integrates nicely with IoT Hub.
By the way, a tip to get better answers to future questions: tell us the use case. What do you need the data for? What kind of queries and/or visualisations do you want? What analytics do you want to perform?
CodePudding user response:
Would having an individual SQL table for each sensor be a good idea? How many tables can there be in a SQL database?
Tables can be created practically without limit (SQL Server allows over 2 billion objects per database), so the number of tables itself is not a problem.
But I don't think it is a good idea for automation, because you would have to create a new table every time a new sensor is added.
I suggest you use only one table for all sensors and partition it by sensor id (see the sketch below).
Creating a separate table and creating a partition are similar in effect, because both physically separate the rows on disk.
So reading from a dedicated table and reading from a single partition has no real speed difference.
But your problem still remains when you want to get data for two or more months, because that query has to touch multiple partitions.
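A minimal sketch of that single-table layout (the partition boundaries, filegroup, and table/column names are assumptions):
-- One table for all sensors, range-partitioned on an integer sensor id
CREATE PARTITION FUNCTION pfSensor (int)
    AS RANGE RIGHT FOR VALUES (100, 200, 300);   -- hypothetical sensor id boundaries

CREATE PARTITION SCHEME psSensor
    AS PARTITION pfSensor ALL TO ([PRIMARY]);

CREATE TABLE AllSensorReadings (
    sensorId  int       NOT NULL,
    readingTs datetime2 NOT NULL,
    value     float     NULL
) ON psSensor (sensorId);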
Is there any better solution for my requirement?
I think you need to run a batch program that builds a summed (pre-aggregated) table for your report, as @Peter Bons commented.
Get data up to yesterday from the summed table, get today's data from the current table, and merge the two results; that will be faster.
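A minimal sketch of that merge, assuming a hypothetical hourly summary table SensorHourlySum(sensorId, hourStart, avgValue) maintained by the batch job and a raw table SensorRaw(sensorId, readingTs, value):
-- Pre-summed history up to yesterday, plus today's raw data aggregated on the fly
SELECT sensorId, hourStart, avgValue
FROM SensorHourlySum                                        -- maintained by the batch job
WHERE hourStart < CAST(CAST(SYSUTCDATETIME() AS date) AS datetime2)

UNION ALL

SELECT sensorId,
       DATEADD(hour, DATEDIFF(hour, 0, readingTs), 0) AS hourStart,
       AVG(value) AS avgValue
FROM SensorRaw                                              -- today's readings only
WHERE readingTs >= CAST(CAST(SYSUTCDATETIME() AS date) AS datetime2)
GROUP BY sensorId, DATEADD(hour, DATEDIFF(hour, 0, readingTs), 0);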