Can big data solve my problem of fast queries over large amounts of data?

Time:09-17

The original design was bad: my orders are spread across four 10-million-row tables that have to be joined (Oracle). When a customer comes in, I list all of their orders, and joining four big tables like that is painfully slow. "Big data" has been hyped for years now, so I'd like to know: is this the kind of problem big data solves, and how?

The real-time requirement is high, and the queries must come back fast.

CodePudding user response:

Your question is too general. "Big data" here means distributed storage and distributed computing. For joins over four 10-million-row tables with quick response, the first question is what node configuration you can provide, then how the data should be distributed and managed, and finally which computing framework to adopt.

CodePudding user response:

Big data should be able to handle this: flatten the four associated tables into one wide table in HBase or something similar, index it with ES, and you should get second-level response times.

CodePudding user response:

Can real-time computation over a cascade of four 10-million-row tables really satisfy your real-time queries?

The reply above is the usual big-data solution: put the wide-table data in HBase and build the index in ES.
If your query pattern is fairly fixed, a well-designed HBase rowkey lets you query HBase directly, which is even faster than ES.
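
For illustration, a minimal sketch of that rowkey idea in Python with the happybase client; the host, the table name 'order_wide', the column family 'o', and the customer-id-plus-reversed-timestamp key layout are all assumptions, not a fixed recipe:

```python
import happybase

# Rowkey design: customer id + reversed timestamp, so one customer's
# orders sit next to each other and the newest orders come first.
def make_rowkey(customer_id: str, order_ts: int) -> bytes:
    reversed_ts = 9_999_999_999 - order_ts  # seconds-level epoch, assumed
    return f"{customer_id}_{reversed_ts:010d}".encode()

def orders_for_customer(customer_id: str, limit: int = 20):
    conn = happybase.Connection("hbase-host")  # assumed host
    table = conn.table("order_wide")
    # One prefix scan replaces the four-table join: all columns of the
    # wide row come back together, newest first.
    rows = table.scan(row_prefix=f"{customer_id}_".encode(), limit=limit)
    return [(key, data) for key, data in rows]
```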

CodePudding user response:

I suggest considering a wide table stored in Kudu, exposed as a REST interface via spark-jobserver for real-time queries. The precondition is that the machine configuration is demanding: at least an 8-node cluster with 256 GB of memory per machine.
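
As a rough sketch of what the query side could look like (the kudu-spark connector options are real, but the master address, table name, and wrapping this behind a spark-jobserver REST route are assumptions):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("order-wide-query")
         # the kudu-spark package must be on the classpath, e.g. via
         # --packages org.apache.kudu:kudu-spark3_2.12:<version>
         .getOrCreate())

orders = (spark.read.format("kudu")
          .option("kudu.master", "kudu-master:7051")   # assumed address
          .option("kudu.table", "order_wide")          # assumed table
          .load())

# Predicate pushdown lets Kudu answer a per-customer lookup quickly.
orders.filter(orders.customer_id == "C001").show()
```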

CodePudding user response:

Let me add to my question, since it wasn't clear. Sorry.

First of all, this is an order table, which means inserts are very frequent. Second, the historical data matters a lot: on top of the requirements above, I need to pull out three years of data matching the query conditions.

I just wanted to ask whether big data can do this in real time; otherwise I'll keep querying Oracle directly in real time.

CodePudding user response:

reference the 5th-floor reply from cocoa2003:
I suggest considering a wide table stored in Kudu, exposed as a REST interface via spark-jobserver for real-time queries; the precondition is demanding machine configuration, at least an 8-node cluster with 256 GB of memory per machine.

Can real-time orders be written into it as they arrive?

CodePudding user response:

Let me describe my current solution; it feels rather crude.

In the logical database I created one table containing the frequently queried fields. Triggers on the four source tables capture every insert and update and flush the change into this big wide table. That avoids the multi-table join, but it costs write performance and looks very low-tech. I just want to know how big-data techniques would solve this kind of problem.

CodePudding user response:

Could you just put the data in memory and query it there?

CodePudding user response:

reference the 8th-floor reply from yikewl:
Let me describe my current solution; it feels rather crude.

In the logical database I created one table containing the frequently queried fields. Triggers on the four source tables capture every insert and update and flush the change into the big wide table. That avoids the multi-table join, but it costs write performance and looks very low-tech. I just want to know how big-data techniques would solve this kind of problem.

There is no perfect solution; everything that sounds impressive rests on unglamorous principles.
The best practice is whatever fits your current needs and your expected growth.
My advice is to start from the main principles of distributed systems, such as the CAP theorem, distributed consensus algorithms (Raft), and the BASE theory.
Then study the existing distributed databases (NoSQL such as HBase and Mongo, OLTP RDB clusters, HTAP such as TiDB), as well as data partitioning algorithms, MapReduce, consistent hashing, and so on. A sketch of the last item follows below.
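
To make consistent hashing concrete, here is a minimal hash ring in Python; the node names and the virtual-node count are made up for illustration:

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):  # virtual nodes smooth the balance
                h = self._hash(f"{node}#{i}")
                bisect.insort(self._ring, (h, node))

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, ""))
        if idx == len(self._ring):  # wrap around the ring
            idx = 0
        return self._ring[idx][1]

ring = ConsistentHashRing(["db-1", "db-2", "db-3"])
print(ring.node_for("order:42"))  # the same key always maps to the same node
```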

CodePudding user response:

Build a wide table containing all the fields of the four tables, keyed on the primary key. Write a scheduled task that keeps flushing newly added data from the source tables into the wide table, and have queries hit the wide table directly. If the wide table is still slow, put it in Mongo and query it through Mongo directly?
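
A minimal sketch of that timed task plus Mongo query, assuming pymongo, a wide collection named 'order_wide', and a caller-supplied function that runs the four-table join in Oracle for rows changed since a watermark (all names are assumptions):

```python
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")  # assumed location
wide = mongo["orders"]["order_wide"]

def refresh(read_changed_since, last_run_ts):
    """read_changed_since(ts) is assumed to run the four-table join in
    Oracle for rows modified after ts and yield flat dicts."""
    for row in read_changed_since(last_run_ts - 60):  # overlap to avoid gaps
        # upsert by order id, so re-processing the same row is harmless
        wide.replace_one({"_id": row["order_id"]}, row, upsert=True)

def orders_for_customer(customer_id, page=0, size=20):
    # pagination against the pre-joined wide documents, no join at read time
    cursor = (wide.find({"customer_id": customer_id})
                  .sort("created_at", -1)
                  .skip(page * size).limit(size))
    return list(cursor)
```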

CodePudding user response:

"Big data" means two things. One is genuinely "big" discrete data, e.g. tens of billions of records spread over thousands of machines, and how to query thousands of table shards as if they were one table. The other is a pile of mathematical and statistical packages, down to the most basic so-called neural-network classification algorithms. When people use the buzzword "big data" in the first sense, nobody claims it is faster than single-machine processing; the point is that the data cannot possibly fit on a single machine, so it is split across thousands of machines for storage.

Now back to your question: you are using a database "trigger" to write data into a cache, and that is putting the cart before the horse. The cached data should come first. Even before the data has been persisted to the database, it should already have landed in the cache; the database write can be asynchronous and need not block the real business operation. In particular, you should not force many threads to contend for database transaction locks and similar restrictive garbage. So driving the cache from a database table trigger is exactly backwards.
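
A minimal sketch of that cache-first write path, assuming Redis as the cache and a background thread draining writes to the database; the key layout and the persistence helper are made up for illustration:

```python
import json
import queue
import threading
import redis

r = redis.Redis(host="localhost", port=6379)   # assumed cache location
db_queue: "queue.Queue[dict]" = queue.Queue()

def persist_to_oracle(order: dict) -> None:
    # placeholder for the real INSERT; assumed to be idempotent
    print("persisted", order["order_id"])

def save_order(order: dict) -> None:
    # 1) cache first: this is all the business operation waits for
    r.set(f"order:{order['order_id']}", json.dumps(order))
    # 2) the database write is queued and happens asynchronously
    db_queue.put(order)

def db_writer():
    while True:
        persist_to_oracle(db_queue.get())
        db_queue.task_done()

threading.Thread(target=db_writer, daemon=True).start()
save_order({"order_id": 1, "amount": 99.5})
```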

CodePudding user response:

Doing all the reads and writes straight against the relational database, and coupling every technique to database tables and triggers, is where things go wrong.

CodePudding user response:

reference the 12th-floor reply from sp1234:
"Big data" means two things. One is genuinely "big" discrete data, e.g. tens of billions of records spread over thousands of machines, and how to query thousands of table shards as if they were one table. The other is a pile of mathematical and statistical packages, down to the most basic so-called neural-network classification algorithms. When people use the buzzword "big data" in the first sense, nobody claims it is faster than single-machine processing; the point is that the data cannot possibly fit on a single machine, so it is split across thousands of machines for storage.

Now back to your question: you are using a database "trigger" to write data into a cache, and that is putting the cart before the horse. The cached data should come first. Even before the data has been persisted to the database, it should already have landed in the cache; the database write can be asynchronous and need not block the real business operation. In particular, you should not force many threads to contend for database transaction locks and similar restrictive garbage. So driving the cache from a database table trigger is exactly backwards.


I read your reply carefully and it makes a lot of sense, but I still can't come up with a concrete solution. Please point me in a more specific direction. Thank you.

CodePudding user response:

1. From a business point of view, for real-time access to historical orders, see whether you can distinguish orders that still change dynamically from historical orders that no longer change.
2. If you can separate them, query the dynamically changing orders from Oracle; for the unchanging historical orders, build a wide table, compute the results in real time with Spark Streaming, store them in HBase, and serve the front-end queries from there.
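
A rough sketch of point 2, assuming new orders arrive on a Kafka topic named 'orders' and the wide rows land in an HBase table named 'order_history' (the names, hosts, and JSON payload format are all assumptions):

```python
import json
import happybase
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("order-history-stream").getOrCreate()

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")  # assumed
          .option("subscribe", "orders")
          .load())

def write_batch(batch_df, batch_id):
    # runs once per micro-batch; collecting to the driver is fine for a sketch
    conn = happybase.Connection("hbase-host")  # assumed host
    table = conn.table("order_history")
    for row in batch_df.select("value").collect():
        payload = bytes(row["value"])          # Kafka value column is binary
        order = json.loads(payload)
        key = f"{order['customer_id']}_{order['order_id']}".encode()
        table.put(key, {b"o:payload": payload})
    conn.close()

query = stream.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()
```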

CodePudding user response:

Although I'm a beginner, let me offer some ideas too. I suggest Solr plus HBase plus Spark: use Spark to compute the joined data and load it into HBase, build a full-field index in Solr, and then hit Solr for queries. The idea is the same as above, and the speed is faster. There is a lot of room to optimize the four-table join, and Spark can handle that here. As for incremental updates, those can be done on the Solr side: with a delete flag and a timestamp you can implement incremental updates automatically. Just a humble opinion.
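
A small sketch of that incremental-update idea with pysolr; the core URL, the 'deleted' flag, the 'updated_at' timestamp field, and the change-feed function are assumptions for illustration:

```python
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/orders", always_commit=True)

def incremental_sync(fetch_changed_since, last_sync_ts):
    """fetch_changed_since(ts) is assumed to yield wide rows (dicts)
    whose 'updated_at' is newer than ts, including soft-deleted ones."""
    docs, deletes = [], []
    for row in fetch_changed_since(last_sync_ts):
        if row.get("deleted"):       # delete flag -> remove from the index
            deletes.append(row["order_id"])
        else:                        # newer timestamp -> (re)index the row
            docs.append(row)
    if deletes:
        solr.delete(id=deletes)
    if docs:
        solr.add(docs)

def search_orders(customer_id, rows=20):
    # query Solr first; full rows could then be fetched from HBase by id
    return solr.search(f"customer_id:{customer_id}",
                       sort="updated_at desc", rows=rows)
```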

CodePudding user response:

First be clear about the requirement: "list all orders" is usually not literally all of them, since they won't fit on one screen anyway, so you paginate. An ordinary e-commerce company typically uses MySQL with split databases and split tables; a single table in a single database is generally capped at 30 to 50 million rows, and queries return in milliseconds. Relational databases have plenty of optimizations for four-table joins, and your data volume is small: after splitting, each table holds only a few million rows. There is no need to bring out heavy weapons like Spark or Hadoop; I can tell you that such a setup would not be fast, and the cost is very high.
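
To illustrate why the per-customer listing stays fast under split databases and tables, a toy routing function in Python (the shard counts and naming are made up):

```python
DB_COUNT = 4        # number of MySQL databases, assumed
TABLE_COUNT = 8     # order tables per database, assumed

def route(customer_id: int) -> str:
    """All orders of one customer land in the same shard, so the
    per-customer listing stays a single small-table query."""
    db = customer_id % DB_COUNT
    table = (customer_id // DB_COUNT) % TABLE_COUNT
    return f"orders_db_{db}.orders_{table}"

# e.g. a paginated listing runs against one few-million-row table:
print(route(123456))  # -> 'orders_db_0.orders_0'
```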