Background: The company recently needs to build a large data center. The data center will be built in city A, while the application systems that produce the data are located in cities B, C, and D. The data of each subsystem in A, B, and C (scattered across various tables) now needs to be brought to the data center in A for processing. Each subsystem's data in A, B, and C is stored in MSSQL.
Because we have not done much data processing or data center building before, we lack experience and have the following questions. We would be very grateful for everyone's advice:
1. What technology should be used to transmit the data from the application systems to the data center?
2. For the data center's machines, what systems and technologies should be adopted for the operating system, data storage, data processing, and cluster management, and how should the whole thing be architected?
Please point us in the right direction. If that is too much trouble, just pointing out which technology is needed where would be enough. Thank you very much!
CodePudding user response:
I don't know much about the transfer side, but for storage you can use HDFS. For the MSSQL data, you can use master-slave replication to build a replica at the data center, then use Sqoop to import it into HDFS in Parquet format, and let the upper-layer big data applications analyze it through Hive/Spark. Server logs can be collected into HDFS through Flume and then analyzed with ELK (Logstash, Elasticsearch, Kibana); in our case, though, we had Flume sink directly into HBase and accessed and analyzed it with Spark (our data analysis basically revolves around Spark).
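To make the "replica -> Parquet in HDFS -> Spark analysis" flow above more concrete, here is a minimal PySpark sketch. Note it uses Spark's built-in JDBC reader for the ingest step instead of Sqoop (a deliberate swap, so the whole flow fits in one language); all hostnames, database/table names, paths, columns, and credentials below are made-up placeholders, not anything from the original post.

```python
from pyspark.sql import SparkSession

# Minimal sketch: pull one table from the MSSQL replica, land it in HDFS
# as Parquet, then query it with Spark. Requires the Microsoft mssql-jdbc
# driver jar on the Spark classpath (e.g. passed via --jars).
spark = SparkSession.builder.appName("mssql-to-parquet").getOrCreate()

# 1. Ingest: read a subsystem table from the MSSQL replica over JDBC.
#    (The answer uses Sqoop for this step; Spark's JDBC reader is an
#    alternative shown here for illustration.)
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://replica-host:1433;databaseName=appdb")  # placeholder host/db
    .option("dbtable", "dbo.orders")   # placeholder table
    .option("user", "etl_user")        # placeholder credentials
    .option("password", "etl_password")
    .load()
)

# 2. Land the data in HDFS as Parquet, as the answer describes.
orders.write.mode("overwrite").parquet("hdfs://namenode:8020/warehouse/orders")

# 3. Analyze: read the Parquet back and run a simple aggregation,
#    standing in for the Hive/Spark analysis layer ("city" is a
#    placeholder column).
df = spark.read.parquet("hdfs://namenode:8020/warehouse/orders")
df.groupBy("city").count().show()
```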
CodePudding user response:
If you want to build and monitor a large-scale cluster, you can use Ambari. Ambari automatically assembles the Hortonworks distribution of Hadoop (HDP) for you, and it can also install other Hadoop ecosystem components such as HBase, Hive, ZooKeeper, and Spark. In my actual experience, though, it is most compatible with CentOS, so which Linux distribution to standardize on is something you will need to consider. If the data center is really ambitious, you could also build a private cloud (OpenStack) plus containers (Docker), but I don't understand those at all.
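Once Ambari is up, its REST API can also be scripted for basic monitoring alongside the web UI. Below is a minimal Python sketch that asks Ambari for the state of every service in a cluster; the host, the cluster name, and the admin/admin credentials are placeholder assumptions (8080 is Ambari's default port), so adjust them to your deployment.

```python
import requests

# Minimal monitoring sketch against Ambari's REST API (v1).
# "ambari-host", the cluster name "bigcluster", and the admin/admin
# credentials are placeholders.
AMBARI = "http://ambari-host:8080/api/v1"
CLUSTER = "bigcluster"

resp = requests.get(
    f"{AMBARI}/clusters/{CLUSTER}/services",
    params={"fields": "ServiceInfo/state"},  # request only each service's state
    auth=("admin", "admin"),
)
resp.raise_for_status()

# Print each installed service (HDFS, HIVE, HBASE, ZOOKEEPER, SPARK, ...)
# and whether Ambari reports it as STARTED.
for item in resp.json()["items"]:
    info = item["ServiceInfo"]
    print(f'{info["service_name"]}: {info.get("state", "UNKNOWN")}')
```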