What is the order of installing Hadoop, Sqoop, Zookeeper, Spark, Java, Apache, Pig, Hive, Flume, Kafka, Mysql and other packages on Ubuntu?
CodePudding user response:
Start with this https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-in-stand-alone-mode-on-ubuntu-20-04 or https://phoenixnap.com/kb/install-hadoop-ubuntu
Forget Pig and Flume; they are no longer relevant.
Add Zookeeper if you are running a Hadoop cluster.
Then Spark, then Kafka, then MySQL, though the order among those three is not critical.
CodePudding user response:
Everything you've mentioned, minus MySQL, requires Java, so start there.
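As a minimal sketch on Ubuntu (assuming the default apt repositories; OpenJDK 11 is supported by the Hadoop 3.x line):

```shell
# Install OpenJDK 11 (Hadoop 3.x supports Java 8 and 11)
sudo apt update
sudo apt install -y openjdk-11-jdk

# Verify the install and export JAVA_HOME for the services that need it
java -version
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
```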
For high availability of HDFS or Kafka, you need Zookeeper. Zookeeper has no dependencies of its own, so install it next. (Three servers minimum for a production ensemble.)
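A three-node ensemble only needs a short config plus a per-node id file; this is a sketch with placeholder hostnames (`zk1`–`zk3`) and an assumed install path of `/opt/zookeeper`:

```shell
# Minimal three-node ensemble config (hostnames and paths are placeholders)
cat > /opt/zookeeper/conf/zoo.cfg <<'EOF'
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
EOF

# Each node needs a unique id matching its server.N line in zoo.cfg
echo 1 > /var/lib/zookeeper/myid   # use 2 and 3 on the other nodes

/opt/zookeeper/bin/zkServer.sh start
```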
Kafka can be set up next since it has no dependencies beyond Java and Zookeeper. (Another three servers for high availability.)
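A sketch of wiring a broker to the ensemble, assuming an install under `/opt/kafka` and the placeholder `zk1`–`zk3` hostnames; the `broker.id` must be unique per broker:

```shell
# Point each broker at the Zookeeper ensemble and give it a unique id
sed -i 's/^broker.id=.*/broker.id=1/' /opt/kafka/config/server.properties
sed -i 's|^zookeeper.connect=.*|zookeeper.connect=zk1:2181,zk2:2181,zk3:2181|' \
    /opt/kafka/config/server.properties

/opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server.properties

# A replicated topic: replication factor 3 requires all three brokers up
/opt/kafka/bin/kafka-topics.sh --create --topic events \
    --bootstrap-server localhost:9092 --replication-factor 3 --partitions 6
```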
Hive requires a metastore database, such as MySQL, so set up MySQL next and run the Hive metastore schema scripts against it. (At least two servers for read-write MySQL replication.)
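Roughly, with placeholder database, user, and password names (this assumes the MySQL JDBC driver is on Hive's classpath and the `javax.jdo.option.Connection*` properties are set in `hive-site.xml`):

```shell
# Create the metastore database and a Hive user (names are placeholders)
sudo mysql -e "CREATE DATABASE metastore;
  CREATE USER 'hive'@'%' IDENTIFIED BY 'hivepassword';
  GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'%';"

# Initialize the metastore schema with Hive's bundled schematool
$HIVE_HOME/bin/schematool -dbType mysql -initSchema
```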
HDFS can be next: multiple NameNodes for high availability, DataNodes, and YARN. (Seven servers: two NameNodes, two ResourceManagers, and three combined DataNode/NodeManager hosts.)
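Once `core-site.xml`, `hdfs-site.xml`, and `yarn-site.xml` are configured (details vary with your HA layout, so treat this as a sketch assuming `HADOOP_HOME` is set), bringing the cluster up looks like:

```shell
# Format the NameNode once, then start HDFS and YARN
$HADOOP_HOME/bin/hdfs namenode -format
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh

# Verify DataNodes and NodeManagers registered
$HADOOP_HOME/bin/hdfs dfsadmin -report
$HADOOP_HOME/bin/yarn node -list
```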
Hive can optionally use HDFS as its warehouse storage, so that would be next, assuming you want it; NameNode high availability itself relies on Zookeeper for failover coordination. Presto or Spark are options that are faster than Hive and would also use the metastore. (Two HiveServer2 instances for high availability.)
With YARN, HDFS, and Hive, you can setup Spark.
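To let Spark submit to YARN and read the Hive metastore, one common approach (a sketch, assuming `SPARK_HOME` and `HIVE_HOME` are set) is to share Hive's config with Spark:

```shell
# Copy Hive's config so Spark can locate the metastore
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/

# Quick smoke test: run Spark SQL on YARN against the metastore
$SPARK_HOME/bin/spark-sql --master yarn -e "SHOW DATABASES;"
```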
Flume would be next, but only if you actually need it; otherwise, applications can be written to send data directly to Kafka.
Sqoop is a retired Apache project, and Spark can be used instead. Same for Pig.
In total, a minimal production-ready Hadoop cluster with Kafka and MySQL would require at least 17 servers. If you add load balancers and LDAP/Active Directory, then add more.