How to distribute MapReduce tasks in Hadoop Streaming


For example, I have log files for multiple dates: "20100101, 20100102, ..., 20220222".
I have mapper.py; this script parses a log file and sends the mapped data to a database.
I want my mapper (maybe 10 instances) to read a log file and send the results to the DB, and then repeat the job for the next date.

CodePudding user response:

Hadoop Streaming is already "distributed", but each job is isolated to one input and output stream. You would need to write a script that loops over the files and runs an individual streaming job per file, like the sketch below.
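For instance, a minimal Python driver, assuming your daily logs live under `/logs/<date>.log` on HDFS and that the streaming jar path matches your cluster layout (both are assumptions to adjust):

```python
import subprocess

# Hypothetical paths -- adjust to your cluster layout.
STREAMING_JAR = "/opt/hadoop/share/hadoop/tools/lib/hadoop-streaming.jar"
DATES = ["20100101", "20100102"]  # ... build the full date list you need

for date in DATES:
    # One streaming job per daily log file; each job is still a
    # distributed map phase over that file's splits.
    subprocess.run(
        [
            "hadoop", "jar", STREAMING_JAR,
            # Generic options must come before the streaming options.
            "-files", "mapper.py",
            "-D", f"mapreduce.job.name=parse-{date}",
            "-input", f"/logs/{date}.log",
            "-output", f"/out/{date}",
            "-mapper", "python3 mapper.py",
            "-numReduceTasks", "0",  # map-only: the mapper writes to the DB itself
        ],
        check=True,
    )
```

Because the jobs run with `check=True` and sequentially, a failed date stops the loop; you could also launch several in parallel if the cluster and the database can absorb the load.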

If you want to batch process many files, you should upload them all to a single HDFS folder; then you can use mrjob (assuming you actually want MapReduce), or you could switch to PySpark to process them all in parallel, since there is no need to do this sequentially.
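A minimal PySpark sketch of that approach, assuming the logs were uploaded under `hdfs:///logs/`; `parse_line` and `save_to_db` are hypothetical stand-ins for the logic already in your mapper.py:

```python
from pyspark.sql import SparkSession

def parse_line(line):
    # Placeholder for your mapper.py parsing logic (hypothetical).
    return line.split(",")

def save_to_db(record):
    # Placeholder for your DB write (hypothetical).
    pass

spark = SparkSession.builder.appName("parse-all-logs").getOrCreate()

# One RDD over every date's log file at once; Spark parallelizes
# across all files' splits instead of one job per date.
lines = spark.sparkContext.textFile("hdfs:///logs/*")

def handle_partition(lines_iter):
    # Process a whole partition per task, so you can reuse one DB
    # connection per partition instead of opening one per record.
    for line in lines_iter:
        save_to_db(parse_line(line))

lines.foreachPartition(handle_partition)
spark.stop()
```

Writing to the database from `foreachPartition` rather than a per-record function keeps the connection count proportional to the number of partitions, which matters when hundreds of tasks hit the DB concurrently.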
