For example, I have log files for multiple dates: "20100101", "20100102", ..., "20220222".
I have mapper.py. This script parses a log file and sends the mapped data to a database.
In this case, I want my mapper (maybe 10 instances) to read a log file and send the results to the database, and then repeat the job for the next date, as sketched below.
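Simplified, the mapper does something like this (the log format and the database write below are just placeholders for what the real script does):

```python
#!/usr/bin/env python
# mapper.py -- simplified sketch; the real parsing and DB client are placeholders
import sys

def parse(line):
    # hypothetical tab-separated log format: "timestamp<TAB>user<TAB>event"
    timestamp, user, event = line.rstrip("\n").split("\t")
    return {"timestamp": timestamp, "user": user, "event": event}

def send_to_db(record):
    # placeholder for the actual database write (e.g. one INSERT per record)
    pass

for line in sys.stdin:
    if line.strip():
        send_to_db(parse(line))
```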
CodePudding user response:
Hadoop Streaming is already "distributed", but it is isolated to one input and output stream. You would need to write a script that loops over the files and runs an individual streaming job per file.
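For example, a minimal sketch of such a driver script; the streaming jar location, the date list, and the /logs/&lt;date&gt; HDFS paths are assumptions about your setup:

```python
#!/usr/bin/env python
# run_streaming_jobs.py -- submit one Hadoop Streaming job per date-named log file
import subprocess

STREAMING_JAR = "/path/to/hadoop-streaming.jar"   # depends on your installation
DATES = ["20100101", "20100102"]                  # build the full date range however you like

for date in DATES:
    cmd = [
        "hadoop", "jar", STREAMING_JAR,
        "-file", "mapper.py",                 # ship the mapper with the job
        "-mapper", "python mapper.py",
        "-input", f"/logs/{date}",            # hypothetical HDFS input path
        "-output", f"/output/{date}",         # must not already exist
    ]
    # wait for each job to finish before starting the next one
    subprocess.run(cmd, check=True)
```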
If you want to batch-process many files, then you should upload all the files to a single HDFS folder, and then you can use mrjob (assuming you actually want MapReduce), or you could switch to pyspark to process them all in parallel, since I see no need to do that sequentially.
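If you go the pyspark route, a rough sketch would be to read every file under one HDFS folder and apply the same parsing in parallel; the folder path, parsing, and database write below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parse-logs").getOrCreate()

# read every file under the folder at once; Spark splits the work across executors
logs = spark.sparkContext.textFile("hdfs:///logs/*")  # hypothetical path

def parse_line(line):
    # reuse whatever parsing mapper.py already does
    timestamp, user, event = line.split("\t")
    return (timestamp, user, event)

def write_partition(rows):
    # open one DB connection per partition and insert the rows;
    # the connection/insert code is a placeholder
    for row in rows:
        pass

logs.filter(lambda l: l.strip()).map(parse_line).foreachPartition(write_partition)
spark.stop()
```

Using foreachPartition keeps it to one database connection per partition instead of one per record.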