For example, I have log files for multiple dates: "20100101", "20100102", ..., "20220222".
I have mapper.py. This script parses a log file and sends the mapped data to a database.
In this case, I want my mapper (maybe 10 instances) to read a log file and send the results to the database, and then repeat the job for the next date, as sketched below.
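Simplified, the mapper does something like this (the log format and the database write below are just placeholders for what the real script does):

```python
#!/usr/bin/env python
# mapper.py -- simplified sketch; the real parsing and DB client are placeholders
import sys

def parse(line):
    # hypothetical tab-separated log format: "timestamp<TAB>user<TAB>event"
    timestamp, user, event = line.rstrip("\n").split("\t")
    return {"timestamp": timestamp, "user": user, "event": event}

def send_to_db(record):
    # placeholder for the actual database write (e.g. one INSERT per record)
    pass

for line in sys.stdin:
    if line.strip():
        send_to_db(parse(line))
```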
CodePudding user response:
Hadoop Streaming is already "distributed", but it is isolated to one input and output stream. You would need to write a script that loops over the files and runs an individual streaming job per file.
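For example, a minimal sketch of such a driver script; the streaming jar location, the date list, and the /logs/&lt;date&gt; HDFS paths are assumptions about your setup:

```python
#!/usr/bin/env python
# run_streaming_jobs.py -- submit one Hadoop Streaming job per date-named log file
import subprocess

STREAMING_JAR = "/path/to/hadoop-streaming.jar"   # depends on your installation
DATES = ["20100101", "20100102"]                  # build the full date range however you like

for date in DATES:
    cmd = [
        "hadoop", "jar", STREAMING_JAR,
        "-file", "mapper.py",                 # ship the mapper with the job
        "-mapper", "python mapper.py",
        "-input", f"/logs/{date}",            # hypothetical HDFS input path
        "-output", f"/output/{date}",         # must not already exist
    ]
    # wait for each job to finish before starting the next one
    subprocess.run(cmd, check=True)
```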
If you want to batch-process many files, then you should upload all the files to a single HDFS folder, and then you can use mrjob (assuming you actually want MapReduce), or you could switch to pyspark to process them all in parallel, since I see no need to do that sequentially.
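If you go the pyspark route, a rough sketch would be to read every file under one HDFS folder and apply the same parsing in parallel; the folder path, parsing, and database write below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parse-logs").getOrCreate()

# read every file under the folder at once; Spark splits the work across executors
logs = spark.sparkContext.textFile("hdfs:///logs/*")  # hypothetical path

def parse_line(line):
    # reuse whatever parsing mapper.py already does
    timestamp, user, event = line.split("\t")
    return (timestamp, user, event)

def write_partition(rows):
    # open one DB connection per partition and insert the rows;
    # the connection/insert code is a placeholder
    for row in rows:
        pass

logs.filter(lambda l: l.strip()).map(parse_line).foreachPartition(write_partition)
spark.stop()
```

Using foreachPartition keeps it to one database connection per partition instead of one per record.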