I have a CSV file with 65K rows. For each row I do some processing that generates a string, which I then have to write/append to a file.
Pseudocode:
for row in csv_data:
    processed_string = ...  # some processing per row
    file_pointer.write(processed_string + '\n')
How can I make this write operation run in parallel, so that the main processing loop does not have to wait for the file writes? I tried batch writing (store n lines, then write them all at once), but it would be great if you could suggest a method that does this in parallel. Thanks!
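For illustration, something like this is the pattern I am after: a dedicated writer thread fed through a queue, so the processing loop never waits on disk (process_row and results.txt here are just placeholders for my actual processing and output file):

import queue
import threading

def writer_thread(out_path, q):
    # dedicated writer: drain the queue and append each string to the file
    with open(out_path, 'a') as f:
        while True:
            s = q.get()
            if s is None:        # sentinel: processing is finished
                break
            f.write(s + '\n')

q = queue.Queue(maxsize=1000)    # bounded queue keeps memory usage in check
t = threading.Thread(target=writer_thread, args=('results.txt', q))
t.start()

for row in csv_data:
    processed_string = process_row(row)  # placeholder for the real per-row work
    q.put(processed_string)      # hand off to the writer, do not block on disk

q.put(None)                      # tell the writer we are done
t.join()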
Edit: There are 65K records in the CSV file. Processing each record generates a multiline string of roughly 10-15 lines, which I have to write to a file. The processing alone normally takes about 10 mins to run, but adding these file operations adds another 2-3 mins. So can I do the writing in parallel without slowing down the main processing?
Here is the code part.
for i in range(len(queries)):  # 65K iterations
    Logs.log_query(i, name, version)
    # processed_results = Some processing ...
    # Final Answer
    s = final_results(name, version, processed_results)  # returns a multiline string
    f.write(s + '\n')
"""
EXAMPLE OUTPUT:
-----------------
[0] NAME: Adobe Acrobat Reader DC | VERSION: 21.005
FAISS RESULTS (with cutoff 0.63)
id name version eol_date extended_eol_date major_version minor_version score
1486469 Adobe Acrobat Reader DC 21.005.20054 07-04-2020 07-07-2020 21 005 0.966597
327901 Adobe Acrobat Reader DC 21.005.20048 07-04-2020 07-07-2020 21 005 0.961541
327904 Adobe Acrobat Reader DC 21.007.20095 07-04-2020 07-07-2020 21 007 0.960825
327905 Adobe Acrobat Reader DC 21.007.20099 07-04-2020 07-07-2020 21 007 0.960557
327902 Adobe Acrobat Reader DC 21.005.20060 07-04-2020 07-07-2020 21 005 0.958580
327900 Adobe Acrobat Reader DC 21.001.20145 07-04-2020 07-07-2020 21 001 0.956085
327903 Adobe Acrobat Reader DC 21.007.20091 07-04-2020 07-07-2020 21 007 0.954148
1486465 Adobe Acrobat Reader DC 20.006.20034 07-04-2020 07-07-2020 20 006 0.941820
1486459 Adobe Acrobat Reader DC 19.012.20035 07-04-2020 07-07-2020 19 012 0.928502
1486466 Adobe Acrobat Reader DC 20.012.20048 07-04-2020 07-07-2020 20 012 0.928366
1486458 Adobe Acrobat Reader DC 19.012.20034 07-04-2020 07-07-2020 19 012 0.925761
1486461 Adobe Acrobat Reader DC 19.021.20047 07-04-2020 07-07-2020 19 021 0.922519
1486463 Adobe Acrobat Reader DC 19.021.20049 07-04-2020 07-07-2020 19 021 0.919659
1486462 Adobe Acrobat Reader DC 19.021.20048 07-04-2020 07-07-2020 19 021 0.917590
1486464 Adobe Acrobat Reader DC 19.021.20061 07-04-2020 07-07-2020 19 021 0.912260
1486460 Adobe Acrobat Reader DC 19.012.20040 07-04-2020 07-07-2020 19 012 0.909160
1486457 Adobe Acrobat Reader DC 15.008.20082 07-04-2020 07-07-2020 15 008 0.902536
327899 Adobe Acrobat DC 21.007.20099 07-04-2020 07-07-2020 21 007 0.895940
1277732 Acrobat Reader DC (classic) 2015 07-07-2020 * 2015 NaN 0.875471
OPEN SEARCH RESULTS (with cutoff 13)
{ "score": 67.98198, "id": 327901, "name": Adobe Acrobat Reader DC, "version": 21.005.20048, "eol_date": 2020-04-07, "extended_eol_date": 2020-07-07 }
{ "score": 66.63623, "id": 327902, "name": Adobe Acrobat Reader DC, "version": 21.005.20060, "eol_date": 2020-04-07, "extended_eol_date": 2020-07-07 }
{ "score": 65.96028, "id": 1486469, "name": Adobe Acrobat Reader DC, "version": 21.005.20054, "eol_date": 2020-04-07, "extended_eol_date": 2020-07-07 }
FINAL ANSWER [OPENSEARCH]
{ "score": 67.98198, "id": 327901, "name": Adobe Acrobat Reader DC, "version": 21.005.20048, "eol_date": 2020-04-07, "extended_eol_date": 2020-07-07 }
----------------------------------------------------------------------------------------------------
"""
Answer:
Q : " Writing to a file parallely while processing in a loop in python ... "
A :
Frankly speaking, file-I/O is not your performance enemy here.
"With all due respect to the colleagues, Python has (since ever) used the GIL-lock to avoid any level of truly concurrent execution - it actually re-SERIAL-ises the code-execution flow into dancing among any number of threads, lending a small slice of code-interpretation time to one-AFTER-another-AFTER-another, thus only increasing the interpreter's overhead (and devastating all pre-fetches into the CPU-core caches on each turn, paying the full mem-I/O costs on each next re-fetch). So threading is an ANTI-pattern in Python (except, I may accept, for masking network-(long)-transport latencies)." – user3666197
Given that the ~65k records listed in the CSV ought to get processed ASAP, performance-tuned orchestration is the goal, with file-I/O being just a negligible (and by-design well latency-maskable) part thereof (which does not mean we cannot screw it up even more, by organising it into another performance-devastating ANTI-pattern, can we?)
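You can verify the GIL claim above in a few lines (a minimal sketch; cpu_work is just a hypothetical stand-in for any CPU-bound scoring step): the threaded variant runs no faster than the serial one, because the GIL lets only one thread interpret bytecode at a time.

import time
from concurrent.futures import ThreadPoolExecutor

def cpu_work(n):
    # purely CPU-bound: the GIL prevents two of these from running at once
    return sum(i * i for i in range(n))

N, TASKS = 2_000_000, 4

t0 = time.perf_counter()
for _ in range(TASKS):
    cpu_work(N)
serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=TASKS) as ex:
    list(ex.map(cpu_work, [N] * TASKS))
threaded = time.perf_counter() - t0

print(f"serial: {serial:.2f}s   threaded: {threaded:.2f}s")  # roughly equal, or worse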
Tip #1 : avoid & resist using any low-hanging-fruit SLOCs if The Performance is the goal
If the code starts with the cheapest-ever iterator clause,
be it the mock-up for aRow in aCsvDataSET: ...
or the real code for i in range( len( queries ) ): ...
then these (besides being known for ages as awfully slow parts of Python's code-interpretation capabilities, the second one even being an iterator-on-a-range()-iterator in Py3 and a silent RAM-killer in Py2 for any larger range) look nice in "structured-programming" evangelisation, as they form a syntax-compliant separation of a deeper level of the code, yet they do so at awfully high cost, due to the accumulation of repeatedly paid overheads. A finally injected need to "coordinate" unordered concurrent file-I/O operations, not necessary in principle at all if done smart, is one example of the adverse performance impact of such trivial SLOCs (and of similarly poor design decisions).
Better way? (a sketch follows this list)
- a ) avoid the top-level (slow & overhead-expensive) looping
- b ) "split" the 65k-parameter space into not many more blocks than there are memory-I/O channels on your physical device (the scoring process, I can guess from the posted text, is memory-I/O intensive, as some model has to go through all the texts for the scoring to happen)
- c ) spawn n_jobs-many process workers via joblib.Parallel( n_jobs = ... )( delayed( <_scoring_fun_> )( block_start, block_end, ...<_params_>... ) ), each running the scoring_fun(...) over its distributed block-part of the 65k-long parameter space
- d ) having computed the scores and related outputs, each worker process can and shall file-I/O its own results into its private, exclusively owned, conflict-free output file
- e ) having finished all partial block-parts' processing, the main Python process can simply join the already stored outputs (just-[CONCURRENTLY] created, smoothly & non-blockingly O/S-buffered, real-hardware-deposited), if there is such a need at all,
and finito - we are done (knowing there is no faster way to compute the same block of tasks, which are in principle embarrassingly independent, apart from the need to orchestrate them collision-free at minimised add-on cost).
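A minimal sketch of steps b ) - e ), assuming the per-row work can be wrapped in a block-level function (score_block, process_and_score, the block-splitting scheme and the file names below are illustrative assumptions, not the only way to do it):

from joblib import Parallel, delayed

def score_block(block_id, rows):
    # c) each worker processes its own block of the 65k-row parameter space
    out_path = f"results_part_{block_id}.txt"
    with open(out_path, "w") as f:        # d) private, conflict-free output file
        for row in rows:
            s = process_and_score(row)    # placeholder for the real scoring + final_results(...)
            f.write(s + "\n")
    return out_path

N_JOBS = 4                                # b) start near the number of memory-I/O channels
blocks = [queries[i::N_JOBS] for i in range(N_JOBS)]  # b) split the parameter space

paths = Parallel(n_jobs=N_JOBS)(          # c) spawn the process workers
    delayed(score_block)(b_id, rows) for b_id, rows in enumerate(blocks)
)

with open("results_all.txt", "w") as out:  # e) main process joins the partial outputs
    for p in paths:
        with open(p) as part:
            out.write(part.read())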
If interested in tweaking real-system end-to-end processing performance:
- start with an lstopo map of your hardware topology,
- next verify the number of physical memory-I/O channels,
- and then experiment a bit with the joblib.Parallel()-process instantiation, under-subscribing or over-subscribing n_jobs a bit below or a bit above the number of physical memory-I/O channels. If the actual processing has some maskable latencies hidden from us, there may be a chance to spawn more n_jobs-workers; keep increasing as long as the end-to-end processing performance keeps steadily growing, until system noise hides any further performance-tweaking effect.
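A rough sketch of that tuning experiment, reusing the hypothetical score_block from above (the candidate n_jobs values are machine-specific assumptions; keep the setting that wins end-to-end):

import time
from joblib import Parallel, delayed

for n_jobs in (2, 4, 6, 8):               # probe around the memory-I/O channel count
    blocks = [queries[i::n_jobs] for i in range(n_jobs)]
    t0 = time.perf_counter()
    Parallel(n_jobs=n_jobs)(
        delayed(score_block)(b_id, rows) for b_id, rows in enumerate(blocks)
    )
    print(f"n_jobs={n_jobs}: {time.perf_counter() - t0:.1f} s end-to-end")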
Bonus - why un-managed sources of latency kill The Performance