How can I load and merge several .txt files in a memory-efficient way in Python?

Time:02-14

I am trying to read several (>1000) .txt files and merge them into one pandas dataframe (to run an analysis on the entire dataset). I am running this through SSH on an HPC server, on which I requested 50 GB RAM, 1 node, 1 task-per-node (that was all just a wild guess; I have never done this before). So far, my idea was this:

import pandas as pd

li = []

for filename in all_files:   # all_files: list of paths to the .txt files
    df = pd.read_csv(filename, sep=None, header=0, engine='python')
    li.append(df)

df = pd.concat(li, axis=0, ignore_index=True)

but after a few hours, around the 360th file, the process gets killed and I get this error message:

numpy.core._exceptions.MemoryError: Unable to allocate 1.11 GiB for an array with shape (10, 14921599) and data type float64

Do you have any idea how to load and merge the data in a more memory-efficient way? (I assume just requesting more RAM still would not get me through the entire set of .txt files!?)

Also, I would like to save the resulting dataframe in a 'memory-efficient' way afterwards. Do you know the best way/format (CSV?) to do that?

Any help would be much appreciated!

CodePudding user response:

As you said, you have a lot of files and they need a lot of memory, so I suggest loading each file and appending it to a single file on disk (append mode, i.e. adding data to the previously saved data) instead of keeping everything in RAM, like this:

for filename in all_files:
    df = pd.read_csv(filename, sep=None, header=0, engine='python')
    # append each file to one on-disk CSV instead of holding everything in RAM
    # (remove any stale ./data.csv before the first run, since mode='a' only appends)
    df.to_csv('./data.csv', header=False, index=False, mode='a')

After saving all of the files into a single file, you can read it back as one dataframe like this:

df = pd.read_csv('./data.csv', header=None, index_col=False)

If you still run into memory problems when reading that file, you can process it in chunks with a reader like this:

chunksize = 10 ** 6   # rows per chunk; tune this to your memory budget
with pd.read_csv('./data.csv', header=None, index_col=False, chunksize=chunksize) as reader:
    for chunk in reader:
        ...   # process each chunk here instead of loading the whole file at once
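
For example, a minimal chunk-wise sketch (assuming the merged data.csv is purely numeric, and using per-column means as a stand-in for whatever your real analysis needs) could look like this:

import pandas as pd

chunksize = 10 ** 6
totals, n_rows = None, 0

with pd.read_csv('./data.csv', header=None, index_col=False,
                 chunksize=chunksize, dtype='float32') as reader:
    for chunk in reader:
        # accumulate per-column sums instead of keeping every row in RAM
        sums = chunk.sum()
        totals = sums if totals is None else totals + sums
        n_rows += len(chunk)

column_means = totals / n_rows
print(column_means)

This way only one chunk (plus a tiny accumulator) is ever resident in memory, and the explicit dtype='float32' halves the footprint compared to the default float64.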

CodePudding user response:

Q : "How can I ... Any help would be much appreciated!"

A :
Best follow the laws of the ECONOMY-of-COMPUTING :

Your briefly sketched problem has, without question, immense "setup" costs, with an unspecified amount of useful work to be computed on an unspecified HPC ecosystem.

Even without hardware & rental details ( the devil is always hidden in the detail(s), and one can easily pay absurd amounts of money trying to make a covertly "shared" platform deliver any improved computing performance - many startups have experienced this on voucher-sponsored promises, all the more so if the overall computing strategy was poorly designed ), a few elementary rules of thumb still apply.


ELEMENTARY RULES-of-THUMB ... how much we pay in [TIME]

Sorry if these were already known to you; I am just trying to build some common ground, as a platform on which to lay further argumentation rock-solid. More details are here, and this is only a needed beginning, as more problems will definitely come from any real-world O( Mx * Ny * ... )-scaling related issues in further modelling.

                0.1 ns - CPU NOP - a DO-NOTHING instruction
                0.5 ns - CPU L1 dCACHE reference           (1st introduced in the late 1980s)
                1   ns - speed-of-light (a photon) travels a 1 ft (30.5 cm) distance -- will stay, throughout any foreseeable future :o)
              3~4   ns - CPU L2  CACHE reference           (2020/Q1)
                7   ns - CPU L2  CACHE reference
               19   ns - CPU L3  CACHE reference           (2020/Q1 considered slow on 28c Skylake)
______________________on_CPU______________________________________________________________________________________
               71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
              100   ns - own DDR MEMORY reference
              135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
              325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
            2,500   ns - Read  10 kB sequentially from  MEMORY------ HPC-node
           25,000   ns - Read 100 kB sequentially from  MEMORY------ HPC-node
          250,000   ns - Read   1 MB sequentially from  MEMORY------ HPC-node
        2,500,000   ns - Read  10 MB sequentially from  MEMORY------ HPC-node
       25,000,000   ns - Read 100 MB sequentially from  MEMORY------ HPC-node
_____________________________________________________________________________own_CPU/DDR__________________________
          500,000   ns - Round trip within a same DataCenter ------- HPC-node / HPC-storage latency on each access
       20,000,000   ns - Send   2 MB over 1 Gbps  NETWORK
      200,000,000   ns - Send  20 MB over 1 Gbps  NETWORK
    2,000,000,000   ns - Send 200 MB over 1 Gbps  NETWORK
   20,000,000,000   ns - Send   2 GB over 1 Gbps  NETWORK
  200,000,000,000   ns - Send  20 GB over 1 Gbps  NETWORK
  300,000,000,000   ns - Send  30 GB over 1 Gbps  NETWORK
____________________________________________________________________________via_LAN_______________________________
      150,000,000   ns - Send a NETWORK packet CA -> Netherlands
____________________________________________________________________________via_WAN_______________________________
       10,000,000   ns - DISK seek spent to start file-I/O on spinning disks on any next piece of data seek/read
       30,000,000   ns - DISK   1 MB sequential READ from a DISK
      300,000,000   ns - DISK  10 MB sequential READ from a DISK
    3,000,000,000   ns - DISK 100 MB sequential READ from a DISK
   30,000,000,000   ns - DISK   1 GB sequential READ from a DISK
  300,000,000,000   ns - DISK  10 GB sequential READ from a DISK
______________________on_DISK_______________________________________________own_DISK______________________________
 |   |   |   |   |
 |   |   |   | ns|
 |   |   | us|
 |   | ms|
 |  s|
h|

Given these elements, the end-to-end computing strategy can and should be improved.
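
To put these numbers in perspective for the roughly 25 GB of .CSV data sketched below, a rough back-of-envelope (a sketch in Python, using only the order-of-magnitude figures from the table above) goes like this:

GiB = 2**30
data_volume = 25 * GiB                                # ~25 GB of .CSV data, as in the diagram below

net_1gbps_s = data_volume * 8 / 1e9                   # ~1 Gbps network: seconds on the wire
disk_read_s = data_volume / (1 * GiB / 30)            # spinning disk: ~1 GB per 30 s sequential read
ram_read_s  = data_volume / (100 * 2**20 / 0.025)     # RAM: ~100 MB per 25 ms sequential read

print(f"1 Gbps network transfer ~ {net_1gbps_s / 60:4.1f} min")
print(f"sequential DISK read    ~ {disk_read_s / 60:4.1f} min")
print(f"sequential RAM read     ~ {ram_read_s:4.1f} s")

Moving the raw .CSV data across a LAN or reading it again from disk costs minutes per pass, while the same volume streams through RAM in seconds, which is exactly why the number of full passes over the data dominates the bill.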


AS-WAS STATE ... where the crash prevented any computing at all

A naive figure says more than a thousand words

localhost
:       file-I/O ~ 25 GB SLOWEST/EXPENSIVE
:         2nd time 25 GB file-I/O-s
:                  |
:                  | RAM       |
:                  |           |
 ------            |IOIOIOIOIOI|
|.CSV 0|           |IOIOIOIOIOI|
| ------           |IOIOIOIOIOI|
||.CSV 1|          |IOIOIOIOIOI|
|| ------          |IOIOIOIOIOI|-> local ssh()-encrypt encapsulate-process
|||.CSV 2|         |IOIOIOIOIOI|               25 GB of .CSV
 ||      |         |IOIOIOIOIOI|~~~~~~~|
 ||      |         |IOIOIOIOIOI|~~~~~~~|
  |      |         |           |~~~~~~~|
  |      |         |           |~~~~~~~|
   ------          |           |~~~~~~~|
  ...              |           |~~~~~~~|-> LAN  SLOW
   ...             |                   |   WAN  SLOWER
    ...            |                   |   transfer of 30 GB to "HPC" ( ssh()-decryption & file-I/O storage-costs omitted for clarity )
     ------        |                   |               |                   30 GB           file-I/O ~ 25 GB SLOWEST/EXPENSIVE
    |.CSV 9|       |                   |~~~~~~~~~~~~~~~|                                     2nd time 25 GB file-I/O-s
    | ------       |                   |~~~~~~~~~~~~~~~|
    ||.CSV 9|      |                   |~~~~~~~~~~~~~~~|
    || ------      |                   |~~~~~~~~~~~~~~~|
    |||.CSV 9|     |                   |~~~~~~~~~~~~~~~|
     ||     9|     |                   |~~~~~~~~~~~~~~~|
     ||     9|     |                   |~~~~~~~~~~~~~~~|
      |      |     |                   |~~~~~~~~~~~~~~~|
      |      |     |                   |~~~~~~~~~~~~~~~|-> file-I/O into python
       ------      |                                   |   all .CSV file to RAM ~ 25 GB SLOWEST/EXPENSIVE
                   |                                   |***|                        3rd time 25 GB file-I/O-s
                   |                                   |   RAM .CSV to df CPU work
                   |                                   |***|           df to LIST new RAM-allocation   list.append( df )-costs
                   |                                   |***|                      25 GB
                   |                                   |***|
many hours         |                                   |***|
  [SERIAL] flow ...|                                   |***|
/\/\/\/\/\/\/\/\/\/|\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\|***|/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
                   |                       crashed     |***|
                   |                    on about       |***|
                   |                       360-th file |***|
                   |                                   |***|->RAM ~ 50~GB with a LIST of all 25 GB dataframes held in LIST
                   |                                       |  CPU  mem-I/O costs LIST to new 25 GB dataframe RAM-allocation & DATA-processing
                   |                                       |~~~~~| mem-I/O RAM|              :: GB
                   |                                       |~~~~~| mem-I/O    |RAM flow of ~ 50 GB over only 2/3/? mem-I/O HW-channels
                   |                                       |~~~~~|                                      only if "HPC"
                   |                                       |~~~~~|                                           is *NOT* a "shared"-rental of cloud HW,
                   |                                       |~~~~~|                                                    remarketed as an "HPC"-illusion
                   |                                       |~~~~~|
                   |                                             :::::::::::::?
                   |                                             :::::::::::::?
                   |                                             :::::::::::::?
                    |                                            <...some amount of some useful work --"HPC"-processing the ~ 25 GB dataframe...>
                    |                                            <...some amount of some useful work                                         ...>
                    |                                            <...some amount of some useful work         the more                        ...>
                    |                                            <...some amount of some useful work         the better                      ...>
                    |                                            <...some amount of some useful work             as                          ...>
                    |                                            <...some amount of some useful work             it                          ...>
                    |                                            <...some amount of some useful work             dissolves to AWFULLY        ...>
                    |                                            <...some amount of some useful work                          HIGH           ...>
                    |                                            <...some amount of some useful work                          SETUP COSTS    ...>
                    |                                            <...some amount of some useful work                                         ...>
                    |                                            <...some amount of some useful work --"HPC"-processing the ~ 25 GB dataframe...>
                   |                                             :::::::::::::?
                   |                                             :::::::::::::?
                   |                                             :::::::::::::?
                   |                                                          |-> file-I/O ~ 25 GB SLOWEST/EXPENSIVE
                   |                                                          |~~~~~|          4th time 25 GB file-I/O-s
                   |                                                          |~~~~~|
                   |                                                          |~~~~~|->file left on remote storage (?)
                   |                                                                |
                   |                                                               O?R
                   |                                                                |
                   |                                                                |-> file-I/O ~ 25 GB SLOWEST/EXPENSIVE
                   |                                                                |~~~~~|          5th time 25 GB file-I/O-s
                   |                                                                |~~~~~|
                   |                                                                |~~~~~|
                   |                                                                |~~~~~|
                   |                                                                |~~~~~|-> RAM / CPU ssh()-encrypt encapsulate-process
                   |                                                                      |????????????|      25 GB of results for repatriation
                   |                                                                      |????????????|                        on localhost
                   |                                                                      |????????????|
                   |                                                                      |????????????|
                   |                                                                      |????????????|-> LAN  SLOW
                   |                                                                                   |   WAN  SLOWER
                    |                                                                                   |   transfer of 30 GB from "HPC" ( ssh()-decryption & file-I/O storage-costs omitted for clarity )
                   |                                                                                   |               |                     30 GB           file-I/O ~ 25 GB SLOWEST/EXPENSIVE
                   |                                                                                   |~~~~~~~~~~~~~~~|                                       6th time 25 GB file-I/O-s
                   |                                                                                   |~~~~~~~~~~~~~~~|
                   |                                                                                   |~~~~~~~~~~~~~~~|
                   |                                                                                   |~~~~~~~~~~~~~~~|
                   |                                                                                   |~~~~~~~~~~~~~~~|
 SUCCESS ?         |                                                                                   |~~~~~~~~~~~~~~~|-> file transferred back and stored on localhost storage
    after          |
          how many |
          failed   |
          attempts |
   having          |
          how high |
          recurring|
          costs    |
          for any  |
              next |
              model|
          recompute|
          step(s)  |
                   |
                   |
All                |
that               |
( at what overall  |
    [TIME]-domain  |
   & "HPC"-rental  |
           costs ) |

Tips :

  • review and reduce, where possible, expensive data-item representations ( avoid using int64 where 8 bits are enough; packed bitmaps can help a lot ) - see the dtype-downcasting sketch after this list
  • precompute on localhost all items that can be precomputed ( avoiding repetitive steps )
  • join such "reduced" CSV-files, using a trivial O/S command, into a single input - see the file-concatenation sketch after this list
  • compress all data before transports ( a few orders of magnitude saved ) - the same sketch shows one way to do it
  • prefer algorithms that can stream-process the data-flow, not waiting to load everything into RAM before computing an average or similarly trivial on-the-fly stream-computable values ( this reduces both the RAM-allocations & the [SERIAL]-one-after-another processing-pipeline latency ) - the chunked reader shown in the first answer is one such approach
  • if indeed needing pandas and fighting overall physical-RAM ceilings, you may try smart numpy-tools instead, where all the array syntax & methods remain the same, yet they can, by design, work without moving all data at once from disk into physical RAM - see the memory-mapping sketch after this list ( using this was my life-saving trick ever since, all the more when running many-model simulations & HyperParameterSPACE optimisations on a few tens of GB of data on 32-bit hardware )
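
For the first tip, a minimal sketch of dtype reduction (the file name 'some_file.txt' is just a placeholder for one of your inputs) could look like this:

import pandas as pd

df = pd.read_csv('some_file.txt', sep=None, header=0, engine='python')

# downcast each numeric column to the smallest dtype that still holds its values
for col in df.select_dtypes(include='integer').columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')
for col in df.select_dtypes(include='float').columns:
    df[col] = pd.to_numeric(df[col], downcast='float')

print(df.memory_usage(deep=True).sum() / 2**20, 'MiB after downcasting')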
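
For joining and compressing before any transfer, here is a sketch that never holds more than one file's buffer in RAM; the output name merged.csv.gz and the *.txt glob pattern are assumptions, and it presumes every input file carries the same single header row:

import gzip
import shutil
from pathlib import Path

all_files = sorted(Path('.').glob('*.txt'))   # adjust the pattern/location to your files

with gzip.open('merged.csv.gz', 'wb') as out:
    for i, filename in enumerate(all_files):
        with open(filename, 'rb') as src:
            if i > 0:
                src.readline()                # drop the repeated header row
            shutil.copyfileobj(src, out)      # stream the rest, byte for byte

( The shell one-liner equivalent, without the header handling, is simply: cat *.txt | gzip > merged.csv.gz )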
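
The numpy-tools hint most likely maps to disk-backed arrays such as numpy.memmap; a minimal sketch, with a hypothetical file name and hypothetical dimensions you would replace with the real row/column counts:

import numpy as np

n_rows, n_cols = 150_000_000, 10   # hypothetical shape of the merged numeric data

# a disk-backed array: the data lives in the file, the OS pages it into RAM on demand
mm = np.memmap('merged_data.dat', dtype='float32', mode='w+', shape=(n_rows, n_cols))

# fill it piece by piece, e.g. from the chunked CSV reader shown in the first answer:
# mm[row_offset:row_offset + len(chunk), :] = chunk.to_numpy(dtype='float32')

mm.flush()                         # push anything written so far out to the file
column_means = mm.mean(axis=0)     # the usual array syntax, without ~25 GB resident in RAM

The same trick works for reading: re-open the file later with mode='r' and the same dtype and shape, and slice it as if it were an ordinary ndarray.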