I am trying to read several (>1000) .txt files and merge them into one pandas dataframe (to run an analysis on the entire dataset). I am running this through SSH on an HPC server, on which I requested 50GB RAM, 1 node, 1 task-per-node (that was all just a wild guess, never done this before). So far, my idea was this:
import pandas as pd

li = []
for filename in all_files:
    df = pd.read_csv(filename, sep=None, header=0, engine='python')
    li.append(df)
df = pd.concat(li, axis=0, ignore_index=True)
but after a few hours, while loading approximately the 360th file, the process gets killed and I get the error message:
numpy.core._exceptions.MemoryError: Unable to allocate 1.11 GiB for an array with shape (10, 14921599) and data type float64
Do you have any idea how to load and merge the data in a more memory-efficient way? (I assume that just requesting more RAM still would not get me through the entire set of .txt files!?)
Also, I would like to save the resulting dataframe in a 'memory-efficient' way afterwards; do you know the best way/format (csv?) to do that?
Any help would be much appreciated!
CodePudding user response:
As you said, you have so many files and they need so much memory, so I suggest loading the files one at a time and appending each one to a single file on disk (appending the data to the previously saved data), instead of keeping them all in RAM, like this:
import pandas as pd

for filename in all_files:
    df = pd.read_csv(filename, sep=None, header=0, engine='python')
    # append this file's rows to one growing CSV on disk
    df.to_csv('./data.csv', header=False, index=False, mode='a')
After all of the files have been appended into that single, header-less file, you can read it back as one dataframe like this:
df = pd.read_csv('./data.csv', header=None, index_col=False)
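If even this single read is tight on memory, a hedged variant (not part of the original suggestion) is to request smaller dtypes up front; float32 here is an assumption and only makes sense if all columns are numeric and the reduced precision is acceptable for your analysis:

import numpy as np
import pandas as pd

# roughly halves the footprint of the float columns compared to the default float64
df = pd.read_csv('./data.csv', header=None, index_col=False, dtype=np.float32)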
If reading this file still gives you memory issues, you can use a chunked reader like this:
chunksize = 10 ** 6
with pd.read_csv('./data.csv', header=None, index_col=False, chunksize=chunksize) as reader:
    for chunk in reader:
        # do what you want with each chunk here,
        # instead of with the whole dataframe at once
        ...
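For instance, a minimal sketch of a streaming aggregation over those chunks; the use of column 0 and of a mean are just placeholders for whatever your analysis actually needs:

import pandas as pd

chunksize = 10 ** 6
total, count = 0.0, 0
with pd.read_csv('./data.csv', header=None, index_col=False, chunksize=chunksize) as reader:
    for chunk in reader:
        # accumulate a running sum and row count, so the mean of column 0
        # is computed without ever holding all rows in RAM at once
        total += chunk[0].sum()
        count += len(chunk)

mean_of_first_column = total / count

Only one chunk (plus the accumulators) lives in memory at any moment, so peak RAM is bounded by the chunksize rather than by the size of the merged file.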
CodePudding user response:
Q : "How can I ... Any help would be much appreciated!"
A :
best follow the laws of the ECONOMY-of-COMPUTING :
Your briefly sketched problem has, without question, immense "setup" costs, with an unspecified amount of useful work to be computed on an unspecified HPC ecosystem.
Even without hardware & rental details, a few general rules apply ( the devil is always hidden in the detail(s), and one can easily pay hilarious amounts of money trying to make a (hiddenly) "shared" platform deliver any improved computing performance - many startups have experienced this on voucher-sponsored promises, the more so if the overall computing strategy was poorly designed ).
ELEMENTARY RULES-of-THUMB ... how much we pay in [TIME]
Sorry if these were known to you beforehand; I am just trying to build some common ground, a platform to lay further rock-solid argumentation on. And this is only a necessary beginning, as more problems will definitely come from any real-world O( Mx * Ny * ... )-scaling related issues in further modelling.
0.1 ns - CPU NOP - a DO-NOTHING instruction
0.5 ns - CPU L1 dCACHE reference ( 1st introduced in the late '80s )
1 ns - speed-of-light (a photon) travels a 1 ft (30.5 cm) distance -- will stay, throughout any foreseeable future :o)
3~4 ns - CPU L2 CACHE reference (2020/Q1)
7 ns - CPU L2 CACHE reference
19 ns - CPU L3 CACHE reference (2020/Q1 considered slow on 28c Skylake)
______________________on_CPU______________________________________________________________________________________
71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
100 ns - own DDR MEMORY reference
135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
2,500 ns - Read 10 kB sequentially from MEMORY------ HPC-node
25,000 ns - Read 100 kB sequentially from MEMORY------ HPC-node
250,000 ns - Read 1 MB sequentially from MEMORY------ HPC-node
2,500,000 ns - Read 10 MB sequentially from MEMORY------ HPC-node
25,000,000 ns - Read 100 MB sequentially from MEMORY------ HPC-node
_____________________________________________________________________________own_CPU/DDR__________________________
500,000 ns - Round trip within a same DataCenter ------- HPC-node / HPC-storage latency on each access
20,000,000 ns - Send 2 MB over 1 Gbps NETWORK
200,000,000 ns - Send 20 MB over 1 Gbps NETWORK
2,000,000,000 ns - Send 200 MB over 1 Gbps NETWORK
20,000,000,000 ns - Send 2 GB over 1 Gbps NETWORK
200,000,000,000 ns - Send 20 GB over 1 Gbps NETWORK
300,000,000,000 ns - Send 30 GB over 1 Gbps NETWORK
____________________________________________________________________________via_LAN_______________________________
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
____________________________________________________________________________via_WAN_______________________________
10,000,000 ns - DISK seek spent to start file-I/O on spinning disks on any next piece of data seek/read
30,000,000 ns - DISK 1 MB sequential READ from a DISK
300,000,000 ns - DISK 10 MB sequential READ from a DISK
3,000,000,000 ns - DISK 100 MB sequential READ from a DISK
30,000,000,000 ns - DISK 1 GB sequential READ from a DISK
300,000,000,000 ns - DISK 10 GB sequential READ from a DISK
______________________on_DISK_______________________________________________own_DISK______________________________
( in the table above, the trailing columns mark the ns / us / ms / s / h orders of magnitude )
Given these elements, the end-to-end computing strategy can and should be improved.
AS-WAS STATE ... where the crash prevented any computing at all
A naive figure tells more than a thousand words :
localhost
: file-I/O ~ 25 GB SLOWEST/EXPENSIVE
: 2nd time 25 GB file-I/O-s
: |
: | RAM |
: | |
------ |IOIOIOIOIOI|
|.CSV 0| |IOIOIOIOIOI|
| ------ |IOIOIOIOIOI|
||.CSV 1| |IOIOIOIOIOI|
|| ------ |IOIOIOIOIOI|-> local ssh()-encrypt encapsulate-process
|||.CSV 2| |IOIOIOIOIOI| 25 GB of .CSV
|| | |IOIOIOIOIOI|~~~~~~~|
|| | |IOIOIOIOIOI|~~~~~~~|
| | | |~~~~~~~|
| | | |~~~~~~~|
------ | |~~~~~~~|
... | |~~~~~~~|-> LAN SLOW
... | | WAN SLOWER
... | | transfer of 30 GB to "HPC" ( ssh()-decryption & file-I/O storage-costs omitted for clarity )
------ | | | 30 GB file-I/O ~ 25 GB SLOWEST/EXPENSIVE
|.CSV 9| | |~~~~~~~~~~~~~~~| 2nd time 25 GB file-I/O-s
| ------ | |~~~~~~~~~~~~~~~|
||.CSV 9| | |~~~~~~~~~~~~~~~|
|| ------ | |~~~~~~~~~~~~~~~|
|||.CSV 9| | |~~~~~~~~~~~~~~~|
|| 9| | |~~~~~~~~~~~~~~~|
|| 9| | |~~~~~~~~~~~~~~~|
| | | |~~~~~~~~~~~~~~~|
| | | |~~~~~~~~~~~~~~~|-> file-I/O into python
------ | | all .CSV file to RAM ~ 25 GB SLOWEST/EXPENSIVE
| |***| 3rd time 25 GB file-I/O-s
| | RAM .CSV to df CPU work
| |***| df to LIST new RAM-allocation list.append( df )-costs
| |***| 25 GB
| |***|
many hours | |***|
[SERIAL] flow ...| |***|
/\/\/\/\/\/\/\/\/\/|\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\|***|/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
| crashed |***|
| on about |***|
| 360-th file |***|
| |***|->RAM ~ 50~GB with a LIST of all 25 GB dataframes held in LIST
| | CPU mem-I/O costs LIST to new 25 GB dataframe RAM-allocation & DATA-processing
| |~~~~~| mem-I/O RAM| :: GB
| |~~~~~| mem-I/O |RAM flow of ~ 50 GB over only 2/3/? mem-I/O HW-channels
| |~~~~~| only if "HPC"
| |~~~~~| is *NOT* a "shared"-rental of cloud HW,
| |~~~~~| remarketed as an "HPC"-illusion
| |~~~~~|
| :::::::::::::?
| :::::::::::::?
| :::::::::::::?
| <...some amount of some usefull work --"HPC"-processing the ~ 25 GB dataframe...>
| <...some amount of some usefull work ...>
| <...some amount of some usefull work the more ...>
| <...some amount of some usefull work the better ...>
| <...some amount of some usefull work as ...>
| <...some amount of some usefull work it ...>
| <...some amount of some usefull work dissolves to AWFULLY ...>
| <...some amount of some usefull work HIGH ...>
| <...some amount of some usefull work SETUP COSTS ...>
| <...some amount of some usefull work ...>
| <...some amount of some usefull work --"HPC"-processing the ~ 25 GB dataframe...>
| :::::::::::::?
| :::::::::::::?
| :::::::::::::?
| |-> file-I/O ~ 25 GB SLOWEST/EXPENSIVE
| |~~~~~| 4th time 25 GB file-I/O-s
| |~~~~~|
| |~~~~~|->file left on remote storage (?)
| |
| O?R
| |
| |-> file-I/O ~ 25 GB SLOWEST/EXPENSIVE
| |~~~~~| 5th time 25 GB file-I/O-s
| |~~~~~|
| |~~~~~|
| |~~~~~|
| |~~~~~|-> RAM / CPU ssh()-encrypt encapsulate-process
| |????????????| 25 GB of results for repatriation
| |????????????| on localhost
| |????????????|
| |????????????|
| |????????????|-> LAN SLOW
| | WAN SLOWER
| | transfer of 30 GB from "HPC" ( ssh()-decryption & file-I/O storage-costs omitted for clarity )
| | | 30 GB file-I/O ~ 25 GB SLOWEST/EXPENSIVE
| |~~~~~~~~~~~~~~~| 6th time 25 GB file-I/O-s
| |~~~~~~~~~~~~~~~|
| |~~~~~~~~~~~~~~~|
| |~~~~~~~~~~~~~~~|
| |~~~~~~~~~~~~~~~|
SUCCESS ? | |~~~~~~~~~~~~~~~|-> file transferred back and stored on localhost storage
after |
how many |
failed |
attempts |
having |
how high |
recurring|
costs |
for any |
next |
model|
recompute|
step(s) |
|
|
All |
that |
( at what overall |
[TIME]-domain |
& "HPC"-rental |
costs ) |
Tips :
- review and reduce, where possible, expensive data-item representations ( avoid int64 where 8 bits are enough; packed bitmaps can help a lot ) -- see the first sketch after this list
- precompute on localhost all items that can be precomputed ( avoiding repetitive steps )
- join the so-"reduced" CSV-files into a single input, using a trivial O/S command
- compress all data before any transport ( a few orders of magnitude saved )
- prefer algorithms that can stream-process the data-flow, instead of waiting until everything is loaded in RAM before computing an average or similar trivially on-the-fly stream-computable values ( this reduces both the RAM-allocations and the [SERIAL] one-after-another processing-pipeline latency )
- if you indeed need pandas and keep fighting the physical-RAM ceiling, you may try smart numpy tools instead, where all the array syntax & methods remain the same, yet they can, by design, work without moving all data at once from disk into physical RAM ( using this has been my life-saving trick ever since, the more so when running many-model simulations & HyperParameterSPACE optimisations on a few tens of GB of data on 32-bit hardware ) -- a sketch follows below this list
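As an illustration of the first and the fourth tip, a minimal sketch; the column names and the reduced dtypes below are pure assumptions and have to be adapted to what your .txt files actually contain:

import numpy as np
import pandas as pd

# hypothetical column names with reduced dtypes -- adapt to the real files
dtypes = {'sensor_id': np.int16,    # instead of the default int64
          'value':     np.float32}  # instead of the default float64

df = pd.read_csv('some_file.txt', sep=None, engine='python', dtype=dtypes)

# write a compressed copy before moving anything over the network
df.to_csv('some_file.csv.gz', index=False, compression='gzip')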
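And for the last tip, assuming the "smart numpy tools" meant here are memory-mapped arrays ( numpy.memmap ), a minimal sketch that first streams the merged CSV into a flat float32 binary file and then works on it without ever loading it whole; the 10 columns are only an illustration taken from the shape in the error message:

import os
import numpy as np
import pandas as pd

n_cols = 10   # illustrative only, matching the shape reported in the MemoryError

# 1) stream the merged CSV into a raw float32 binary file, chunk by chunk
with open('data.float32.bin', 'wb') as out:
    for chunk in pd.read_csv('./data.csv', header=None, chunksize=10 ** 6):
        out.write(chunk.to_numpy(dtype=np.float32).tobytes())

# 2) memory-map it: the array syntax stays the same, yet the data is paged
#    in from disk on demand instead of being held in RAM all at once
n_rows = os.path.getsize('data.float32.bin') // (4 * n_cols)   # 4 bytes per float32
mm = np.memmap('data.float32.bin', dtype=np.float32, mode='r', shape=(n_rows, n_cols))

col_means = mm.mean(axis=0)   # reductions stream through the file via the OS page-cache

Peak RAM then stays around one CSV chunk plus whatever the operating system decides to keep cached, instead of the full ~25 GB dataframe.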