The problem
I have a CSV file called data.csv. Each row has:
timestamp: int
account_id: int
value: float
for instance:
timestamp,account_id,value
10,0,0.262
10,0,0.111
13,1,0.787
14,0,0.990
- This file is ordered by timestamp.
- The number of rows is too large to store them all in memory.
- Order of magnitude: 100M rows, 5M accounts.
How can I quickly get all rows for a given account_id? What would be the best way to make the data accessible by account_id?
Things I tried
To generate a sample:
import os
import random
import shutil

import tqdm

N_ROW = 10**6
N_ACCOUNT = 10**5

# Generate data to split
with open('./data.csv', 'w') as csv_file:
    csv_file.write('timestamp,account_id,value\n')
    for timestamp in tqdm.tqdm(range(N_ROW), desc='writing csv file to split'):
        account_id = random.randint(1, N_ACCOUNT)
        value = random.random()
        csv_file.write(f'{timestamp},{account_id},{value}\n')

# Clean result folder
if os.path.isdir('./result'):
    shutil.rmtree('./result')
os.mkdir('./result')
Solution 1
Write a script that creates one file per account: read the original CSV row by row and append each row to the file that corresponds to its account (opening and closing that file for every row).
Code:
# Split the data
p_bar = tqdm.tqdm(total=N_ROW, desc='splitting csv file')
with open('./data.csv') as data_file:
    next(data_file)  # skip header
    for row in data_file:
        account_id = row.split(',')[1]
        account_file_path = f'result/{account_id}.csv'
        file_opening_mode = 'a' if os.path.isfile(account_file_path) else 'w'
        with open(account_file_path, file_opening_mode) as account_file:
            account_file.write(row)
        p_bar.update(1)
Issues:
It is quite slow (I think it is inefficient to open and close a file on each row): it takes around 4 minutes for 1M rows. And even if the split works, will lookups be fast? Given an account_id I know the name of the file I should read, but the filesystem has to search among 5M files in one directory to find it. Should I create some kind of tree of folders, with the files as leaves?
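The folder-tree idea from the question could be sketched like this (a hypothetical scheme, not from the question; N_BUCKETS and bucket_path are assumed names): spread the account files across a fixed number of subdirectories so that no single directory holds millions of entries.

```python
import os

N_BUCKETS = 1000  # assumed fan-out: ~5000 files per directory for 5M accounts

def bucket_path(account_id: int) -> str:
    # Map each account to result/<bucket>/<account_id>.csv so a lookup
    # only has to search one small directory instead of 5M files.
    bucket = account_id % N_BUCKETS
    return os.path.join('result', str(bucket), f'{account_id}.csv')
```

Given an account_id, the target path is computed directly, so reading an account's rows never requires scanning the directory.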
Solution 2 (works on small example not on large csv file)
Same idea as solution 1, but instead of opening/closing a file for each row, keep the open files in a dictionary.
Code:
# A dict that will hold all open files
account_file_dict = {}

# Given an account id, return the file to write to (creating it if it does not exist)
def get_account_file(account_id):
    file = account_file_dict.get(account_id)
    if file is None:
        file = open(f'./result/{account_id}.csv', 'w')
        account_file_dict[account_id] = file
    return file

# Split the data
p_bar = tqdm.tqdm(total=N_ROW, desc='splitting csv file')
with open('./data.csv') as data_file:
    next(data_file)  # skip header
    for row in data_file:
        account_id = row.split(',')[1]
        account_file = get_account_file(account_id)
        account_file.write(row)
        p_bar.update(1)
Issues:
I am not sure it is actually faster.
I have to keep 5M files open simultaneously (one per account), so it fails with OSError: [Errno 24] Too many open files: './result/33725.csv'.
Solution 3 (works on small example not on large csv file)
Use an awk command, following this solution: split large csv text file based on column value
Code:
After generating the file, run:
awk -F, 'NR==1 {h=$0; next} {f="./result/"$2".csv"} !($2 in p) {p[$2]; print h > f} {print >> f}' ./data.csv
Issues:
I get the following error: input record number 28229, file ./data.csv source line number 1 (28229 is just an example; it usually fails around 28k records). I assume this is also because too many files end up open at once.
CodePudding user response:
@VinceM:
While not quite 15 GB, I do have a 7.6 GB one with 3 columns: 148 mn prime numbers, their base-2 log, and their hex.

 in0: 7.59GiB 0:00:09 [ 841MiB/s] [ 841MiB/s] [========>] 100%
148,156,631 lines 7773.641 MB ( 8151253694) /dev/stdin
f="$( grealpath -ePq ~/master_primelist_19d.txt )"
( time ( for __ in '12' '34' '56' '78' '9'; do
( gawk -v ___="${__}" -Mbe 'BEGIN {
___="^["(___%((_ =_^=FS=OFS="=") _*_*_)^_)"]"
} ($_)~___ && ($NF = int(($_)^_))^!_' "${f}" & ) done |
gcat - ) ) | pvE9 > "${DT}/test_primes_squared_00000002.txt"
out9: 13.2GiB 0:02:06 [98.4MiB/s] [ 106MiB/s] [ <=> ]
Using only 5 instances of gawk with the big-integer package gnu-GMP, each with a designated subset of leading digit(s) of the prime number, it managed to calculate the full-precision squaring of those primes in just 2 minutes 6 seconds, yielding an unsorted 13.2 GB output file.
If it can square that quickly, then merely grouping by account_id should be a walk in the park.
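One concrete way to do that grouping (a sketch, not part of the answer above): sort the file by account_id once with an external sort, then stream consecutive groups with itertools.groupby. The sample below uses the tiny data set from the question instead of the real file.

```python
import itertools

# Assume the 100M rows were first sorted by account_id externally, e.g.:
#   tail -n +2 data.csv | sort -t, -k2,2n > sorted.csv
# Here we use a small already-sorted sample instead of the real file.
sorted_rows = [
    ('10', '0', '0.262'),
    ('10', '0', '0.111'),
    ('14', '0', '0.990'),
    ('13', '1', '0.787'),
]

# groupby yields one group per consecutive run of equal account_id,
# so a sorted file can be streamed group by group in constant memory.
groups = {
    account_id: list(rows)
    for account_id, rows in itertools.groupby(sorted_rows, key=lambda r: r[1])
}
```

With the real file, each group could be written to its own per-account file in a single sequential pass, with only one output file open at a time.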
CodePudding user response:
Have a look at https://docs.python.org/3/library/sqlite3.html. You could import the data, create the required indexes, and then run queries normally. No dependencies except Python itself.
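A minimal sketch of the sqlite3 route, using the sample rows from the question (with the real CSV you would connect to a file on disk and bulk-insert with executemany from csv.reader instead of the inline list):

```python
import sqlite3

# Load the rows into SQLite once and index account_id; after that,
# every lookup uses the index instead of scanning all 100M rows.
con = sqlite3.connect(':memory:')  # use a file path for persistent data
con.execute('CREATE TABLE rows (timestamp INT, account_id INT, value REAL)')
sample = [(10, 0, 0.262), (10, 0, 0.111), (13, 1, 0.787), (14, 0, 0.990)]
con.executemany('INSERT INTO rows VALUES (?, ?, ?)', sample)
con.execute('CREATE INDEX idx_account ON rows(account_id)')
con.commit()

# All rows for account_id = 0
rows = con.execute(
    'SELECT timestamp, account_id, value FROM rows WHERE account_id = ?', (0,)
).fetchall()
```

The one-time import and indexing is paid once; afterwards each per-account query touches only the matching rows.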
If you prefer a dataframe approach, polars can scan the CSV lazily: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_csv.html
If you have to query the raw data every time and you are limited to plain Python, then you can either write code that reads the file manually and yields matched rows, or use a helper like this:
from convtools.contrib.tables import Table
from convtools import conversion as c

iterable_of_matched_rows = (
    Table.from_csv("tmp/in.csv", header=True)
    .filter(c.col("account_id") == "1")
    .into_iter_rows(dict)
)
However this won't be faster than reading the 100M-row CSV file with csv.reader.
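For reference, that plain-Python baseline can be written as a generator that streams the file and yields only the matching rows (rows_for_account is a hypothetical helper name, not from the answer above):

```python
import csv

def rows_for_account(path, account_id):
    # Stream the CSV and yield only rows for one account; memory use is
    # constant regardless of file size, but every query scans the whole file.
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        for timestamp, acc, value in reader:
            if acc == account_id:
                yield timestamp, acc, value
```

This is the O(file size) cost per query that the indexed approaches above are meant to avoid.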