Is there a better way to read groups of CSV rows from the same file?


I have a function that performs actions on 12 rows of CSV data.

The 12 rows are read from a larger CSV file. When it finishes, the process is repeated five more times on the subsequent batches of 12 rows.


(defun main ()
  (send-actions 0 12 "client-one")
  (send-actions 12 24 "client-two")
  ;; ... and so on for the remaining clients ...
  (send-actions 60 72 "client-six")
  ;; ...
  )

(defun send-actions (start end account)
  (check-upload-send (subseq (read-csv #P"~/Desktop/data.csv") start end) account))

(defun check-upload-send (table account)
  ;; function to check for duplicates ...
  ;; function to perform action 1 ...
  ;; ...
  )

This works really well. Most of the time.

Every so often it throws a duplicate error, because it re-reads row 25 (which is the first item processed by (send-actions 24 36 "client-three")).

Is there a better function or approach for reading a group of CSV lines and performing an action on it, with the ability to move on to the next batch of 12 lines?

Thanks

Note: read-csv is from cl-csv.

CodePudding user response:

Let's use this /tmp/test.csv file:

01,02,03,04,05
za,zb,zc,zd,ze
11,12,13,14,15
aa,ab,ac,ad,ae
21,22,23,24,25
ba,bb,bc,bd,be
31,32,33,34,35
ca,cb,cc,cd,ce
41,42,43,44,45
da,db,dc,dd,de
51,52,53,54,55
ea,eb,ec,ed,ee

(That is 6 groups of 2 rows each.)

When using cl-csv:read-csv with a row-fn argument, the rows are not collected but handed to a callback function. I know this style is sometimes called "callback hell" when there are too many levels of callbacks, but I tend to like it because it is easy to compose. Furthermore, you only read the file once and do not need to store an indeterminate amount of data in memory (though here the amount is negligible).
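
For reference, the simplest form of the callback style hands each row, as a list of strings, to the function passed as row-fn; nothing is collected unless the callback itself does it:

(cl-csv:read-csv #P"/tmp/test.csv" :row-fn #'print)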

For example, I can write a higher-order function make-group-reader that creates a closure which collects rows in groups of size size, then calls another callback group-fn once a group is complete. Note that the group callback is never called on the last fragment if it is incomplete (a possible workaround is sketched after the example output below):

(defun make-group-reader (size group-fn &key (sharedp t))
  (check-type size (integer 1))
  ;; GROUP is a fixed-size buffer; its fill pointer tracks how many rows
  ;; of the current group have been collected so far.
  (let ((group (make-array size :initial-element nil :fill-pointer 0)))
    (let ((limit (1- size)))
      (lambda (row)
        ;; VECTOR-PUSH returns the index it wrote to; reaching LIMIT
        ;; means the group is full, so hand it over and reset the buffer.
        (when (= limit (vector-push row group))
          (funcall group-fn (if sharedp group (copy-seq group)))
          (setf (fill-pointer group) 0))))))

The sharedp argument is T by default, in which case the internal vector is passed directly to the callback. If you set it to NIL, the vector is first copied with COPY-SEQ.
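
This matters if you accumulate groups beyond the callback (the accumulation below is my own illustration, not part of the original answer): with the shared vector, every collected entry would alias the same buffer that is reset after each group, so pass :sharedp nil in that case:

(let ((groups '()))
  (cl-csv:read-csv #P"/tmp/test.csv"
                   :row-fn (make-group-reader 2
                                              (lambda (g) (push g groups))
                                              :sharedp nil))
  (nreverse groups))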

> (cl-csv:read-csv #P"/tmp/test.csv" :row-fn (make-group-reader 2 #'print))

#(("01" "02" "03" "04" "05") ("za" "zb" "zc" "zd" "ze")) 
#(("11" "12" "13" "14" "15") ("aa" "ab" "ac" "ad" "ae")) 
#(("21" "22" "23" "24" "25") ("ba" "bb" "bc" "bd" "be")) 
#(("31" "32" "33" "34" "35") ("ca" "cb" "cc" "cd" "ce")) 
#(("41" "42" "43" "44" "45") ("da" "db" "dc" "dd" "de")) 
#(("51" "52" "53" "54" "55") ("ea" "eb" "ec" "ed" "ee"))

Then I would make another callback that pops an item from a list of clients for each incoming group. This associates each group with a client and calls another function:

(defun make-clients-callback (callback clients)
  ;; POP mutates the closed-over CLIENTS list, pairing each successive
  ;; group with the next client.
  (lambda (group)
    (funcall callback group (pop clients))))

Let's define a sample list of clients:

> (defvar *clients* (loop for i from 1 to 6 collect (format nil "client ~r" i)))
*CLIENTS*

> *clients*
("client one" "client two" "client three" "client four" "client five" "client six")

Also, a debug-client-cb for tests:

> (defun debug-client-cb (group client)
    (print `(:client ,client :group ,group)))
DEBUG-CLIENT-CB

Then the following call groups rows by 2 and calls our debugging function for each group with the associated client.

> (cl-csv:read-csv
    #P"/tmp/test.csv"
   :row-fn (make-group-reader 2 (make-clients-callback 'debug-client-cb 
                                                       *clients*)))
(:CLIENT "client one" :GROUP #(("01" "02" "03" "04" "05") ("za" "zb" "zc" "zd" "ze"))) 
(:CLIENT "client two" :GROUP #(("11" "12" "13" "14" "15") ("aa" "ab" "ac" "ad" "ae"))) 
(:CLIENT "client three" :GROUP #(("21" "22" "23" "24" "25") ("ba" "bb" "bc" "bd" "be"))) 
(:CLIENT "client four" :GROUP #(("31" "32" "33" "34" "35") ("ca" "cb" "cc" "cd" "ce"))) 
(:CLIENT "client five" :GROUP #(("41" "42" "43" "44" "45") ("da" "db" "dc" "dd" "de"))) 
(:CLIENT "client six" :GROUP #(("51" "52" "53" "54" "55") ("ea" "eb" "ec" "ed" "ee"))) 

You can simplify things a bit as follows:

(defun make-my-csv-reader (&optional (clients *clients*))
  (make-group-reader 2 (make-clients-callback #'check-upload-send clients)))

And pass (make-my-csv-reader) as a :row-fn.
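
Applied back to the original question, that means a group size of 12 and the six account names. A sketch (I am guessing the account names elided in the question; note also that each group arrives as a vector rather than the list subseq produced, so coerce it to a list if check-upload-send expects one):

(defun main ()
  (cl-csv:read-csv
   #P"~/Desktop/data.csv"
   :row-fn (make-group-reader
            12
            (make-clients-callback #'check-upload-send
                                   '("client-one" "client-two" "client-three"
                                     "client-four" "client-five" "client-six")))))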


I am curious why you would use this approach over the subseq approach? Is the main reason that the data is read only once?

Yes. I try to avoid loading the whole file when processing data: it is a bit more robust to treat the data as a stream of values and build a result incrementally than to have a step that needs an indeterminate amount of memory.

In some cases this is not the most rational approach because it is easier/faster to read the whole file in one pass and process it.
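
In the question's setting, that one-pass variant could look like this (a sketch, with the same assumed account names as above):

(let ((rows (cl-csv:read-csv #P"~/Desktop/data.csv")))
  (loop for start from 0 by 12
        for client in '("client-one" "client-two" "client-three"
                        "client-four" "client-five" "client-six")
        do (check-upload-send (subseq rows start (+ start 12)) client)))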

Are there any other reasons?

I also like how the complexity is broken down into layers: first grouping items, then processing them, and so on, with the possibility of inserting an intermediate function anywhere in the pipeline, e.g. for debugging.
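
For instance, a pass-through stage (my own illustration) that prints each group before forwarding it can be spliced in without touching the other stages:

(defun make-tracing-callback (next)
  (lambda (group)
    (print group)               ; inspect the group as it flows by
    (funcall next group)))

;; slots into the earlier pipeline as:
(make-group-reader 2 (make-tracing-callback
                      (make-clients-callback 'debug-client-cb *clients*)))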

Also, the callbacks are closures, which can be used to control when processing ends. Instead of reading and processing every row, I can return early from a closure:

> (block :find-row
    (cl-csv:read-csv #P"/tmp/test.csv" 
                     :row-fn (lambda (row) 
                               (when (find "aa" row :test #'string=)
                                 (return-from :find-row row)))))

("aa" "ab" "ac" "ad" "ae")