Split a text file into multiple files based on filename given on each line


I have a large (>10GB) file which is an InfluxDB line protocol export. Line protocol format is roughly

measurement,tag1=value1,tag2=value2,... value=XXX timestamp

for example

energymeter_total,uuid=c4695262-624c-11ea-b2f7-374e5ccddc43 value=31449.1 1632646540287000000

I want to split this file by measurement, i.e. by everything up to the first comma or space, using the measurement name as the target filename.

This does the job (except for the comma as separator) but is dreadfully slow (runs for 8h on an Intel i5 with SSD storage):

cat ../influx_export | while read FILE VAL TS ; do echo "$FILE $VAL $TS" >> "$FILE" ; done

I'm sure there is a scripted solution (no compiled code) that is at least 10x faster. However, the source file is too big to fit entirely into RAM.

Are there any more efficient approaches using awk, perl, sed, ruby, whatever?

CodePudding user response:

bash is notoriously slow for iterating over a file (because read only reads one character at a time to ensure it doesn't consume anything after a newline that may be intended for a following command to read).
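
A quick way to see why read has to be that careful: whatever it doesn't consume must stay on the pipe for the next command. A small illustration (nothing project-specific here):

# read consumes only the first line, byte by byte; the rest stays on the pipe for cat
printf 'line1\nline2\n' | { read -r first; cat; }
# prints: line2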

Use awk instead:

awk -F'[, ]' '{
   print $0 >> $1
}' ../influx_export

If there are many unique values for $1, you may run into a "too many open files" error. In that case, a simple (if inefficient) fix is to explicitly close each file immediately after writing to it. Even though awk then has to open a file for each line, this should still be faster than using pure bash.

awk -F'[, ]' '{
   print $0 >> $1; close($1)
}' ../influx_export
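
If you want to know up front whether the close() is even needed, one rough check (at the cost of an extra pass over the file) is to count the distinct measurement names and compare that against the per-process open-file limit:

# number of distinct measurements = number of files the first variant would keep open
awk -F'[, ]' '{print $1}' ../influx_export | sort -u | wc -l

# per-process open file limit; awk needs to stay comfortably below this
ulimit -n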

CodePudding user response:

Don't use shell loops to manipulate text; see why-is-using-a-shell-loop-to-process-text-considered-bad-practice.

Chances are this, using a DSU (Decorate-Sort-Undecorate) approach, is close to what you want if not exactly correct:

awk -F'[, ]' '{print $1, NR, $0}' file |   # decorate: prefix each line with its measurement name and line number
sort -k1,1 -k2,2n |                        # sort by measurement, preserving the original order within each
awk '
    $1 != out {                 # the measurement changed: switch output files
        close(out)
        out = $1
    }
    { print $4, $5 > out }      # write fields 4 and 5 of the decorated line to the per-measurement file
'

but it's obviously untested as you didn't provide sample input and expected output we could test with.
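
If it helps, here is a minimal way to sanity-check the pipeline before pointing it at the 10GB file; the second measurement name and the values below are made up for illustration:

mkdir -p /tmp/split_test && cd /tmp/split_test
cat > sample <<'EOF'
energymeter_total,uuid=c4695262-624c-11ea-b2f7-374e5ccddc43 value=31449.1 1632646540287000000
temperature_sensor,uuid=11111111-624c-11ea-b2f7-374e5ccddc43 value=21.5 1632646545287000000
energymeter_total,uuid=c4695262-624c-11ea-b2f7-374e5ccddc43 value=31450.2 1632646550287000000
EOF

awk -F'[, ]' '{print $1, NR, $0}' sample |
sort -k1,1 -k2,2n |
awk '
    $1 != out {
        close(out)
        out = $1
    }
    { print $4, $5 > out }
'

ls    # expect energymeter_total and temperature_sensor next to sample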

The awk commands each handle just one line at a time, so they use almost no memory, and sort is designed to handle huge files by using demand paging, temporary files, etc., so it doesn't need to fit the whole input in RAM. The above should therefore have no problem handling your input file efficiently.
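
If the sort step ends up being the bottleneck on a file that size, and you have GNU sort (an assumption), you can give it a larger in-memory buffer, a temp directory on the SSD, and several threads; byte-wise collation (LC_ALL=C) also speeds up comparisons, and the resulting ordering of measurement names doesn't matter here since we only need the lines grouped:

# GNU coreutils sort options; the buffer size and temp path are placeholders to adjust
LC_ALL=C sort -S 2G --parallel=4 -T /path/on/ssd/tmp -k1,1 -k2,2n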
