Split csv file into smaller csv files every 30 days using awk-CodePudding

I have a csv table that has a column that contains the date difference in days as integer values, and I want to split the file into 30 day intervals, such that the first file contains rows with the column values 0-30, the second one containing 31-60, and so on. How can I do that using the awk or split command?

Sample data:

ID, creation date, answer date, difference in days
1190, 2017-10-24 13:57:15.91 UTC, 2017-10-24 15:14:00.53 UTC, 0
610, 2017-10-24 13:57:15.91 UTC, 2017-10-24 14:58:17.77 UTC, 0

Whenever the last column (difference in days) reaches 30 then I want to split the file with all the previous rows moved to the new file.

I tried this awk command but the results are not what I wanted:

awk -F, '{if($14 % 30 == 0) x="F"  i;}{print > $14/30"months.csv"}' file.csv

CodePudding user response：

Firstly I suggest using integer division here rather than increase-when-divides-evenly, consider following sequence

1 2 3 4 7 9 10 10

and say you are dividing into size-5 interval, now if you increase-when-divides-evenly that would give

1 2 3 4 7 9 - 10 - 10

so two group were joined into one because no 5 is present and another was split into 2 groups due to repeated 10. Using integer division would give

1 2 3 4 - 7 9 - 10 10

Simple GNU AWK implementation let file.txt be

1,1
2,2
3,3
4,4
7,7
9,9
10,A
10,A

then

awk '{print > "file" int($1/5) ".txt"}' file.txt

creates file0.txt holding

1,1
2,2
3,3
4,4

and file1.txt holding

7,7
9,9

and file2.txt holding

10,A
10,A

Explanation: int(x/y) is integer division it does divide then takes floor.

(tested in GNU Awk 5.0.1)

CodePudding user response：

awk -F, 'BEGIN{ i=0; x="F"  i }{print > "months_"x".csv"}{if($4 % 30 == 0 && $4>0) x="F"  i;}' input.csv

BEGIN{ i=0; x="F" i } initialize i and x
{print > "months_"x".csv"} copy line to "months_{x}.csv",
{if($4 % 30 == 0 && $4>0) x="F" i;} Only after copying the line, check if the column (the 4th one) has reached 30, and change x.

NOTE: The header will only be in the first output file ("months_F1.csv")

With this input:

ID, creation date, answer date, difference in days
1190, 2017-10-24 13:57:15.91 UTC, 2017-10-24 15:14:00.53 UTC, 0
610, 2017-10-24 13:57:15.91 UTC, 2017-10-24 14:58:17.77 UTC, 0
610, 2017-10-24 13:57:15.91 UTC, 2017-10-24 14:58:17.77 UTC, 30
1190, 2017-10-24 13:57:15.91 UTC, 2017-10-24 15:14:00.53 UTC, 0
610, 2017-10-24 13:57:15.91 UTC, 2017-10-24 14:58:17.77 UTC, 0
610, 2017-10-24 13:57:15.91 UTC, 2017-10-24 14:58:17.77 UTC, 30

the following files will be created: "months_F1.csv"

ID, creation date, answer date, difference in days
1190, 2017-10-24 13:57:15.91 UTC, 2017-10-24 15:14:00.53 UTC, 0
610, 2017-10-24 13:57:15.91 UTC, 2017-10-24 14:58:17.77 UTC, 0
610, 2017-10-24 13:57:15.91 UTC, 2017-10-24 14:58:17.77 UTC, 30

and "months_F2.csv"

1190, 2017-10-24 13:57:15.91 UTC, 2017-10-24 15:14:00.53 UTC, 0
610, 2017-10-24 13:57:15.91 UTC, 2017-10-24 14:58:17.77 UTC, 0
610, 2017-10-24 13:57:15.91 UTC, 2017-10-24 14:58:17.77 UTC, 30