I have a large (15M rows) CSV file with multiple term periods (STRT_DTE and EXP_DTE) for some IDs. I also have a column called TRM_NBR that can take the values 1, 2, 3, etc. Sequential TRM_NBRs signify contiguous term periods (i.e., three rows with the same ID and TRM_NBR 1, 2, 3 form one contiguous period, whereas 1, 2, 1 forms two periods).
I need to reduce the CSV file so that I have one row per contiguous term period (and ID); see the example below. Specifically, I need to output the ID, the STRT_DTE from the first term in the sequence, and the EXP_DTE from the last term in the sequence. (Keeping TRM_NBR is optional.)
Original data:
ID;TRM_NBR;STRT_DTE;EXP_DTE
000020000007;1;2017-08-22 00:00:00.000;2016-09-20 00:00:00.000
000020000009;1;2015-07-23 00:00:00.000;2015-03-24 00:00:00.000
000020000017;1;2014-10-02 00:00:00.000;2014-10-02 00:00:00.000
000020000063;1;2018-11-19 00:00:00.000;2018-11-19 00:00:00.000
000020000063;2;2020-11-19 00:00:00.000;2020-11-19 00:00:00.000
000020000356;1;2020-06-23 00:00:00.000;2020-06-23 00:00:00.000
000020000356;2;2021-05-20 00:00:00.000;2021-05-20 00:00:00.000
000020000356;3;2022-04-21 00:00:00.000;2021-10-21 00:00:00.000
000020000356;2;2014-07-22 00:00:00.000;2014-09-09 00:00:00.000
000020000356;3;2015-07-21 00:00:00.000;2015-07-21 00:00:00.000
000020000356;4;2016-07-12 00:00:00.000;2016-07-12 00:00:00.000
000020000356;5;2017-07-11 00:00:00.000;2017-07-11 00:00:00.000
Desired output:
ID;TRM_NBR;STRT_DTE;EXP_DTE
000020000007;1;2017-08-22 00:00:00.000;2016-09-20 00:00:00.000
000020000009;1;2015-07-23 00:00:00.000;2015-03-24 00:00:00.000
000020000017;1;2014-10-02 00:00:00.000;2014-10-02 00:00:00.000
000020000063;1;2018-11-19 00:00:00.000;2020-11-19 00:00:00.000
000020000356;1;2020-06-23 00:00:00.000;2021-10-21 00:00:00.000
000020000356;2;2014-07-22 00:00:00.000;2017-07-11 00:00:00.000
Contiguous periods can run from one term to any number of terms and may or may not start at 1, but TRM_NBR will always increment by 1 within a sequence (when there is more than one row). The output file can contain multiple rows for the same ID (one per sequence). You can assume that the file is ordered correctly in terms of sequence.
I can use Python, MySQL, or Mac Terminal tools to solve this.
Answer:
The solution for MySQL.
- Create a temporary table to import the source data into (the rowno AUTO_INCREMENT column records the original row order, which the window functions below rely on):
CREATE TEMPORARY TABLE tmp (
    rowno INT AUTO_INCREMENT PRIMARY KEY,
    id CHAR(12),
    trm_nbr TINYINT,
    strt_dte DATETIME(3),
    exp_dte DATETIME(3)
) ENGINE = Memory;
- Import your source data into it:
LOAD DATA INFILE 'X:/folder/filename.CSV'
INTO TABLE tmp
COLUMNS TERMINATED BY ';'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(id, trm_nbr, strt_dte, exp_dte);
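Note that LOAD DATA INFILE reads a file located on the database server. If the file sits on the client machine instead, LOAD DATA LOCAL INFILE can be used, provided the local_infile option is enabled on both the client and the server.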
- Process the data and save it to the output file:
WITH
cte1 AS ( SELECT *,
                 LAG(id) OVER (ORDER BY rowno) lag_id,
                 1 + LAG(trm_nbr) OVER (ORDER BY rowno) lag_trm_nbr
          FROM tmp ),
cte2 AS ( SELECT *,
                 SUM(CASE WHEN (id, trm_nbr) = (lag_id, lag_trm_nbr)
                          THEN 0
                          ELSE 1
                          END) OVER (ORDER BY rowno) grp_no
          FROM cte1 )
SELECT 'ID', 'TRM_NBR', 'STRT_DTE', 'EXP_DTE'
UNION ALL
SELECT MAX(id) id,
       ROW_NUMBER() OVER (PARTITION BY MAX(id) ORDER BY grp_no) trm_nbr,
       MIN(strt_dte) strt_dte,
       MAX(exp_dte) exp_dte
FROM cte2
GROUP BY grp_no
INTO OUTFILE 'X:/folder/new_filename.CSV'
COLUMNS TERMINATED BY ';'
LINES TERMINATED BY '\r\n';
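Here cte1 attaches to every row the previous row's id and trm_nbr + 1; cte2 flags each row that breaks a sequence (a different ID, or a TRM_NBR that does not follow its predecessor) and turns the running SUM of those flags into a group number grp_no, the classic gaps-and-islands technique. The final SELECT then collapses each group into a single row and prepends the header line via UNION ALL.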
Your current MySQL account must have the FILE privilege, and the secure_file_priv system variable must be either an empty string or some drive:path (check it with a SELECT @@secure_file_priv; query). If it is not an empty string, then X:/folder in the queries must match this value (on Windows the backslashes in the path must be doubled or replaced with forward slashes). If secure_file_priv is NULL, you cannot use this method at all; in that case, ask the server administrator to set the variable and allow interaction with the filesystem.
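For example, both requirements can be checked up front:
SHOW GRANTS FOR CURRENT_USER();  -- the FILE privilege should appear in the grants
SELECT @@secure_file_priv;       -- NULL = export disabled, '' = any path allowed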
The queries use settings for a Windows-style CSV; for a Unix-style file, use LINES TERMINATED BY '\n'.
If your source file is too large (above ~100 MB, or above the max_heap_table_size system variable value), remove ENGINE = Memory from the table definition so the table is stored on disk.
You do not need to delete the temporary table; it is dropped automatically when you close the connection, though you may drop it explicitly nevertheless.
Answer:
A Python alternative based on the standard itertools module and the third-party convtools library:
import itertools
from convtools import conversion as c
from convtools.contrib.tables import Table
# read rows into iterable of dicts
rows = Table.from_csv(
    "input.csv", header=True, dialect=Table.csv_dialect(delimiter=";")
).into_iter_rows(dict)
# group rows into contiguous sequences: a new group starts whenever
# the ID changes or TRM_NBR does not increase by exactly 1
def sequence_key():
    state = {"id": None, "trm": None, "grp": 0}

    def key(row):
        trm = int(row["TRM_NBR"])
        if row["ID"] != state["id"] or trm != state["trm"] + 1:
            state["grp"] += 1
        state["id"], state["trm"] = row["ID"], trm
        return state["grp"]

    return key

key_to_rows = itertools.groupby(rows, key=sequence_key())
# prepare the converter, which takes groups of rows from the
# previous step and reduces each group to one row.
# This is where code generation happens, consider storing the
# resulting converter somewhere for further reuse.
converter = c.iter(
    c.item(1).pipe(
        c.aggregate(
            {
                "ID": c.ReduceFuncs.First(c.item("ID")),
                "LAST_TRM_NBR": c.ReduceFuncs.Last(c.item("TRM_NBR")),
                "STRT_DTE": c.ReduceFuncs.First(c.item("STRT_DTE")),
                "EXP_DTE": c.ReduceFuncs.Last(c.item("EXP_DTE")),
            }
        )
    )
).gen_converter()
# process iterable from itertools
processed_rows = converter(key_to_rows)
# output processed rows to "out.csv" (consider passing the same dialect
# as above to the .into_csv method)
Table.from_rows(processed_rows).into_csv("out.csv")
The output is one row per contiguous sequence:
ID,LAST_TRM_NBR,STRT_DTE,EXP_DTE
000020000007,1,2017-08-22 00:00:00.000,2016-09-20 00:00:00.000
000020000009,1,2015-07-23 00:00:00.000,2015-03-24 00:00:00.000
000020000017,1,2014-10-02 00:00:00.000,2014-10-02 00:00:00.000
000020000063,2,2018-11-19 00:00:00.000,2020-11-19 00:00:00.000
000020000356,3,2020-06-23 00:00:00.000,2021-10-21 00:00:00.000
000020000356,5,2014-07-22 00:00:00.000,2017-07-11 00:00:00.000
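For comparison, below is a minimal stdlib-only sketch of the same single-pass reduction (assuming the semicolon-delimited input from the question; the file names are placeholders):
import csv

with open("input.csv", newline="") as src, open("out.csv", "w", newline="") as dst:
    reader = csv.DictReader(src, delimiter=";")
    writer = csv.writer(dst, delimiter=";")
    writer.writerow(["ID", "STRT_DTE", "EXP_DTE"])

    current = None  # (ID, last TRM_NBR seen, first STRT_DTE, last EXP_DTE)
    for row in reader:
        trm = int(row["TRM_NBR"])
        if current and row["ID"] == current[0] and trm == current[1] + 1:
            # same contiguous sequence: keep the start date, extend the end date
            current = (current[0], trm, current[2], row["EXP_DTE"])
        else:
            # sequence break: write out the finished sequence and open a new one
            if current:
                writer.writerow((current[0], current[2], current[3]))
            current = (row["ID"], trm, row["STRT_DTE"], row["EXP_DTE"])
    if current:
        writer.writerow((current[0], current[2], current[3]))

Because it keeps only the currently open sequence in memory, this handles the 15M-row file in one streaming pass.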