Source data
I am working on an ELT project to load data from CSV files into PostgreSQL, where I will transform it. The CSV files have many columns that are consistent across files, but they also contain activity columns with inconsistent names like Date (05/19/2020), Type (05/19/2020), etc.
In the loading script I am merging all of the columns with dates in their names into one jsonb column so I don't have to constantly add new columns to the raw data table.
The resulting jsonb column in the raw data table looks like this:
id | activity |
---|---|
12345678 | {"Date (05/19/2020)": null, "Type (05/19/2020)": null, "Date (06/03/2020)": "06/01/2020", "Type (06/03/2020)": "E"} |
98765432 | {"Date (05/19/2020)": "05/18/2020", "Type (05/19/2020)": "B", "Date (10/23/2020)": "10/26/2020", "Type (10/23/2020)": "T"} |
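The loading script effectively builds that jsonb value from the date-suffixed columns. As a simplified sketch (the staging table name and the hard-coded column list below are illustrative only; the real script assembles the object dynamically):
insert into rawinput (id, activity)
select id,
       -- collapse the date-suffixed columns into a single jsonb value
       jsonb_build_object(
           'Date (05/19/2020)', "Date (05/19/2020)",
           'Type (05/19/2020)', "Type (05/19/2020)",
           'Date (06/03/2020)', "Date (06/03/2020)",
           'Type (06/03/2020)', "Type (06/03/2020)"
       )
  from staging_csv;  -- hypothetical staging table loaded straight from the CSV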
JSON to columns
Using the amazing create_jsonb_flat_view function from this post, I can convert the jsonb to columns like this:
id | Date (05/19/2020) | Type (05/19/2020) | Date (06/03/2020) | Type (06/03/2020) | Date (10/23/2020) | Type (10/23/2020) |
---|---|---|---|---|---|---|
12345678 | null | null | 06/01/2020 | E | | |
98765432 | 05/18/2020 | B | | | 10/26/2020 | T |
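(For comparison, the same expansion can be written by hand with the ->> operator for a few known keys; this is just an illustration of what the flat view yields, not the linked function itself:)
-- hand-written equivalent for two of the keys, for illustration only
select id,
       activity ->> 'Date (05/19/2020)' as "Date (05/19/2020)",
       activity ->> 'Type (05/19/2020)' as "Type (05/19/2020)"
  from rawinput;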
Need to move part of column name to row
Now, this is where I'm stuck. I need to remove the portion of the column name that is the Activity Date (e.g. (05/19/2020)) and create a row for each id and ActivityDate, with additional columns for Date and Type, like this:
id | ActivityDate | Date | Type |
---|---|---|---|
12345678 | 05/19/2020 | null | null |
12345678 | 06/03/2020 | 06/01/2020 | E |
98765432 | 05/19/2020 | 05/18/2020 | B |
98765432 | 10/23/2020 | 10/26/2020 | T |
CodePudding user response:
I followed your link to the create_jsonb_flat_view article yesterday and then forgot this question. While I thank you for pointing me there, I think that mentioning it worked against you.
A more conventional approach using regexp_replace() works here. I left the date values as strings, but you can convert them with to_date() if needed:
with parse as (
    select id, e.k, e.v,
           -- strip the " (mm/dd/yyyy)" suffix, leaving just "Date" or "Type"
           regexp_replace(e.k, '\s+\([0-9/]{10}\)', '') as k_no_date,
           -- keep only the mm/dd/yyyy portion of the key
           regexp_replace(e.k, '^.+([0-9/]{10}).+', '\1') as k_date_only
      from rawinput
     cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
       k_date_only as activity_date,
       min(v) filter (where k_no_date = 'Date') as date,
       min(v) filter (where k_no_date = 'Type') as type
  from parse
 group by id, k_date_only;
db<>fiddle here
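If you want typed dates rather than text, the final select of the query above can wrap the values in to_date(); a sketch, assuming the MM/DD/YYYY format shown in your sample data:
-- same final select as above, reusing the parse CTE, with to_date() applied
select id,
       to_date(k_date_only, 'MM/DD/YYYY') as activity_date,
       to_date(min(v) filter (where k_no_date = 'Date'), 'MM/DD/YYYY') as date,
       min(v) filter (where k_no_date = 'Type') as type
  from parse
 group by id, k_date_only;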
CodePudding user response:
@Mike-Organek's answer works beautifully!
However, I was curious whether the regexp_replace() calls might be slowing the query down a bit, and it seemed I could get the same results using simpler functions. Since Mike gave me a great example to start with, I modified it to split on the space between Date and (05/19/2020).
For 20,000 rows, the query went from taking an average of 7 seconds on my local machine to an average of 0.9 seconds. Here is the resulting query:
with parse as (
    select id, e.k, e.v,
           -- everything before the space: 'Date' or 'Type'
           split_part(e.k, ' ', 1) as k_no_date,
           -- everything after the space, with the parentheses trimmed off
           trim(split_part(e.k, ' ', 2), '()') as k_date_only
      from rawinput
     cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
       k_date_only as activity_date,
       min(v) filter (where k_no_date = 'Date') as date,
       min(v) filter (where k_no_date = 'Type') as type
  from parse
 group by id, k_date_only;
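To see what the two string functions do with one of the keys (a quick sanity check, not part of the query):
-- split_part() takes the word before the space; trim(..., '()') strips the parentheses
select split_part('Date (05/19/2020)', ' ', 1)             as k_no_date,    -- 'Date'
       trim(split_part('Date (05/19/2020)', ' ', 2), '()') as k_date_only;  -- '05/19/2020'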