Imagine JSON that looks like this:
[{"name": "foo", 'attr': ['bar', 'baz']}, {...}]
I can easily read this into a DataFrame, no problem. (Python 3.9.2, Pandas 1.5.1)
If I just try it, I get sqlite3.InterfaceError: Error binding parameter 2 - probably unsupported type, so it dies on the list type.
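Minimal repro, in case it helps (assuming the JSON above is saved as data.json):

import sqlite3
import pandas as pd

df = pd.read_json("data.json")    # reads fine; the 'attr' cells are Python lists
con = sqlite3.connect("test.db")
df.to_sql("test_table", con)      # raises the InterfaceError above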
What I want to do is use to_sql() or something like it to write to a normalized table, e.g.,
name attr
foo bar
foo baz
Is there an easy way of doing this? Also, I have full control over the production of the JSON, so if there is a better JSON format for Pandas, I can reshape the input file.
CodePudding user response:
If I have understood you correctly, you have multiple ways of achieving your goal.
Method 1: Using your current JSON structure
First, taking your approach, you have a list of dicts like this:
my_json = [{"name": "foo1", 'attr': ['bar1', 'baz1']}, {"name": "foo2", 'attr': ['bar2', 'baz2']}]
and you create a pandas.DataFrame object like this:
import pandas as pd
my_df = pd.DataFrame(my_json)
which gives you the following:
name attr
0 foo1 [bar1, baz1]
1 foo2 [bar2, baz2]
Now, since the attr column contains lists (which are Python objects), it cannot be exported as-is to the sqlite db. So we have to explode this column to get my_df into the desired form. For this, we apply the following:
my_df = my_df.explode('attr').reset_index(drop=True)
Now our DataFrame looks like this:
name attr
0 foo1 bar1
1 foo1 baz1
2 foo2 bar2
3 foo2 baz2
All you need to do now is export it to sqlite. For this we need an engine; I use sqlalchemy to create one:

import sqlalchemy

my_engine = sqlalchemy.create_engine('sqlite:///test.db')
my_df.to_sql("test_table_1", my_engine, if_exists="replace", index=False)
And now we have a table.
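As an optional sanity check, you can read the table straight back through the same engine:

print(pd.read_sql_table("test_table_1", my_engine))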
Method 2: Different JSON Structure
Since you said you have control over the structure of the data, I personally like doing as little computing on the service side as possible, so I think the following JSON structure would be best:
my_json2 = [
    {"name": "foo1", "attr": "bar1"},
    {"name": "foo1", "attr": "baz1"},
    {"name": "foo2", "attr": "bar2"},
    {"name": "foo2", "attr": "baz2"},
]
The downside of this is that the JSON size (i.e. the number of lines) increases, but loading it into a DataFrame is straightforward.
my_df2 = pd.DataFrame(my_json2)
which gives me:
name attr
0 foo1 bar1
1 foo1 baz1
2 foo2 bar2
3 foo2 baz2
Now we can export it in the same way as in Method 1:
my_df2.to_sql('test_table_2', my_engine, if_exists='replace', index=False)
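And since you control the producer, the flat structure is easy to emit directly. A minimal sketch, assuming the source data is a dict mapping each name to its list of attrs (the records variable here is a hypothetical stand-in for however your data actually arrives):

import json

records = {"foo1": ["bar1", "baz1"], "foo2": ["bar2", "baz2"]}  # hypothetical source data

# One row per (name, attr) pair, matching my_json2 above
flat = [{"name": name, "attr": attr}
        for name, attrs in records.items()
        for attr in attrs]

with open("data.json", "w") as f:
    json.dump(flat, f)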