In this question I got help with flatten of each row in a column of a dataframe.
[{'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137417352000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137417352000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}]
[{'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137166688000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137166688000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}]
[{'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137288213000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137288213000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}]
[{'first_open_time': {'int_value': '1653195600000', 'set_timestamp_micros': '1653193960416000'}}]
[{'ga_session_number': {'int_value': '3', 'set_timestamp_micros': '1653165977727000'}}, {'User_activity': {'string_value': '1_10', 'set_timestamp_micros': '1653109414730000'}}, {'Minutes_in_app': {'string_value': '1_10', 'set_timestamp_micros': '1653109414735000'}}, {'first_open_time': {'int_value': '1653102000000', 'set_timestamp_micros': '1653098744032000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653109414734000'}}, {'ga_session_id': {'int_value': '1653165977', 'set_timestamp_micros': '1653165977727000'}}]
The code is
df = pd.DataFrame()
for d in data:
df_tmp = pd.json_normalize(d)
row = pd.DataFrame(df_tmp.to_numpy().flatten()).T.dropna(axis=1)
row.columns = df_tmp.columns
df = pd.concat([df, row])
print(df.reset_index(drop=True))
The code turns data into
first_open_time.int_value first_open_time.set_timestamp_micros User_dedication.string_value ... Paying_user.set_timestamp_micros ga_session_id.int_value ga_session_id.set_timestamp_micros
0 1652796000000 1652792823456000 1 ... 1653136561498000 1653136552 1653136552555000
1 1652796000000 1652792823456000 1 ... 1653136561498000 1653136552 1653136552555000
2 1652796000000 1652792823456000 1 ... 1653136561498000 1653136552 1653136552555000
3 1653195600000 1653193960416000 NaN ... NaN NaN NaN
4 1653102000000 1653098744032000 1 ... NaN 1653165977 1653165977727000
This code works good but takes enormous amount of time. To go over 1 column of 1 day of data (660k rows) it took about 10 hours. And there are 2 such columns. Means python should work 24/7.
Is there a way to rewrite it to make more time efficient?
P.S. Also, to make that column I used this code:
def transformation_user_properties(row):
return [{elem['key']: elem['value']} for elem in row['user_properties']]
df['user_properties'] = df.apply(transformation_user_properties, axis=1)
Which transform colums with rows of this form
[{'a': 'b', 'c': {'c1': 'v1', 'c2': 'v2'}}, {'a': 'b1', 'c': {'c1': 'x1', 'c2': 'x2'}}, {'a': 'b2', 'c': {'c1': 'n1', 'c2': 'n2'}}]
into this form
[{'b': {'c1': 'v1', 'c2': 'v2'}}, {'b1': {'c1': 'x1', 'c2': 'x2'}}, {'b2': {'c1': 'n1', 'c2': 'n2'}}]
CodePudding user response:
You can use nested list and dict comprehension:
data = [[{'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137417352000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137417352000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}],
[{'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137166688000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137166688000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}],
[{'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137288213000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137288213000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}],
[{'first_open_time': {'int_value': '1653195600000', 'set_timestamp_micros': '1653193960416000'}}],
[{'ga_session_number': {'int_value': '3', 'set_timestamp_micros': '1653165977727000'}}, {'User_activity': {'string_value': '1_10', 'set_timestamp_micros': '1653109414730000'}}, {'Minutes_in_app': {'string_value': '1_10', 'set_timestamp_micros': '1653109414735000'}}, {'first_open_time': {'int_value': '1653102000000', 'set_timestamp_micros': '1653098744032000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653109414734000'}}, {'ga_session_id': {'int_value': '1653165977', 'set_timestamp_micros': '1653165977727000'}}]
]
L = [{f'{k}.{k1}': v1 for y in x
for k, v in y.items()
for k1, v1 in v.items()}
for x in data]
df = pd.DataFrame(L)
print (df)
first_open_time.int_value first_open_time.set_timestamp_micros \
0 1652796000000 1652792823456000
1 1652796000000 1652792823456000
2 1652796000000 1652792823456000
3 1653195600000 1653193960416000
4 1653102000000 1653098744032000
User_dedication.string_value User_dedication.set_timestamp_micros \
0 1 1653137417352000
1 1 1653137166688000
2 1 1653137288213000
3 NaN NaN
4 1 1653109414734000
User_activity.string_value User_activity.set_timestamp_micros \
0 1 1653136561498000
1 1 1653136561498000
2 1 1653136561498000
3 NaN NaN
4 1_10 1653109414730000
Minutes_in_app.string_value Minutes_in_app.set_timestamp_micros \
0 60_300 1653137417352000
1 60_300 1653137166688000
2 60_300 1653137288213000
3 NaN NaN
4 1_10 1653109414735000
ga_session_number.int_value ga_session_number.set_timestamp_micros \
0 10 1653136552555000
1 10 1653136552555000
2 10 1653136552555000
3 NaN NaN
4 3 1653165977727000
Paying_user.string_value Paying_user.set_timestamp_micros \
0 0 1653136561498000
1 0 1653136561498000
2 0 1653136561498000
3 NaN NaN
4 NaN NaN
ga_session_id.int_value ga_session_id.set_timestamp_micros
0 1653136552 1653136552555000
1 1653136552 1653136552555000
2 1653136552 1653136552555000
3 NaN NaN
4 1653165977 1653165977727000
Performance for 5k values:
data = [[{'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137417352000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137417352000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}],
[{'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137166688000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137166688000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}],
[{'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137288213000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137288213000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}],
[{'first_open_time': {'int_value': '1653195600000', 'set_timestamp_micros': '1653193960416000'}}],
[{'ga_session_number': {'int_value': '3', 'set_timestamp_micros': '1653165977727000'}}, {'User_activity': {'string_value': '1_10', 'set_timestamp_micros': '1653109414730000'}}, {'Minutes_in_app': {'string_value': '1_10', 'set_timestamp_micros': '1653109414735000'}}, {'first_open_time': {'int_value': '1653102000000', 'set_timestamp_micros': '1653098744032000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653109414734000'}}, {'ga_session_id': {'int_value': '1653165977', 'set_timestamp_micros': '1653165977727000'}}]
]*1000
In [92]: %%timeit
...: df = pd.DataFrame()
...:
...: for d in data:
...: df_tmp = pd.json_normalize(d)
...: row = pd.DataFrame(df_tmp.to_numpy().flatten()).T.dropna(axis=1)
...: row.columns = df_tmp.columns
...: df = pd.concat([df, row])
...:
...: df.reset_index(drop=True)
...:
...:
36.9 s ± 1.87 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [93]: %%timeit
...: L = [{f'{k}.{k1}': v1
...: for y in x
...: for k, v in y.items()
...: for k1, v1 in v.items()} for x in data]
...:
...: pd.DataFrame(L)
...:
...:
40.8 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)