I have a pandas Series containing a large number of JSON responses, and each JSON string is fairly big.
Sample of the data:
{0: '{"city":"Campinas","bot-origin":null,"campaign-source":null,"lastState":"productAvailabilityCpfRequest","main-installation-date":"22/09/2021","userid":"[email protected]","full-name":"Claudenice lôbo da silva","alternative-installation-date":"23/09/2021","chosen-product":"Internet","bank":null,"postalcode":"13056015","due-date":"20","cpf":"30979696836","origin-link":null,"payment":"boleto","state":"SP","api-orders-hash-id":null,"email":"[email protected]","plan-name":null,"userphone":"19 98715-0491","plan-offer":null,"completed-address":"13056015 - AV FERNANDO PAOLIERI, 182 - JARDIM PLANALTO DE VIRACOPOS, Campinas - SP","type-of-person":"CPF","type-of-product":"Residencial","main-installation-period-day":"manhã","plan-value":null,"alternative-installation-period-day":"manhã"}', 1: '{"city":"Campinas","bot-origin":null,"campaign-source":null,"lastState":"productAvailabilityStart","main-installation-date":"22/09/2021","userid":"[email protected]","full-name":"Claudenice lôbo da silva","alternative-installation-date":"23/09/2021","chosen-product":"Internet","bank":null,"postalcode":"13056015","due-date":"20","cpf":"30979696836","origin-link":null,"payment":"boleto","state":"SP","api-orders-hash-id":null,"email":"[email protected]","plan-name":null,"userphone":"19 98715-0491","plan-offer":null,"completed-address":"13056015 - AV FERNANDO PAOLIERI, 182 - JARDIM PLANALTO DE VIRACOPOS, Campinas - SP","type-of-person":"CPF","type-of-product":"Residencial","main-installation-period-day":"manhã","plan-value":null,"alternative-installation-period-day":"manhã"}', 2: '{"city":"Campinas","bot-origin":null,"campaign-source":null,"lastState":"cpfValidationTrue","main-installation-date":"22/09/2021","userid":"[email protected]","full-name":"Claudenice lôbo da 
silva","alternative-installation-date":"23/09/2021","chosen-product":"Internet","bank":null,"postalcode":"13056015","due-date":"20","cpf":"30979696836","origin-link":null,"payment":"boleto","state":"SP","api-orders-hash-id":null,"email":"[email protected]","plan-name":null,"userphone":"19 98715-0491","plan-offer":null,"completed-address":"13056015 - AV FERNANDO PAOLIERI, 182 - JARDIM PLANALTO DE VIRACOPOS, Campinas - SP","type-of-person":"CPF","type-of-product":"Residencial","main-installation-period-day":"manhã","plan-value":null,"alternative-installation-period-day":"manhã"}', 3: '{"city":"Campinas","bot-origin":null,"campaign-source":null,"lastState":"productAvailabilityCpfRequest","main-installation-date":"22/09/2021","userid":"[email protected]","full-name":"Claudenice lôbo da silva","alternative-installation-date":"23/09/2021","chosen-product":"Internet","bank":null,"postalcode":"13056015","due-date":"20","cpf":"30979696836","origin-link":null,"payment":"boleto","state":"SP","api-orders-hash-id":null,"email":"[email protected]","plan-name":null,"userphone":"19 98715-0491","plan-offer":null,"completed-address":"13056015 - AV FERNANDO PAOLIERI, 182 - JARDIM PLANALTO DE VIRACOPOS, Campinas - SP","type-of-person":"CPF","type-of-product":"Residencial","main-installation-period-day":"manhã","plan-value":null,"alternative-installation-period-day":"manhã"}', 4: '{"city":"Campinas","bot-origin":null,"campaign-source":null,"lastState":"productAvailabilityStart","main-installation-date":"22/09/2021","userid":"[email protected]","full-name":"Claudenice lôbo da silva","alternative-installation-date":"23/09/2021","chosen-product":"Internet","bank":null,"postalcode":"13056015","due-date":"20","cpf":"30979696836","origin-link":null,"payment":"boleto","state":"SP","api-orders-hash-id":null,"email":"[email protected]","plan-name":null,"userphone":"19 98715-0491","plan-offer":null,"completed-address":"13056015 - AV FERNANDO PAOLIERI, 182 - JARDIM PLANALTO DE VIRACOPOS, Campinas 
- SP","type-of-person":"CPF","type-of-product":"Residencial","main-installation-period-day":"manhã","plan-value":null,"alternative-installation-period-day":"manhã"}'}
I ran into a few issues trying to load only portions of each JSON object efficiently: the pandas functions I tried used too much memory, and processing was too slow.
So I wrote the following code:
import orjson
import pandas as pd

def dataset_extras(extras, *args):
    # extras: the Series of JSON strings being passed
    # args: the keys you want to extract
    l = []
    for i in extras:
        l.append({arg: orjson.loads(i).get(arg) for arg in args})
    return pd.DataFrame.from_records(l)

# Sample call
dataset_extras(df.Extras, 'city', 'campaign-source', 'api-orders-hash-id')
This got around a lot of the performance issues, but I was wondering whether there is an even more efficient way of transforming portions of a Series of JSON responses into a pd.DataFrame. I would appreciate feedback on how I could improve this code.
CodePudding user response:
As commented, you can probably optimize things quite a bit by parsing each JSON string once per row, rather than parsing it again for every arg:
def dataset_extras(
    json_strings,
    keys,
):
    records = []
    for json_string in json_strings:
        datum = orjson.loads(json_string)
        records.append({key: datum.get(key) for key in keys})
    return pd.DataFrame.from_records(records)

x = dataset_extras(df.Extras, ["city", "campaign-source", "api-orders-hash-id"])
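To make the difference concrete, here is a minimal sketch that counts how often the parser runs under each pattern. It uses the standard-library json module as a stand-in for orjson so it runs without extra dependencies; count_parses, the parse_per_key flag, and the sample rows are made up for illustration and are not part of the original code.

```python
import json


def count_parses(json_strings, keys, parse_per_key):
    """Build the records and report how many times the parser was called."""
    calls = 0

    def loads(s):
        # Thin wrapper around json.loads that counts invocations.
        nonlocal calls
        calls += 1
        return json.loads(s)

    records = []
    for s in json_strings:
        if parse_per_key:
            # Pattern from the question: the string is re-parsed for every key.
            records.append({k: loads(s).get(k) for k in keys})
        else:
            # Pattern from this answer: parse once, then look up each key.
            datum = loads(s)
            records.append({k: datum.get(k) for k in keys})
    return records, calls


# Hypothetical sample: 1000 identical rows, 3 requested keys.
sample = ['{"city": "Campinas", "state": "SP", "bank": null}'] * 1000
keys = ["city", "state", "bank"]

slow_records, slow_calls = count_parses(sample, keys, parse_per_key=True)
fast_records, fast_calls = count_parses(sample, keys, parse_per_key=False)
print(slow_calls, fast_calls)  # 3000 1000
```

Both variants produce the same records; the per-key variant just does len(keys) times as much parsing, which is where most of the time goes when the JSON strings are big.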
Another approach might be to build the DataFrame from a dict of lists. You'll have to measure whether this is faster than from_records.
def dataset_extras(
    json_strings,
    keys,
):
    columns = {col: [] for col in keys}
    for json_string in json_strings:
        datum = orjson.loads(json_string)
        for key in keys:
            columns[key].append(datum.get(key))
    return pd.DataFrame(columns)
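One way to do that measurement is a rough timeit comparison like the sketch below. The names via_records and via_columns, the synthetic rows, and the row count are assumptions for illustration; the script falls back to the standard-library json module if orjson is not installed. Timings depend heavily on your real data, so treat the numbers as indicative only.

```python
import json
import timeit

import pandas as pd

try:
    import orjson
    loads = orjson.loads
except ImportError:
    # Assumption: fall back to the stdlib parser so the sketch still runs.
    loads = json.loads


def via_records(json_strings, keys):
    # Row-oriented: one dict per row, then DataFrame.from_records.
    records = []
    for s in json_strings:
        datum = loads(s)
        records.append({k: datum.get(k) for k in keys})
    return pd.DataFrame.from_records(records)


def via_columns(json_strings, keys):
    # Column-oriented: one list per key, then the DataFrame constructor.
    columns = {k: [] for k in keys}
    for s in json_strings:
        datum = loads(s)
        for k in keys:
            columns[k].append(datum.get(k))
    return pd.DataFrame(columns)


# Synthetic stand-in for df.Extras: 10,000 identical small rows.
row = json.dumps({
    "city": "Campinas",
    "campaign-source": None,
    "api-orders-hash-id": None,
    "state": "SP",
})
extras = pd.Series([row] * 10_000)
keys = ["city", "campaign-source", "api-orders-hash-id"]

frame_a = via_records(extras, keys)
frame_b = via_columns(extras, keys)

runs = 5
for fn in (via_records, via_columns):
    seconds = timeit.timeit(lambda: fn(extras, keys), number=runs)
    print(f"{fn.__name__}: {seconds / runs:.4f} s per call")
```

Run it against a slice of your actual Series rather than the synthetic rows before drawing conclusions, since string size and the number of keys both shift the balance.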