Iterate through a list of 3 mil. JSON objects | Performance


I have 2 arrays.

Array users contains some user data and has 3.5 mil. records.

Array transactions contains some transactions data and has 140k records.

Now I want to loop through each user, look up which types of transactions he made, and save this information in a new array, let's call it Array C, so Array C will also have 3.5 mil. records.

Later I want to use Array C to create a point cloud.

I am doing it really simply right now:

import json

with open('final/users.json', 'r') as f:
  users = json.load(f)

with open('final/transactions.json', 'r') as f:
  transactions = json.load(f)

result = []
count = 0

for user in users:
  data_elem = {'user': user, 'memberships': 0, 'credits': 0}
  print('working on: ' + str(user['id']) + ' count: ' + str(count))
  for transaction in transactions:
    if user['id'] == transaction['user_id']:
      if transaction['type'] == 'membership':
        data_elem['memberships'] += 1
      elif transaction['type'] == 'credits':
        data_elem['credits'] += 1
      else:
        print('unknown type: ' + transaction['type'])
        break
  result.append(data_elem)
  count += 1
  # Just for testing
  if count == 2000:
    break

with open('final/analysed/membership_credits_buys.json', 'w') as f:
  json.dump(result, f)

With this approach the script will be active for ~2 hours, plus the time I will need to loop again and create the point cloud. (Maybe it's better to create the point cloud already in the first loop.)

I am just curious whether there is a faster approach. I have found numpy and I am currently looking into how it can help, but maybe someone here knows a better library or approach.

CodePudding user response:

The file I/O part and JSON parsing cannot be optimised further unless you move away from JSON and store your data in a relational database. But assuming this is how it has to be, we can focus on the processing part:

Instead of iterating over the complete list of transactions for each user, it is better to first create a dictionary from the users list, keyed by user id, and then populate that dictionary while looping over the transactions. This way you only need to loop over the transactions once.

A minor thing: keep the names of the keys the same as the transaction types. So don't translate "membership" to "memberships"; that way transaction['type'] can be used directly as the dictionary key, and the renaming is just overhead that is not worth it.

result = {
    user['id']: {'user': user, 'membership': 0, 'credits': 0}
    for user in users
}
for transaction in transactions:
    data_elem = result[transaction['user_id']]
    try:
        data_elem[transaction['type']] += 1
    except KeyError:
        print('unknown type: ' + transaction['type'])
        break
# convert the dict back to a list
result = list(result.values())
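
This turns roughly 3.5 million × 140 thousand id comparisons (on the order of 10^11 operations) into a single pass over each list, so the run time should drop from hours to seconds.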

Database design

If you have the option to use an SQL database, the schema should be designed somewhat like this (a DDL sketch follows the two lists):

Users table:

  • Primary key: user_id
  • Other columns, like maybe name, address, dob, ...etc

Transactions table:

  • Primary key: transaction_id
  • Foreign key: user_id (with a non-unique index)
  • type
  • Other columns, like maybe amount, transaction_date, ...etc
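
For concreteness, here is a minimal sketch of that schema as SQLite DDL, driven from Python's standard-library sqlite3 module. The database path and the name column are illustrative assumptions; only user_id, transaction_id and type come from the question:

import sqlite3

conn = sqlite3.connect('final/analysed/data.db')  # hypothetical database file
conn.executescript("""
CREATE TABLE IF NOT EXISTS Users (
    user_id INTEGER PRIMARY KEY,  -- primary key
    name    TEXT                  -- other columns (address, dob, ...) as needed
);
CREATE TABLE IF NOT EXISTS Transactions (
    transaction_id INTEGER PRIMARY KEY,
    user_id        INTEGER NOT NULL REFERENCES Users(user_id),
    type           TEXT NOT NULL  -- 'membership' or 'credits'
);
-- the non-unique index on the foreign key mentioned above
CREATE INDEX IF NOT EXISTS idx_transactions_user_id ON Transactions(user_id);
""")
conn.commit()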

Then the SQL to retrieve the result would be:

select    u.user_id,
          min(u.name) name,
          min(u.address) address,
          -- ...etc...
          count(case t.type when 'membership' then 1 end) memberships,
          count(case t.type when 'credits' then 1 end) credits
from      Users u
left join Transactions t
       on u.user_id = t.user_id
group by  u.user_id
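
As a rough end-to-end sketch under that schema, again using Python's sqlite3 and the file paths from the question (only user_id and type are loaded, since those are the only fields the question guarantees, so the name/address columns from the query above are omitted):

import json
import sqlite3

conn = sqlite3.connect('final/analysed/data.db')  # same hypothetical file as above

with open('final/users.json', 'r') as f:
    users = json.load(f)
with open('final/transactions.json', 'r') as f:
    transactions = json.load(f)

# executemany performs a fast bulk insert instead of one statement per row
conn.executemany('INSERT INTO Users (user_id) VALUES (?)',
                 [(u['id'],) for u in users])
conn.executemany('INSERT INTO Transactions (user_id, type) VALUES (?, ?)',
                 [(t['user_id'], t['type']) for t in transactions])
conn.commit()

rows = conn.execute("""
    select    u.user_id,
              count(case t.type when 'membership' then 1 end) memberships,
              count(case t.type when 'credits' then 1 end) credits
    from      Users u
    left join Transactions t on u.user_id = t.user_id
    group by  u.user_id
""").fetchall()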