I have two arrays. Array users contains some user data and has 3.5 million records. Array transactions contains some transaction data and has 140k records.
Now I want to loop through each user, look up which types of transactions they made, and save this information in a new array, let's call it Array C, so Array C will also have 3.5 million records. Later I want to use Array C to create a point cloud.
I am doing it really simply right now:
import json

with open('final/users.json', 'r') as f:
    users = json.load(f)
with open('final/transactions.json', 'r') as f:
    transactions = json.load(f)

result = []
count = 0
for user in users:
    data_elem = {'user': user, 'memberships': 0, 'credits': 0}
    print('working on: ' + str(user['id']) + ' count: ' + str(count))
    # scans all 140k transactions for every single user
    for transaction in transactions:
        if user['id'] == transaction['user_id']:
            if transaction['type'] == 'membership':
                data_elem['memberships'] = 1
            elif transaction['type'] == 'credits':
                data_elem['credits'] = 1
            else:
                print('unknown type: ' + transaction['type'])
                break
    result.append(data_elem)
    count += 1
    # Just for testing
    if count == 2000:
        break

with open('final/analysed/membership_credits_buys.json', 'w') as f:
    json.dump(result, f)
With this approach the script runs for about 2 hours, plus the time I will need to loop again and create the point cloud. (Maybe it's better to create the point cloud already in the first loop.)
I am just curious whether there is a faster approach. I have found numpy and am currently looking into how it can help, but maybe someone here knows a better library or approach.
CodePudding user response:
The file I/O part and JSON parsing cannot be optimised much further unless you move away from JSON and store your data in a relational database. But assuming this is how it has to be, we can focus on the processing part:
Instead of iterating over the complete list of transactions for each user, it is better to first create a dictionary from the users list, keyed by user id, and then populate that dictionary with the corresponding transactions. This way you only loop over the 140k transactions once instead of once per user, turning an O(n·m) algorithm into O(n + m).
A minor thing: keep the names of the keys the same, so don't translate "membership" to "memberships". That renaming is just overhead that is not worth it, and using the same names lets the transaction's type field serve directly as the dictionary key.
result = {
    user['id']: {'user': user, 'membership': 0, 'credits': 0}
    for user in users
}

# single pass over the transactions
for transaction in transactions:
    data_elem = result[transaction['user_id']]
    if transaction['type'] in ('membership', 'credits'):
        data_elem[transaction['type']] = 1
    else:
        print('unknown type: ' + transaction['type'])
        break

# convert dict to list
result = list(result.values())
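Since the question asks about numpy or other libraries: the same aggregation can also be expressed with pandas. This is only a sketch, assuming pandas is available and that the JSON records have the id, user_id and type fields used above; the dictionary approach needs nothing beyond the standard library.

import json
import pandas as pd

with open('final/users.json', 'r') as f:
    users = pd.DataFrame(json.load(f))
with open('final/transactions.json', 'r') as f:
    transactions = pd.DataFrame(json.load(f))

# 0/1 flag per user and transaction type, built in one pass over the 140k rows
flags = (
    transactions.assign(flag=1)
    .pivot_table(index='user_id', columns='type', values='flag', aggfunc='max')
    .reindex(columns=['membership', 'credits'], fill_value=0)
)

# left join keeps all 3.5M users; users or types with no transactions become 0
result = (
    users.merge(flags, left_on='id', right_index=True, how='left')
    .fillna({'membership': 0, 'credits': 0})
)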
Database design
In case you have the option to use an SQL database, the schema should be designed somewhat like this:

Users table:
- Primary key: user_id
- Other columns, like maybe name, address, dob, ...etc.

Transactions table:
- Primary key: transaction_id
- Foreign key: user_id (with a non-unique index)
- type
- Other columns, like maybe amount, transaction_date, ...etc.
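As a sketch, that schema could be created from Python with the built-in sqlite3 module; the database path and the columns beyond the keys (name, address, dob, amount, transaction_date) are illustrative assumptions:

import sqlite3

conn = sqlite3.connect('final/analysis.db')  # hypothetical path
conn.executescript("""
    CREATE TABLE Users (
        user_id INTEGER PRIMARY KEY,
        name    TEXT,
        address TEXT,
        dob     DATE
    );

    CREATE TABLE Transactions (
        transaction_id   INTEGER PRIMARY KEY,
        user_id          INTEGER NOT NULL REFERENCES Users (user_id),
        type             TEXT NOT NULL,
        amount           REAL,
        transaction_date DATE
    );

    -- the non-unique index on the foreign key mentioned above
    CREATE INDEX idx_transactions_user_id ON Transactions (user_id);
""")
conn.commit()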
Then the SQL to retrieve the result would be:
select u.user_id,
       min(u.name) name,
       min(u.address) address,
       -- ...etc...
       count(case t.type when 'membership' then 1 end) memberships,
       count(case t.type when 'credits' then 1 end) credits
  from Users u
  left join Transactions t
    on u.user_id = t.user_id
 group by u.user_id
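Run from Python, the result can then be dumped back to JSON. A sketch using sqlite3 and the hypothetical database file from the schema sketch above, trimmed to the key columns:

import json
import sqlite3

conn = sqlite3.connect('final/analysis.db')  # hypothetical path
conn.row_factory = sqlite3.Row  # rows can be converted to dicts

rows = conn.execute("""
    select u.user_id,
           count(case t.type when 'membership' then 1 end) memberships,
           count(case t.type when 'credits' then 1 end) credits
      from Users u
      left join Transactions t
        on u.user_id = t.user_id
     group by u.user_id
""").fetchall()

with open('final/analysed/membership_credits_buys.json', 'w') as f:
    json.dump([dict(row) for row in rows], f)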