I have the following dataframe (already processed and cleaned to remove special chars, etc.).
parent_id | members_id | item_id | item_name |
---|---|---|---|
par_100 | member1 | item1 | t shirt |
par_100 | member1 | item2 | denims |
par_102 | member2 | item3 | shirt |
par_103 | member3 | item4 | shorts |
par_103 | member3 | item5 | blouse |
par_103 | member4 | item6 | sweater |
par_103 | member4 | item7 | hoodie |
and the following class structure:

class Member:
    def __init__(self, id):
        self.member_id = id
        self.items = []

class Item:
    def __init__(self, id, name):
        self.item_id = id
        self.name = name
The dataframe has around 500K rows. I want to create a dictionary (or other structure) where "parent_id" is the primary key and the other columns are mapped to the class objects. After creating this data structure, I will perform some actions based on business logic that require looping through all the members.
The first action is to create the data structure from the dataframe. The following code does the job, but it takes around 3 hours to process all 500K rows.
# sorted_data is the dataframe mentioned above
parent_dict = {}
parent_key_list = sorted_data['parent_id'].unique().tolist()
for parent_key in parent_key_list:
    parent_dict[parent_key] = []
    temp_data = sorted_data.loc[sorted_data['parent_id'] == parent_key]
    unique_members = temp_data["members_id"].unique()
    for us in unique_members:
        items = temp_data.loc[temp_data['members_id'] == us]
        temp_member = Member(us)
        for index, row in items.iterrows():
            temp_member.items.append(Item(row["item_id"], row["item_name"]))
        parent_dict[parent_key].append(temp_member)
Since .loc is a very expensive operation, I tried the same thing with numpy arrays, but the performance was much worse. Is there a better approach to reduce the processing time?
CodePudding user response:
You could use iterrows or itertuples to iterate over the dataframe and initialize your instances. To make it a bit easier (if you insist on classes; personally, I would use a dictionary for both members and items), I would do the following:
- Add a member id attribute to Item
- Iterate over the dataframe and initialize only Item instances
- Afterwards, go over all Item instances to identify the unique members and their items
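A minimal sketch of the dictionary-based variant of this idea, built in one pass with itertuples (the small sample dataframe here stands in for the questioner's sorted_data):

```python
from collections import defaultdict

import pandas as pd

# Stand-in for the questioner's sorted_data
sorted_data = pd.DataFrame({
    "parent_id": ["par_100", "par_100", "par_102"],
    "members_id": ["member1", "member1", "member2"],
    "item_id": ["item1", "item2", "item3"],
    "item_name": ["t shirt", "denims", "shirt"],
})

# Nested dict: parent_id -> members_id -> list of item dicts.
# A single itertuples pass avoids the repeated .loc filtering.
parent_dict = defaultdict(lambda: defaultdict(list))
for row in sorted_data.itertuples(index=False):
    parent_dict[row.parent_id][row.members_id].append(
        {"item_id": row.item_id, "item_name": row.item_name}
    )
```

Each member's items are then reachable as parent_dict[parent_id][members_id], with no class instances involved.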
CodePudding user response:
Try this:
from collections import defaultdict

parent_dict = defaultdict(list)
for (parent_id, members_id), sdf in sorted_data.groupby(['parent_id', 'members_id']):
    member = Member(members_id)
    items = sdf.apply(lambda r: Item(r.item_id, r.item_name), axis=1).to_list()
    member.items.extend(items)
    parent_dict[parent_id].append(member)
It makes use of the .groupby function to partition the dataset for each member. You can then create the Item objects using .apply on the sub-dataframes generated by .groupby and convert the result to a list of Item objects with which to update each member's items attribute. The resulting members are stored in a defaultdict, which you can convert back to a normal dict with dict() (although they work exactly the same).
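If .apply is still too slow (it calls a Python lambda per row), the same grouping can be done in a single itertuples pass, tracking seen members in a lookup dict. A sketch reusing the Member and Item classes from the question, with a small sample dataframe standing in for sorted_data:

```python
from collections import defaultdict

import pandas as pd

class Member:
    def __init__(self, id):
        self.member_id = id
        self.items = []

class Item:
    def __init__(self, id, name):
        self.item_id = id
        self.name = name

# Stand-in for the questioner's sorted_data
sorted_data = pd.DataFrame({
    "parent_id": ["par_103", "par_103", "par_103"],
    "members_id": ["member3", "member3", "member4"],
    "item_id": ["item4", "item5", "item6"],
    "item_name": ["shorts", "blouse", "sweater"],
})

parent_dict = defaultdict(list)
members = {}  # (parent_id, members_id) -> Member, avoids re-scanning
for row in sorted_data.itertuples(index=False):
    key = (row.parent_id, row.members_id)
    if key not in members:
        member = members[key] = Member(row.members_id)
        parent_dict[row.parent_id].append(member)
    members[key].items.append(Item(row.item_id, row.item_name))
```

This keeps the whole build O(n) in the number of rows, with one dict lookup per row instead of a filtered sub-dataframe per member.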