I have the following dataframe (already processed and cleaned to remove special chars, etc.).
parent_id | members_id | item_id | item_name |
---|---|---|---|
par_100 | member1 | item1 | t shirt |
par_100 | member1 | item2 | denims |
par_102 | member2 | item3 | shirt |
par_103 | member3 | item4 | shorts |
par_103 | member3 | item5 | blouse |
par_103 | member4 | item6 | sweater |
par_103 | member4 | item7 | hoodie |
and the following class structure:

class Member:
    def __init__(self, id):
        self.member_id = id
        self.items = []

class Item:
    def __init__(self, id, name):
        self.item_id = id
        self.name = name
The dataframe has around 500K rows. I want to create a dictionary (or other structure) where "parent_id" is the primary key and the other columns are mapped to the class objects. After creating this data structure, I will perform some actions based on business logic that require looping through all the members.
The first action is to create the data structure from the dataframe. The following code does the job, but it takes around 3 hours to process all 500K rows.
# sorted_data is the dataframe mentioned above
parent_dict = {}
parent_key_list = sorted_data['parent_id'].unique().tolist()
for parent_key in parent_key_list:
    parent_dict[parent_key] = []
    temp_data = sorted_data.loc[sorted_data['parent_id'] == parent_key]
    unique_members = temp_data["members_id"].unique()
    for us in unique_members:
        items = temp_data.loc[temp_data['members_id'] == us]
        temp_member = Member(us)
        for index, row in items.iterrows():
            temp_member.items.append(Item(row["item_id"], row["item_name"]))
        parent_dict[parent_key].append(temp_member)
Since .loc is a very expensive operation, I tried the same thing with numpy arrays, but the performance was much worse. Is there a better approach to reduce the processing time?
CodePudding user response:
You could use iterrows or itertuples to iterate over the dataframe and initialize your instances. To make it a bit easier (if you insist on classes; personally, I would use a dictionary for both members and items), I would do the following:
- Add a member id attribute to Item
- Iterate over the dataframe and initialize only Item instances
- Afterwards, go over all Item instances to identify the unique members and their items
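A minimal sketch of the dictionary-based variant of this idea, built in one pass with itertuples (the small sample dataframe here stands in for the questioner's sorted_data):

```python
from collections import defaultdict

import pandas as pd

# Stand-in for the questioner's sorted_data
sorted_data = pd.DataFrame({
    "parent_id": ["par_100", "par_100", "par_102"],
    "members_id": ["member1", "member1", "member2"],
    "item_id": ["item1", "item2", "item3"],
    "item_name": ["t shirt", "denims", "shirt"],
})

# Nested dict: parent_id -> members_id -> list of item dicts.
# A single itertuples pass avoids the repeated .loc filtering.
parent_dict = defaultdict(lambda: defaultdict(list))
for row in sorted_data.itertuples(index=False):
    parent_dict[row.parent_id][row.members_id].append(
        {"item_id": row.item_id, "item_name": row.item_name}
    )
```

Each member's items are then reachable as parent_dict[parent_id][members_id], with no class instances involved.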
CodePudding user response:
Try this:
from collections import defaultdict

parent_dict = defaultdict(list)
for (parent_id, members_id), sdf in sorted_data.groupby(['parent_id', 'members_id']):
    member = Member(members_id)
    items = sdf.apply(lambda r: Item(r.item_id, r.item_name), axis=1).to_list()
    member.items.extend(items)
    parent_dict[parent_id].append(member)
It makes use of the .groupby function to partition the dataset for each member. You can then create the Item objects using .apply on the sub-dataframes generated by .groupby and convert the result to a list of Item objects with which to update each member's items attribute. The resulting members are stored in a defaultdict, which you can convert back to a normal dict with dict() (although they work exactly the same).
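If .apply is still too slow (it calls a Python lambda per row), the same grouping can be done in a single itertuples pass, tracking seen members in a lookup dict. A sketch reusing the Member and Item classes from the question, with a small sample dataframe standing in for sorted_data:

```python
from collections import defaultdict

import pandas as pd

class Member:
    def __init__(self, id):
        self.member_id = id
        self.items = []

class Item:
    def __init__(self, id, name):
        self.item_id = id
        self.name = name

# Stand-in for the questioner's sorted_data
sorted_data = pd.DataFrame({
    "parent_id": ["par_103", "par_103", "par_103"],
    "members_id": ["member3", "member3", "member4"],
    "item_id": ["item4", "item5", "item6"],
    "item_name": ["shorts", "blouse", "sweater"],
})

parent_dict = defaultdict(list)
members = {}  # (parent_id, members_id) -> Member, avoids re-scanning
for row in sorted_data.itertuples(index=False):
    key = (row.parent_id, row.members_id)
    if key not in members:
        member = members[key] = Member(row.members_id)
        parent_dict[row.parent_id].append(member)
    members[key].items.append(Item(row.item_id, row.item_name))
```

This keeps the whole build O(n) in the number of rows, with one dict lookup per row instead of a filtered sub-dataframe per member.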