Python - Loop through dataframe and create class objects


I have the following dataframe (already processed and cleaned to remove special chars, etc.).

parent_id  members_id  item_id  item_name
par_100    member1     item1    t shirt
par_100    member1     item2    denims
par_102    member2     item3    shirt
par_103    member3     item4    shorts
par_103    member3     item5    blouse
par_103    member4     item6    sweater
par_103    member4     item7    hoodie
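
For reference, the sample above can be reproduced with a small frame like this (the real sorted_data is assumed to come from the cleaning step already mentioned):

import pandas as pd

# Small stand-in for the cleaned dataframe shown above; the real
# sorted_data is assumed to come from the earlier processing step.
sorted_data = pd.DataFrame({
    "parent_id":  ["par_100", "par_100", "par_102", "par_103", "par_103", "par_103", "par_103"],
    "members_id": ["member1", "member1", "member2", "member3", "member3", "member4", "member4"],
    "item_id":    ["item1", "item2", "item3", "item4", "item5", "item6", "item7"],
    "item_name":  ["t shirt", "denims", "shirt", "shorts", "blouse", "sweater", "hoodie"],
})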

and following class structure

class Member:
    
    def __init__(self, id):
        self.member_id = id
        self.items = []
        
class Item:
    
    def __init__(self, id, name):
        self.item_id = id
        self.name = name

The number of rows in the dataframe is around 500K. I want to create a dictionary (or other structure) where "parent_id" is the primary key and the columns are mapped to the class objects. After creating this data structure, I will perform some actions based on business logic that require looping through all the members.
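
In other words, the target shape is assumed to look roughly like this (built by hand here only to illustrate; the real structure should be created from the dataframe):

# Hand-built example of the assumed target structure for par_100 only:
# parent_id -> list of Member objects, each holding its Item objects.
m = Member("member1")
m.items.append(Item("item1", "t shirt"))
m.items.append(Item("item2", "denims"))
parent_dict = {"par_100": [m]}   # ... and so on for the other parent_ids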

The first action is to create the data structure from the dataframe. I have the following code, which does the job but takes around 3 hours to process all 500K rows.

# sorted_data is the dataframe mentioned above
parent_dict = {}
parent_key_list = sorted_data['parent_id'].unique().tolist()

for parent_key in parent_key_list:
    parent_dict[parent_key] = []

    temp_data = sorted_data.loc[sorted_data['parent_id'] == parent_key]
    unique_members = temp_data["members_id"].unique()

    for us in unique_members:
        items = temp_data.loc[temp_data['members_id'] == us]
        temp_member = Member(us)

        for index, row in items.iterrows():
            temp_member.items.append(Item(row["item_id"], row["item_name"]))

        parent_dict[parent_key].append(temp_member)

Since .loc is a very expensive operation, I tried the same thing with numpy arrays, but the performance was much worse. Is there a better approach to reduce the processing time?

CodePudding user response:

You could use iterrows or itertuples to iterate over the dataframe and initialize your instances. To make it a bit easier (if you insist on classes; personally I would go with a dictionary for both members and items), I would do the following, as sketched below:

  • Add a member_id attribute to Item
  • Iterate over the dataframe and initialize only Item instances
  • Afterwards, go through all Item instances to identify the unique members and their items
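
A minimal sketch of that idea, assuming an extended Item class that also carries its members_id and parent_id (these extra attributes are not part of your original classes):

from collections import defaultdict

# Extended Item that remembers which member and parent it belongs to.
# The member_id/parent_id attributes are assumptions for this sketch.
class Item:
    def __init__(self, id, name, member_id, parent_id):
        self.item_id = id
        self.name = name
        self.member_id = member_id
        self.parent_id = parent_id

# Single pass over the dataframe with itertuples (much cheaper than repeated .loc).
all_items = [
    Item(row.item_id, row.item_name, row.members_id, row.parent_id)
    for row in sorted_data.itertuples(index=False)
]

# Group the items by (parent_id, member_id) to recover the unique members.
grouped = defaultdict(list)
for it in all_items:
    grouped[(it.parent_id, it.member_id)].append(it)

# Finally build the parent_id -> [Member, ...] structure.
parent_dict = defaultdict(list)
for (parent_id, member_id), member_items in grouped.items():
    member = Member(member_id)
    member.items.extend(member_items)
    parent_dict[parent_id].append(member)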

CodePudding user response:

Try this:

from collections import defaultdict

parent_dict = defaultdict(list)

for (parent_id, members_id), sdf in sorted_data.groupby(['parent_id', 'members_id']):
    member = Member(members_id)
    items = sdf.apply(lambda r: Item(r.item_id, r.item_name), axis=1).to_list()
    member.items.extend(items)
    parent_dict[parent_id].append(member)

It makes use of the .groupby function to partition the dataset for each (parent_id, members_id) pair. You can then create the Item objects using .apply on the sub-dataframes generated by .groupby and convert the result to a list of Item objects, which is used to fill each member's items attribute. The resulting members are stored in a defaultdict that you can convert back to a normal dict using dict() (although both work exactly the same way).
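
For instance, a quick way to inspect the result afterwards (the print format here is just illustrative):

# Convert back to a plain dict if preferred and walk the structure.
parent_dict = dict(parent_dict)
for parent_id, members in parent_dict.items():
    for member in members:
        item_names = [item.name for item in member.items]
        print(parent_id, member.member_id, item_names)

# With the sample data this should print lines like:
# par_100 member1 ['t shirt', 'denims']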
