Home > Software design >  Split and store dataframe but name based on unique values in specific cols
Split and store dataframe but name based on unique values in specific cols

Time:07-31

I have a dataframe like as below

data = pd.DataFrame({'email_id': ['[email protected];[email protected]','[email protected];[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]'],
             'Dept_id': [21,23,25,26,28,29,31],
             'dept_name':['Science','Chemistry','Maths','Social','Physics','Botany','Zoology'],
             'location':['KOR','ASN','ANZ','IND','AUS','NZ','NZ']})

I would like to do the below

a) Split the dataframe based the unique email_id values in the email_id column

b) Store the split data in a excel file. each email_id row will have one file.

c) Name the excel file based on corresponding unique values present in dept_id, dept_name and location columns

I tried the below

for k, v in data.groupby(['email_id']):
    dept_unique_ids = v['Dept_id'].unique()
    dept_unique_names = v['dept_name'].unique()
    location_unique = v['location'].unique()
    writer = pd.ExcelWriter(f"{k}.xlsx", engine='xlsxwriter')
    v.to_excel(writer,columns=col_list,sheet_name=f'{k}',index=False, startrow = 1)

While the above code splits the file successfully but it names it based on the email_id which is used as key. Instead I want to name the file based on dept_id, dept_name and location for that specific key.

For ex: If you take email_id = [email protected], they have two unique dept_ids which is 29 and 31, their unique dept_name is Botany and Zoology and unique location is NZ.

So, I want my file name to be 29_31_Botany_Zoology_NZ.xlsx.

Therefore, I expect my output files (for each unique email id row to have filenames like below)

enter image description here

update - error message

enter image description here

CodePudding user response:

IIUC, you can use:

for k, v in data.groupby(['email_id']):
    dept_unique_ids = '_'.join(v['Dept_id'].astype(str).unique())
    dept_unique_names = '_'.join(v['dept_name'].unique())
    location_unique = '_'.join(v['location'].unique())
    filename = '_'.join([dept_unique_ids, dept_unique_names, location_unique])
    print(filename)

    # Use a context manager
    with pd.ExcelWriter(f"{filename}.xlsx", engine='xlsxwriter') as writer:
        v.to_excel(writer,columns=col_list,sheet_name=f'{k}',index=False, startrow=1)

Output:

23_Chemistry_ASN
21_Science_KOR
25_Maths_ANZ
26_Social_IND
28_Physics_AUS
29_31_Botany_Zoology_NZ
  • Related