I am new to python programming, and I am trying to think how to create multiple excels from a data frame. I have a Pandas data frame as shown below,
Invoice No. | Voucher ID | Claimed |
---|---|---|
MHI000000038710 | 100039 | Yes |
MHI000000038715 | 100039 | No |
MHI000000038711 | 100043 | Yes |
MHI000000038712 | 100043 | No |
For every Voucher ID, I am trying to create excels for every Invoice No. from data frame where Claimed = Yes. But every excel should contain all the rows for same Voucher ID irrespective of Claimed is Yes/No. For the above example, the output would be 2 excels named after each Invoice No. where Claimed = Yes(i.e., MHI000000038710and MHI000000038711).
And each excel should contains rows grouped by Voucher ID
Excel 1(MHI000000038710.xlsx): -
Invoice No. | Voucher ID | Claimed |
---|---|---|
MHI000000038710 | 100039 | Yes |
MHI000000038715 | 100039 | No |
Excel 2(MHI000000038711.xlsx): -
Invoice No. | Voucher ID | Claimed |
---|---|---|
MHI000000038711 | 100043 | Yes |
MHI000000038712 | 100043 | No |
The script that I have currently is creating excel for all the invoices. I am not able to figure out how not create invoices where Claimed = No but still have all the rows in the invoices created for Claimed = Yes
for myid in df['Voucher ID'].unique():
df_singleID = df[df['Voucher ID']==myid]
for myinvoice in df_singleID['Invoice No'].unique():
df_singleID.to_excel(output "\\" str(myinvoice) '.xlsx',index = False)
CodePudding user response:
What about using pandas.DataFrame.groupby
?
for _, g in df.groupby("Voucher ID"):
filename = g.loc[g["Claimed"].eq("Yes"), "Invoice No."].iloc[0]
g.to_excel(f"{filename}.xlsx", index=False)
NB: The filename is calculated with boolean indexing.
CodePudding user response:
You wont get the index error using .query() and list comps:
from itertools import zip_longest
import pandas as pd
path = "path/to/file"
invoices = df.query("Claimed.eq('Yes')")["Invoice No."].tolist()
dfs = [x.reset_index(drop=True) for _, x in df.groupby("Voucher ID")]
[df.to_excel(f"{path}/{file_name}.xlsx", index=False) for file_name, df in zip_longest(invoices, dfs)]