How to split dataframe into 2 excel files based on filter-CodePudding

New python user here. I'm trying to figure out how to split a dataframe into 2 excel files, filtered based on whether 2 of the columns contain a specific set of words ("blue" or "mozzarella"). If either of the 2 columns DO contain any of the words, I want that entire row added to one excel file. If they DO NOT contain the words, I want that entire row added to a different excel file.

>>> data

    cheese 1    cheese 2    cheese 3
0   blue        mozzarella  camembert
1   mozzarella  munster     edam
2   maccagno    mozzarella  ricotta
3   brie        berkswell   parmigiano

I've searched around and attempted the below, but it gives me an invalid index error for data.at[row, 'cheese 1'].

writer = pd.ExcelWriter('./Included cheese.xlsx', engine='xlsxwriter')
writer_2 = pd.ExcelWriter('./Excluded cheese.xlsx', engine='xlsxwriter')

pattern = 'blue|mozzarella'

for index, row in data.iterrows():
    if pattern in data.at[row, 'cheese 1']|data.at[row, 'cheese 2']:
        data.at[row:].to_excel(writer)
    else:
        data.at[row:].to_excel(writer_2)

How should I fix this to achieve my desired filtered outputs?

CodePudding user response：

If I read the question correctly, you can filter the overall dataframe into two dataframes, one with the pattern and one without. Then write the separate dataframes to excel. Using the | you can write an or statement to filter the data by checking if the pattern is in either column.

pattern_df = data[(data['cheese 1'].str.contains(pattern)) | (data['cheese 2'].str.contains(pattern))]

then you can use ~ to check for rows without the pattern, as well as change up the logic of the filter with & to find all rows without the pattern in col 1 or col 2.

non_pattern_df = data[~(data['cheese 1'].str.contains(pattern)) & ~(data['cheese 2'].str.contains(pattern))]

Hope this helps.

CodePudding user response：

Not very familiar with Pandas and how to deal with xlsx file, but just try with the code below and this works for me:

   import pandas as pd
    data = pd.read_excel("test.xlsx")
    columns = data.columns.values
    check_set = {"blue","mozzarella"}
    condational = data[columns[0]].isin(check_set) | data[columns[1]].isin(check_set)
    data_in = data[condational]
    data_out = data[~condational]
    data_in.to_excel("Included_cheese.xlsx",index=False)
    data_out.to_excel("Excluded_cheese.xlsx",index=False)

At first, I tried to write a xlsx file as in your code

writer = pd.ExcelWriter('Included_cheese.xlsx', engine='xlsxwriter')
writer_2 = pd.ExcelWriter('Excluded_cheese.xlsx', engine='xlsxwriter')
data_in.to_excel(writer)
data_out.to_excel(writer_2)

but I can not open with the written file with excel in this way.