Return specific values of a dataframe based on a condition-CodePudding

I have a list like:

list_of_list = [["aa", "yy"], ["gg", "xx"]]

and a pandas dataframe like this:

   month   column1  column2
0  June     xx        aa               
1  June     gg        xx                
2  August   xx        yy

I go through each sub_list of the list_of_lists and check if each item of the sub_list exists in any of the columns of the dataframe.

for sub_list in list_of_lists:
    print("\n")
    print("Sub_list: ", sub_list)
    for item in sub_list:
        for col in df.columns:
            if item in df[col].values:
                print(df[col][df[col] == item].to_frame())

This returns me:

Sub_list:  ['aa', 'yy']

  column2
0      aa

  column2
2      yy



Sub_list:  ["gg", "xx"]

  column1
1      gg

  column1
0      xx

  column1
2      xx

  column2
1      xx

What I want is except to the above to return also the index and the corresponding month when and only when, each sub_list of the list_of_lists is not in the same index

So for list_of_list = [["aa", "yy"]] except to the above code i want also to return

0  June   aa
2  August yy

Any ideas?

CodePudding user response：

Use melt to flat your original dataframe and create a dataframe from your list_of_list then merge them and finally remove duplicates in two pass.

Step 1. Format your dataframes

df1 = df.melt('month', ignore_index=False).reset_index()
df2 = pd.DataFrame(list_of_list).melt(ignore_index=False)['value'].reset_index()
print(df1, df2)

# Output
# df1 (from your original df)
   index   month variable value
0      0    June  column1    xx
1      1    June  column1    gg
2      2  August  column1    xx
3      0    June  column2    aa
4      1    June  column2    xx
5      2  August  column2    yy

# df2 (from list_of_list)
   index value
0      0    aa
1      1    gg
2      0    yy
3      1    xx

Step 2. Merge your two dataframes

out = df1.merge(df2, on='value')
print(out)

# Output:
   index_x   month variable value  index_y
0        0    June  column1    xx        1
1        2  August  column1    xx        1
2        1    June  column2    xx        1
3        1    June  column1    gg        1
4        0    June  column2    aa        0
5        2  August  column2    yy        0

Step 3. Keep only right rows

out = out.drop_duplicates(subset=['index_x', 'index_y'], keep=False) \
         .drop_duplicates('value', keep=False)
print(out)

# Output:
   index_x   month variable value  index_y
4        0    June  column2    aa        0
5        2  August  column2    yy        0

Step 4. Final output

out = out.set_index('index_x').rename_axis(index=None)[['month', 'value']]
print(out)

# Output:
    month value
0    June    aa
2  August    yy

CodePudding user response：

Could use the any to achieve the same.

for sub_list in list_of_lists:
    print("\n")
    print("Sub_list: ", sub_list)
    for item in sub_list:
        print(df[(df == item).any(axis=1)])

CodePudding user response：

stack the DataFrame and then loop through your lists:

df["index"] = df.index
srs = df.set_index(["index", "month"]).stack().rename("value")
for sublist in list_of_list:
    print(f"\nSub list: {sublist}")
    print(srs[srs.isin(sublist)].reset_index(1).reset_index(1, drop=True))

Output:

Sub list: ['aa', 'yy']
        month value
index              
0        June    aa
2      August    yy

Sub list: ['gg', 'xx']
        month value
index              
0        June    xx
1        June    gg
1        June    xx
2      August    xx