I have a list like:
list_of_list = [["aa", "yy"], ["gg", "xx"]]
and a pandas dataframe like this:
month column1 column2
0 June xx aa
1 June gg xx
2 August xx yy
I go through each sub_list of the list_of_lists and check if each item of the sub_list exists in any of the columns of the dataframe.
for sub_list in list_of_lists:
print("\n")
print("Sub_list: ", sub_list)
for item in sub_list:
for col in df.columns:
if item in df[col].values:
print(df[col][df[col] == item].to_frame())
This returns me:
Sub_list: ['aa', 'yy']
column2
0 aa
column2
2 yy
Sub_list: ["gg", "xx"]
column1
1 gg
column1
0 xx
column1
2 xx
column2
1 xx
What I want is except to the above to return also the index and the corresponding month when and only when, each sub_list of the list_of_lists is not in the same index
So for list_of_list = [["aa", "yy"]]
except to the above code i want also to return
0 June aa
2 August yy
Any ideas?
CodePudding user response:
Use melt
to flat your original dataframe and create a dataframe from your list_of_list
then merge them and finally remove duplicates in two pass.
Step 1. Format your dataframes
df1 = df.melt('month', ignore_index=False).reset_index()
df2 = pd.DataFrame(list_of_list).melt(ignore_index=False)['value'].reset_index()
print(df1, df2)
# Output
# df1 (from your original df)
index month variable value
0 0 June column1 xx
1 1 June column1 gg
2 2 August column1 xx
3 0 June column2 aa
4 1 June column2 xx
5 2 August column2 yy
# df2 (from list_of_list)
index value
0 0 aa
1 1 gg
2 0 yy
3 1 xx
Step 2. Merge your two dataframes
out = df1.merge(df2, on='value')
print(out)
# Output:
index_x month variable value index_y
0 0 June column1 xx 1
1 2 August column1 xx 1
2 1 June column2 xx 1
3 1 June column1 gg 1
4 0 June column2 aa 0
5 2 August column2 yy 0
Step 3. Keep only right rows
out = out.drop_duplicates(subset=['index_x', 'index_y'], keep=False) \
.drop_duplicates('value', keep=False)
print(out)
# Output:
index_x month variable value index_y
4 0 June column2 aa 0
5 2 August column2 yy 0
Step 4. Final output
out = out.set_index('index_x').rename_axis(index=None)[['month', 'value']]
print(out)
# Output:
month value
0 June aa
2 August yy
CodePudding user response:
Could use the any
to achieve the same.
for sub_list in list_of_lists:
print("\n")
print("Sub_list: ", sub_list)
for item in sub_list:
print(df[(df == item).any(axis=1)])
CodePudding user response:
stack
the DataFrame and then loop through your lists:
df["index"] = df.index
srs = df.set_index(["index", "month"]).stack().rename("value")
for sublist in list_of_list:
print(f"\nSub list: {sublist}")
print(srs[srs.isin(sublist)].reset_index(1).reset_index(1, drop=True))
Output:
Sub list: ['aa', 'yy']
month value
index
0 June aa
2 August yy
Sub list: ['gg', 'xx']
month value
index
0 June xx
1 June gg
1 June xx
2 August xx