How to loop through and return any value if it is found inside any other column within a dataframe u-CodePudding

How to loop through and return any value if it is found inside any other column, and store it in a list using pandas? It doesn't matter how many times it is found, just that it is found at least one more time in a different column. If the value has repeated within the same column, it's not included in the list. Each value must be compared to every other value except from within the same column, if that makes sense.

combined_insp = []
test_df = pd.DataFrame({'area_1': ['John', 'Mike', 'Mary', 'Sarah'],
                        'area_2': ['John', 'Bob', 'Mary', 'Mary'],
                        'area_3': ['Jane', 'Sarah', 'David', 'Michael'],
                        'area_4': ['Diana', 'Mike', 'Bill', 'Bill']})

Expected output would be

combined_insp = ['John', 'Mary', 'Sarah', 'Mike']

CodePudding user response：

A solution with itertools and set algebra:

from itertools import combinations

combined_insp = set.union(*[set(test_df[c1]).intersection(test_df[c2]) 
                            for (c1, c2) in combinations(test_df.columns, 2)])

For each unique combination of columns we take the intersection of the values. Then we take the union of all the results.

CodePudding user response：

You can use pandas.apply(set) for removing duplicated elements in each list. Then You can use itertools.chain.from_iterable to flatten all elements to one list. At the end, you can use collections.Counter for counting elements and returning elements that have count > 1. (type of the result of Counter is dict and you can iterate over dict with dict.items().)

from itertools import chain
from collections import Counter
combined_insp = [k for k,v in Counter(chain.from_iterable(test_df.apply(set))).items() if v>1]
print(combined_insp)

['Sarah', 'Mike', 'Mary', 'John']

CodePudding user response：

here is one way to do it

# pd.melt to flatted the table, then use groupby and take the names that appear more than once

g=df.melt(value_name='area').drop_duplicates().groupby('area')
[key for key, group in g if (group.count() > 1).all() ]

['John', 'Mary', 'Mike', 'Sarah']

CodePudding user response：

counts = df.melt().drop_duplicates()['value'].value_counts()
answer = counts[counts > 1].index.to_list()