Get a list of tuples of the columns that have only two values in common

I have the following problem:

To find the pairs of columns in a dataframe that hold exactly the same values, I use the code that follows.

Let's create the dataframe first:

import pandas as pd
from itertools import combinations

df_example1 = pd.DataFrame({'A':[1,2,3],'B':[1,2,3]})

Now let's look for the columns that have exactly the same values:

[(i, j) for i,j in combinations(df_example1, 2) if df_example1[i].equals(df_example1[j])]

The code returns the list [('A', 'B')].

My problem is: if I want to get the pairs of columns that have ONLY two values in common, what code should I use? Let's say that my dataframe is the following: df_example2 = pd.DataFrame({'A':[2,3,4],'B':[1,2,3]}). It should return the list [('A', 'B')].

Thank you in advance :)

CodePudding user response:

You want the intersection of both columns to contain two (or more?) values. You can use the set class and its operations for this.

import pandas as pd

df_example2 = pd.DataFrame({'A':[2,3,4],'B':[1,2,3]})
# Distinct values that appear in both columns
intersect = set(df_example2['A']).intersection(df_example2['B'])
# {2, 3}

Now, if intersect has 2 (or more?) elements, you want to select the tuple ('A', 'B').

from itertools import combinations

[(i, j)
    for i, j in combinations(df_example2, 2)
    # Use >= 2 instead of == 2 if you want 2 or more shared values
    if len(set(df_example2[i]).intersection(df_example2[j])) == 2
]
# [('A', 'B')]
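
If you need this check in more than one place, the comprehension above could be wrapped in a small function. The following is only a sketch: the name columns_sharing_values and its parameters n and exact are hypothetical, and the third column C is added purely for illustration. Keep in mind that sets collapse duplicates and ignore row position, so this counts distinct shared values rather than matching rows.

```python
from itertools import combinations

import pandas as pd


def columns_sharing_values(df, n=2, exact=True):
    """Return the column pairs whose distinct values overlap in n values.

    Hypothetical helper: the name and parameters are illustrative only.
    With exact=False, pairs sharing n or more values are returned.
    """
    pairs = []
    for i, j in combinations(df.columns, 2):
        # Distinct values present in both columns
        shared = set(df[i]) & set(df[j])
        matches = len(shared) == n if exact else len(shared) >= n
        if matches:
            pairs.append((i, j))
    return pairs


# Column C is an assumption added to show the difference between the modes
df_example3 = pd.DataFrame({'A': [2, 3, 4], 'B': [1, 2, 3], 'C': [2, 3, 4]})

exact_pairs = columns_sharing_values(df_example3, n=2)
at_least_pairs = columns_sharing_values(df_example3, n=2, exact=False)

print(exact_pairs)      # [('A', 'B'), ('B', 'C')]
print(at_least_pairs)   # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

Here A and C share three values, so that pair only shows up when exact=False.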

Note: The elements of the columns need to be hashable types, otherwise a set cannot be created from them.
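
To illustrate the note, here is a short sketch with a made-up dataframe (df_lists is an assumption, not part of the question) whose cells are lists. Building a set from such a column raises TypeError, and one possible workaround is to convert each cell to a tuple first.

```python
import pandas as pd

# Hypothetical dataframe whose cells are lists (unhashable)
df_lists = pd.DataFrame({
    'A': [[1, 2], [3], [4]],
    'B': [[0], [1, 2], [3]],
})

try:
    set(df_lists['A'])
    raised = False
except TypeError:
    # Lists are unhashable, so they cannot be set members
    raised = True

# Workaround: turn each list into a tuple before building the sets
shared = {tuple(v) for v in df_lists['A']} & {tuple(v) for v in df_lists['B']}

print(raised)        # True
print(len(shared))   # 2
```

This only works if the cell contents themselves are hashable once wrapped in a tuple; nested lists would need a deeper conversion.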