I have the following problem:
When I want to find the columns of a dataframe that contain exactly the same values, I use the code that follows.
Let's create the dataframe first:
df_example1 = pd.DataFrame({'A':[1,2,3],'B':[1,2,3]})
Now let's look for the columns that have exactly the same values:
[(i, j) for i,j in combinations(df_example1, 2) if df_example1[i].equals(df_example1[j])]
The code returns the list of tuples [('A', 'B')]
My problem is: what code should I use if I want the pairs of columns that have ONLY two values in common? Let's say that my dataframe is the following:
df_example2 = pd.DataFrame({'A':[2,3,4],'B':[1,2,3]})
and it should return [('A', 'B')].
Thank you in advance :)
CodePudding user response:
You want the intersection of both columns to contain two (or more?) values. You can use the set class and its operations for this.
df_example2 = pd.DataFrame({'A':[2,3,4],'B':[1,2,3]})
intersect = set(df_example2['A']).intersection(df_example2['B'])
# {2, 3}
Now, if intersect has 2 (or more?) elements, you want to select the tuple ('A', 'B'):
[(i, j)
for i,j in combinations(df_example2, 2)
if len(set(df_example2[i]).intersection(df_example2[j])) == 2
]
# Or >= 2 if you want 2 or more
# [('A', 'B')]
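Note: The elements of the columns need to be hashable types to be able to create a set. Also, a set ignores duplicates and row positions, so this counts distinct values shared anywhere in the two columns, not matches in the same row.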
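For completeness, here is a self-contained sketch of the same idea with the imports included. The helper name column_pairs_sharing_values, the min_shared parameter and the extra column 'C' are made up for illustration and are not part of the question; it assumes "2 or more" shared values is what is wanted, so swap >= for == if you need exactly two.

```python
from itertools import combinations

import pandas as pd


def column_pairs_sharing_values(df, min_shared=2):
    """Return column pairs whose distinct values overlap in at least min_shared values."""
    # Build each column's set once instead of once per pair.
    value_sets = {col: set(df[col]) for col in df.columns}
    return [
        (i, j)
        for i, j in combinations(df.columns, 2)
        # Use == instead of >= to require exactly min_shared common values.
        if len(value_sets[i] & value_sets[j]) >= min_shared
    ]


# 'C' is an extra column added here only to show a pair that does not match.
df_example2 = pd.DataFrame({'A': [2, 3, 4], 'B': [1, 2, 3], 'C': [7, 8, 9]})
pairs = column_pairs_sharing_values(df_example2)
print(pairs)  # [('A', 'B')]
```

Building the sets up front avoids recomputing them for every pair, which may matter if the dataframe has many columns.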