How to preform loc with one condition that include two columns-CodePudding

Hello guys I need your help.

I have df with two columns A and B both of them are columns with string values

example:

df_1 = pd.DataFrame(data={
    "A":['a','b','c'],
    "B":['a x d','z y w','q m c'] #string values not a list
})
print(df_1)

#output
   A      B
0  a  a x d
1  b  z y w
2  c  q m c

now what I'm trying to do is to preform loc in the df_1 to get all the row that col B cointain the string value in col A.

In this example the output i want is the first and the third rows

   A      B
0  a  a x d # 'a x d' contain value 'a'
2  c  q m c # 'q m c' contain value 'c'

I have tried different loc condition but got unhashable type: 'Series' error:

df_1.loc[df_1["B"].str.contains(df_1["A"])] #TypeError: unhashable type: 'Series'
df_1.loc[df_1["A"] in df_1["B"]] #TypeError: unhashable type: 'Series'

I really don't want to use a for/while loop because of the size of the df.

any idea how can I preform this? Thanks

CodePudding user response：

Cute!

from operator import contains

df_1.loc[map(contains, *map(df_1.get, ['B', 'A']))]

   A      B
0  a  a x d
2  c  q m c

And without the import but with an ugly dunder ... /-:

df_1.loc[map(str.__contains__, *map(df_1.get, ['B', 'A']))]

Also, to the OP... don't accept this answer. I was just golfing. You should never use this in production code as it is less than transparent due to the use of 2 maps and a splat.

And just to be super duper clear...

mozway's is the better one

CodePudding user response：

There is no vectorial method, to map in using two columns. You need to loop here:

mask = [a in b for a,b in zip(df_1['A'], df_1['B'])]

df_1.loc[mask]

Output:

   A      B
0  a  a x d
2  c  q m c

comparison of speed (3000 rows)

# operator.contains
518 µs ± 4.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# list comprehension
554 µs ± 3.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# numpy.apply_along_axis
7.32 ms ± 58.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# apply
20.7 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

CodePudding user response：

You can try:

out = df_1.loc[df_1.apply(lambda x: x['A'] in x['B'], axis=1)]
print(out)

# Output
   A      B
0  a  a x d
2  c  q m c

Alternative with numpy:

df_1.loc[np.apply_along_axis(lambda x: x[0] in x[1], axis=1, arr=df_1)]

CodePudding user response：

When accessing two different columns in pandas you can use .apply()

df.apply(lambda row: row['A'] in row['B'], axis = 1)

This creates a Boolean Series which you can use for your loc condition which selects only the columns where 'A' is in 'B':

df.loc[df.apply(lambda row: row['A'] in row['B'], axis = 1)]

CodePudding user response：

df_1[df_1.apply(lambda x:x['A'] in x['B'],axis=1)]