Hello guys I need your help.
I have df with two columns A and B both of them are columns with string values
example:
df_1 = pd.DataFrame(data={
"A":['a','b','c'],
"B":['a x d','z y w','q m c'] #string values not a list
})
print(df_1)
#output
A B
0 a a x d
1 b z y w
2 c q m c
now what I'm trying to do is to preform loc in the df_1 to get all the row that col B cointain the string value in col A.
In this example the output i want is the first and the third rows
A B
0 a a x d # 'a x d' contain value 'a'
2 c q m c # 'q m c' contain value 'c'
I have tried different loc condition but got unhashable type: 'Series' error:
df_1.loc[df_1["B"].str.contains(df_1["A"])] #TypeError: unhashable type: 'Series'
df_1.loc[df_1["A"] in df_1["B"]] #TypeError: unhashable type: 'Series'
I really don't want to use a for/while loop because of the size of the df.
any idea how can I preform this? Thanks
CodePudding user response:
Cute!
from operator import contains
df_1.loc[map(contains, *map(df_1.get, ['B', 'A']))]
A B
0 a a x d
2 c q m c
And without the import but with an ugly dunder ... /-:
df_1.loc[map(str.__contains__, *map(df_1.get, ['B', 'A']))]
Also, to the OP... don't accept this answer. I was just golfing. You should never use this in production code as it is less than transparent due to the use of 2 maps and a splat.
And just to be super duper clear...
CodePudding user response:
There is no vectorial method, to map in
using two columns. You need to loop here:
mask = [a in b for a,b in zip(df_1['A'], df_1['B'])]
df_1.loc[mask]
Output:
A B
0 a a x d
2 c q m c
comparison of speed (3000 rows)
# operator.contains
518 µs ± 4.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# list comprehension
554 µs ± 3.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# numpy.apply_along_axis
7.32 ms ± 58.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# apply
20.7 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
CodePudding user response:
You can try:
out = df_1.loc[df_1.apply(lambda x: x['A'] in x['B'], axis=1)]
print(out)
# Output
A B
0 a a x d
2 c q m c
Alternative with numpy
:
df_1.loc[np.apply_along_axis(lambda x: x[0] in x[1], axis=1, arr=df_1)]
CodePudding user response:
When accessing two different columns in pandas you can use .apply()
df.apply(lambda row: row['A'] in row['B'], axis = 1)
This creates a Boolean Series which you can use for your loc condition which selects only the columns where 'A' is in 'B':
df.loc[df.apply(lambda row: row['A'] in row['B'], axis = 1)]
CodePudding user response:
df_1[df_1.apply(lambda x:x['A'] in x['B'],axis=1)]