Pandas dataframe reports no matching string when the string is present-CodePudding

Fairly new to python. This seems to be a really simple question but I can't find any information about it. I have a list of strings, and for each string I want to check whether it is present in a dataframe (actually in a particular column of the dataframe. Not whether a substring is present, but the whole exact string.

So my dataframe is something like the following:

A=pd.DataFrame(["ancestry","time","history"])

I should simply be able to use the "string in dataframe" method, as in

"time" in A

This returns False however. If I run

"time" == A.iloc[1]

it returns "True", but annoyingly as part of a series, and this depends on knowing where in the dataframe the corresponding string is. Is there some way I can just use the string in df method, to easily find out whether the strings in my list are in the dataframe?

CodePudding user response：

Add .values to the end:

'time' in A.values

As you've noticed, the str in pandas.DataFrame syntax doesn't produce the result you want. But the str in numpy.array works as you expect, and .values transforms the dataframe into a Numpy array.

CodePudding user response：

The way to deal with this is to compare the whole dataframe with "time". That will return a mask where each value of the DF is True if it was time, False otherwise. Then, you can use .any() to check if there are any True values:

>>> A = pd.DataFrame(["ancestry","time","history"])
>>> A
          0
0  ancestry
1      time
2   history

>>> A == "time"  # or A.eq("time")
       0
0  False
1   True
2  False

>>> (A == "time").any()
0    True
dtype: bool

Notice in the above output, (A == "time").any() returns a Series where each entry is a column and whether or not that column contained time. If you want to check the entire dataframe (across all columns), call .any() twice:

>>> (A == "time").any().any()
True

CodePudding user response：

I believe (myseries==mystr).any() will do what you ask. The special __contains__ method of DataFrames (which informs behavior of in) checks whether your string is a column of the DataFrame, e.g.

>>> A = pd.DataFrame({"c": [0,1,2], "d": [3,4,5]})
>>> 'c' in A
True
>>> 0 in A
False

CodePudding user response：

I would slightly modify your dataframe and use .str.contains for checking where the string is present in your series.

df=pd.DataFrame()
df['A']=pd.Series(["ancestry","time","history"])

df['A'].str.contains("time")