Fairly new to Python. This seems to be a really simple question, but I can't find any information about it. I have a list of strings, and for each string I want to check whether it is present in a dataframe (actually in a particular column of the dataframe). Not whether a substring is present, but the whole exact string.
So my dataframe is something like the following:
import pandas as pd
A = pd.DataFrame(["ancestry", "time", "history"])
I should simply be able to use the "string in dataframe" method, as in
"time" in A
This returns False however. If I run
"time" == A.iloc[1]
it returns "True", but annoyingly as part of a series, and this depends on knowing where in the dataframe the corresponding string is. Is there some way I can just use the string in df method, to easily find out whether the strings in my list are in the dataframe?
CodePudding user response:
Add .values to the end:
'time' in A.values
As you've noticed, the str in pandas.DataFrame syntax doesn't produce the result you want. But str in numpy.array works as you expect, and .values converts the dataframe into a NumPy array.
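Since the question is about a whole list of strings and one particular column, here is a minimal sketch of the same idea applied to each string in turn. The list wanted is a made-up example, and the sketch uses .to_numpy(), which the pandas documentation recommends over .values for getting the underlying array; treat it as one possible way to do it rather than the definitive one.

```python
import pandas as pd

A = pd.DataFrame(["ancestry", "time", "history"])

# Hypothetical list of strings to look up (not part of the original question).
wanted = ["time", "tim", "history", "future"]

# Take the cell values of column 0 as a NumPy array, so that `in`
# tests the values rather than the labels.
column_values = A[0].to_numpy()

# Only whole-string matches count: "tim" is not found even though
# it is a substring of "time".
found = {s: (s in column_values) for s in wanted}
print(found)
```

The result is a plain dict mapping each string to True or False, which avoids having to know where in the column a match sits.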
CodePudding user response:
The way to deal with this is to compare the whole dataframe with "time". That will return a mask where each value of the DF is True if it was "time", False otherwise. Then, you can use .any() to check if there are any True values:
>>> A = pd.DataFrame(["ancestry","time","history"])
>>> A
          0
0  ancestry
1      time
2   history
>>> A == "time" # or A.eq("time")
       0
0  False
1   True
2  False
>>> (A == "time").any()
0    True
dtype: bool
Notice in the above output, (A == "time").any() returns a Series with one entry per column, indicating whether that column contained "time". If you want to check the entire dataframe (across all columns), call .any() twice:
>>> (A == "time").any().any()
True
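To make the per-column versus whole-frame distinction more visible, here is a small sketch on a frame with two columns. The column names word and other and their contents are invented for illustration; only the == and .any() calls come from the answer above.

```python
import pandas as pd

# Hypothetical two-column frame; the names and values are made up.
df = pd.DataFrame({
    "word": ["ancestry", "time", "history"],
    "other": ["space", "matter", "energy"],
})

mask = df == "time"            # boolean frame, same shape as df

per_column = mask.any()        # Series: one True/False per column
anywhere = bool(mask.any().any())   # single bool for the whole frame

# If only one particular column matters, compare that column directly.
in_word = bool((df["word"] == "time").any())
in_other = bool((df["other"] == "time").any())

print(per_column)
print(anywhere, in_word, in_other)
```

Comparing a single column gives a Series mask, so one .any() is enough in that case.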
CodePudding user response:
I believe (myseries==mystr).any() will do what you ask. The special __contains__ method of DataFrames (which determines the behavior of in) checks whether your string is a column label of the DataFrame, e.g.
>>> A = pd.DataFrame({"c": [0,1,2], "d": [3,4,5]})
>>> 'c' in A
True
>>> 0 in A
False
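Applied to the dataframe from the question, the same rule explains the surprising False. The sketch below is only an illustration: it also shows that in on a single column (a Series) looks at the index labels, not the values, which is easy to trip over.

```python
import pandas as pd

A = pd.DataFrame(["ancestry", "time", "history"])
col = A[0]  # the only column, as a Series

# `in` on a DataFrame tests the column labels.
label_in_frame = 0 in A            # 0 is a column label
value_in_frame = "time" in A       # "time" is a cell value, not a label

# `in` on a Series tests the index labels.
label_in_series = 1 in col         # 1 is an index label
value_in_series = "time" in col    # again a value, not a label

# Testing against the underlying array looks at the values.
value_in_values = "time" in col.values

print(label_in_frame, value_in_frame)
print(label_in_series, value_in_series, value_in_values)
```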
CodePudding user response:
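I would slightly modify your dataframe and use .str.contains to check where the string is present in your series. Be aware that .str.contains matches substrings (and treats the pattern as a regular expression by default), and that it returns a boolean Series rather than a single True or False.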
df = pd.DataFrame()
df['A'] = pd.Series(["ancestry", "time", "history"])
df['A'].str.contains("time")
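Because the question asks for whole-string matches, it may help to see .str.contains side by side with a plain equality comparison. This is a sketch under the assumption that the column also holds a longer string such as "timeline" (an invented row); regex=False is passed so that "time" is treated as a literal string.

```python
import pandas as pd

# "timeline" is a made-up extra row to show the substring behaviour.
df = pd.DataFrame({"A": ["ancestry", "time", "history", "timeline"]})

# Substring match: flags every row that contains "time" anywhere.
substring_hits = df["A"].str.contains("time", regex=False)

# Exact match: flags only rows equal to "time".
exact_hits = df["A"] == "time"

# Reduce either mask to a single answer with .any().
has_substring = bool(substring_hits.any())
has_exact = bool(exact_hits.any())
has_exact_tim = bool((df["A"] == "tim").any())

print(substring_hits.tolist())
print(exact_hits.tolist())
print(has_substring, has_exact, has_exact_tim)
```

For the exact-match requirement in the question, the equality comparison followed by .any() is probably the closer fit.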