I have a dataframe that looks like this:
id sentences ind tar
0 In samples of depression injected intraneously... depression albumin
0 Monomethylmethacrylate in whole blood was asso... depression albumin
1 In samples of depression injected intraneously... depression hip
1 Monomethylmethacrylate in whole blood was asso... depression hip
2 The GVH kinetics and cellular characteristics ... GVH,GVH,GVH,GVH... PFC
2 Effects on PFCgeneword responses to thymus-dep... GVH,GVH,GVH,GVH... PFC
2 The unresponsive state which developed in GVHg... GVH,GVH,GVH,GVH... PFC
2 Furthermore, GVHgeneword spleen cells suppress... GVH,GVH,GVH,GVH... PFC
2 This active suppressor effect was found to be ... GVH,GVH,GVH,GVH... PFC
2 The delayed transfer of GVHgeneword cells to i... GVH,GVH,GVH,GVH... PFC
I want to keep only the rows whose sentence contains either the ind or the tar value.
The problem is that when ind or tar holds more than one element, the row does not match even if one of those elements appears in the sentence, because the whole string is used as a single search term. For example, in the 5th row the word GVH appears in the sentence, but the value tested as ind is the whole string GVH,GVH,GVH,GVH rather than each GVH term separately. How can I fix this? Here's my code so far:
df['check_ind'] = df.apply(lambda x: x.ind in x.sentences, axis=1)
df['check_tar'] = df.apply(lambda x: x.tar in x.sentences, axis=1)
df = df.loc[(df['check_ind'] == True) | (df['check_tar'] == True)]
print(df.sentences.iloc[4], '\n')
print(df.ind.iloc[4], '\n')
print(df.tar.iloc[4], '\n')
print(df.check_ind.iloc[4], '\n')
print(df.check_tar.iloc[4], '\n')
>>>> The GVH kinetics and cellular characteristics indicated that suppressor T cells exert an anti-mitotic influence on antigen-stimulated B-cell proliferation. .
>>>> GVH,GVH,GVH,GVH,GVH,GVH
>>>> PFC
>>>> False (This should return TRUE since GVH is in the sentence)
>>>> False
CodePudding user response:
Your code currently treats x.ind as if it were a single value. Conceptually, x.ind is not a single value but a comma-separated list of values.
In Python, you can turn a comma-separated string into an actual list using x.ind.split(','). In addition, str.strip() is useful for removing stray spaces (for instance, if you have "GVH ,GVH ", the spaces should probably be ignored).
Finally, the builtin functions any and all are convenient for applying a condition across every element of a list.
df['check_ind'] = df.apply(lambda x: any(v.strip() in x.sentences for v in x.ind.split(',')), axis=1)
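Here is a minimal, self-contained sketch of the same idea applied to both columns. The sample data and the helper name contains_any are made up for illustration; only the column names come from the question. It also skips empty terms, since an empty string is "in" every sentence and would make any() return True for a value such as "GVH,,".

```python
import pandas as pd

# Hypothetical sample data that mirrors the question's columns.
df = pd.DataFrame({
    "id": [0, 1, 2],
    "sentences": [
        "In samples of depression injected ...",
        "The GVH kinetics and cellular characteristics ...",
        "Monomethylmethacrylate in whole blood ...",
    ],
    "ind": ["depression", "GVH ,GVH, GVH", "GVH,GVH"],
    "tar": ["albumin", "PFC", "hip"],
})


def contains_any(terms, sentence):
    """Return True if at least one comma-separated term occurs in the sentence."""
    # Strip spaces around each term and ignore empty terms.
    return any(t.strip() in sentence for t in terms.split(",") if t.strip())


df["check_ind"] = df.apply(lambda r: contains_any(r["ind"], r["sentences"]), axis=1)
df["check_tar"] = df.apply(lambda r: contains_any(r["tar"], r["sentences"]), axis=1)

# Keep rows where either column has a term present in the sentence.
filtered = df[df["check_ind"] | df["check_tar"]]
```

Note that this is plain substring matching, so GVH also matches inside a longer token such as GVHgeneword.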
CodePudding user response:
You could first concat the "ind" and "tar" columns so that you only need one evaluation.
Then use str.split + explode + apply with an evaluator to check whether any "ind" or "tar" term exists in the sentence. Finally, groupby + any gets you back to the original shape:
new_df = pd.concat((df[['id','sentences','ind']], df[['id','sentences','tar']].rename(columns={'tar':'ind'})))
new_df['ind'] = new_df['ind'].str.split(',')
msk = new_df.explode('ind').apply(lambda x: x['ind'] in x['sentences'], axis=1).groupby(level=0).any()
out = df[msk]
Output:
id sentences ind tar
0 0 In samples of depression injected intraneously... depression albumin
2 1 In samples of depression injected intraneously... depression hip
4 2 The GVH kinetics and cellular characteristics ... GVH,GVH,GVH,GVH... PFC
5 2 Effects on PFCgeneword responses to thymus-dep... GVH,GVH,GVH,GVH... PFC
6 2 The unresponsive state which developed in GVHg... GVH,GVH,GVH,GVH... PFC
7 2 Furthermore, GVHgeneword spleen cells suppress... GVH,GVH,GVH,GVH... PFC
9 2 The delayed transfer of GVHgeneword cells to i... GVH,GVH,GVH,GVH... PFC
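For anyone who wants to see the intermediate steps, here is a small runnable sketch of the same concat / split / explode / groupby pipeline. The three-row frame and the column name term are invented for illustration. It relies on concat keeping the original row labels, so each label appears twice and groupby(level=0) can fold the per-term results back into one boolean per original row.

```python
import pandas as pd

# Hypothetical sample data with the question's column names.
df = pd.DataFrame({
    "id": [0, 1, 2],
    "sentences": [
        "depression was observed",
        "GVH kinetics were measured",
        "no match here",
    ],
    "ind": ["depression", "GVH,GVH", "GVH"],
    "tar": ["albumin", "PFC", "hip"],
})

# Stack ind and tar into a single column; row labels 0..2 now appear twice.
stacked = pd.concat([
    df[["sentences", "ind"]].rename(columns={"ind": "term"}),
    df[["sentences", "tar"]].rename(columns={"tar": "term"}),
])

# Turn each comma-separated string into a list, then one row per term.
stacked["term"] = stacked["term"].str.split(",")
exploded = stacked.explode("term")

# Test every single term against its sentence.
hit = exploded.apply(lambda r: r["term"] in r["sentences"], axis=1)

# Collapse back to one boolean per original row label.
msk = hit.groupby(level=0).any()
out = df[msk]
```

Because msk is indexed by the original row labels, df[msk] aligns correctly even though the intermediate frame has more rows than df.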
CodePudding user response:
Are the comma-separated terms in ind always duplicates of each other?
If they are, you can try the following:
df['check_ind'] = df.apply(lambda x: x.ind.split(',')[0] in x.sentences, axis=1)
This searches only for the first term, i.e. the text before the first comma.
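If you cannot guarantee that the terms are always duplicates, one option is to de-duplicate them instead of taking only the first one. This is just a sketch; the helper names unique_terms and sentence_has_term are made up for illustration.

```python
def unique_terms(value):
    """Split a comma-separated string into distinct, stripped, non-empty terms."""
    stripped = (t.strip() for t in value.split(","))
    # dict.fromkeys drops repeats while keeping first-seen order.
    return list(dict.fromkeys(t for t in stripped if t))


def sentence_has_term(value, sentence):
    """Return True if any distinct term from value occurs in the sentence."""
    return any(term in sentence for term in unique_terms(value))
```

With a DataFrame like the one in the question, this could be used as df.apply(lambda x: sentence_has_term(x.ind, x.sentences), axis=1). Each distinct term is then checked once, and a value with mixed terms is still handled.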
CodePudding user response:
You can define a function that checks both columns and then pass it to apply(). The function can also split the values in each of these rows, assuming , is never used inside a term and all lists are in this exact notation without spaces.
import pandas
def sent_contains_ind_or_tar(row):
    # Split the comma-separated strings so each term is checked on its own.
    return (any(ind in row["sentences"] for ind in row["ind"].split(","))
            or any(tar in row["sentences"] for tar in row["tar"].split(",")))
df = df[df.apply(sent_contains_ind_or_tar, axis=1)]
For example:
df = pandas.DataFrame([[1, "abc", "u", "v"],
[2, "xyz", "x", "z"],
[3, "xya", "x", "z"]],
columns=["id", "sentences", "ind", "tar"])
print(df)
> id sentences ind tar
0 1 abc u v
1 2 xyz x z
2 3 xya x z
def sent_contains_ind_or_tar(row):
    return row["ind"] in row["sentences"] or row["tar"] in row["sentences"]
df = df[df.apply(sent_contains_ind_or_tar, axis=1)]
print(df)
> id sentences ind tar
1 2 xyz x z
2 3 xya x z
Edit: Added list case to method