Find string matching among columns-CodePudding

I have a dataframe that looks like this:

id      sentences                                           ind                 tar
0       In samples of depression injected intraneously...   depression        albumin
0       Monomethylmethacrylate in whole blood was asso...   depression        albumin
1       In samples of depression injected intraneously...   depression          hip
1       Monomethylmethacrylate in whole blood was asso...   depression          hip
2       The GVH kinetics and cellular characteristics ...   GVH,GVH,GVH,GVH...  PFC
2       Effects on PFCgeneword responses to thymus-dep...   GVH,GVH,GVH,GVH...  PFC
2       The unresponsive state which developed in GVHg...   GVH,GVH,GVH,GVH...  PFC
2       Furthermore, GVHgeneword spleen cells suppress...   GVH,GVH,GVH,GVH...  PFC
2       This active suppressor effect was found to be ...   GVH,GVH,GVH,GVH...  PFC
2       The delayed transfer of GVHgeneword cells to i...   GVH,GVH,GVH,GVH...  PFC

I want to keep only the rows that have either an ind or a tar value in the corresponding sentence.

The problem is that when I have more than one elements in either ind or tar, even if one of those elements exists on sentence, it doesn't match it, because it uses the whole string as a term. For example, at the 5th row, even though the word GVH exists in the sentence, it uses as ind the whole value GVH,GVH,GVH,GVH and not each GVH term separately. Can someone help how to fix this issue? Here's my code so far :

df['check_ind'] = df.apply(lambda x: x.ind in x.sentences, axis=1)
df['check_tar'] = df.apply(lambda x: x.tar in x.sentences, axis=1)
df = df.loc[(df['check_ind'] == True) | (df['check_tar'] == True)]

print(df.sentences.iloc[4], '\n')

print(df.indications.iloc[4], '\n')

print(df.targets.iloc[4], '\n')

print(df.check_ind.iloc[4], '\n')

print(df.check_tar.iloc[4], '\n')


>>>> The GVH kinetics and cellular characteristics indicated that suppressor T cells exert an anti-mitotic influence on antigen-stimulated B-cell proliferation. . 

>>>> GVH,GVH,GVH,GVH,GVH,GVH 

>>>> PFC 

>>>> False (This should return TRUE since GVH is in the sentence)

>>>> False

CodePudding user response：

Your code is currently treating x.ind as if it were a simple value.

Conceptually x.ind is not a single value, but rather a comma-separated list of values.

In python, you can transform a comma-separated list into an actual python list using x.split(','). In addition, str.strip() is useful to remove possible spaces (for instance, if you have "GVH ,GVH ", the spaces should probably be ignored).

Finally, builtin function any and all are convenient to broadcast a condition to a list.

df['check_ind'] = df.apply(lambda x: any(v.strip() in x.sentences for v in x.split(',')), axis=1)

CodePudding user response：

You could first concat "ind" and "tar" columns so that you could do only one evaluation.

Then use str.split explode apply an evaluator to check if any "ind" or "tar" exist. Then groupby any to get back into original shape:

new_df = pd.concat((df[['id','sentences','ind']], df[['id','sentences','tar']].rename(columns={'tar':'ind'})))
new_df['ind'] = new_df['ind'].str.split(',')
msk = new_df.explode('ind').apply(lambda x: x['ind'] in x['sentences'], axis=1).groupby(level=0).any()
out = df[msk]

Output:

   id                                          sentences                 ind      tar  
0   0  In samples of depression injected intraneously...          depression  albumin  
2   1  In samples of depression injected intraneously...          depression      hip  
4   2  The GVH kinetics and cellular characteristics ...  GVH,GVH,GVH,GVH...      PFC  
5   2  Effects on PFCgeneword responses to thymus-dep...  GVH,GVH,GVH,GVH...      PFC  
6   2  The unresponsive state which developed in GVHg...  GVH,GVH,GVH,GVH...      PFC  
7   2  Furthermore, GVHgeneword spleen cells suppress...  GVH,GVH,GVH,GVH...      PFC  
9   2  The delayed transfer of GVHgeneword cells to i...  GVH,GVH,GVH,GVH...      PFC

CodePudding user response：

Are the terms in ind that are comma separated always duplicates?

If they are you can try the following:

df['check_ind'] = df.apply(lambda x: x.ind.split(',')[0] in x.sentences, axis=1)

This searches for the first term before the comma.

CodePudding user response：

You can define a method which checks both and then use it in apply(). This method can also be used to split the values in each of these rows, assuming , is never used in text and all lists are in this exact notation without spaces.

import pandas

def sent_contains_ind_or_tar(row):
    return any(ind in row["sentences"] for ind in row["ind"]) or any(ind in row["sentences"] for ind in row["tar"])

df = df[df.apply(sent_contains_ind_or_tar, axis=1)]

For example:

df = pandas.DataFrame([[1, "abc", "u", "v"],
                       [2, "xyz", "x", "z"],
                       [3, "xya", "x", "z"]],
                      columns=["id", "sentences", "ind", "tar"])
print(df)
>    id sentences ind tar
  0   1       abc   u   v
  1   2       xyz   x   z
  2   3       xya   x   z


def sent_contains_ind_or_tar(row):
    return row["ind"] in row["sentences"] or row["tar"] in row["sentences"]

df = df[df.apply(sent_contains_ind_or_tar, axis=1)]
print(df)
>    id sentences ind tar
  1   2       xyz   x   z

Edit: Added list case to method