I have a very large dataframe and I want to replace every element that does not contain a specific word with NaN (while keeping the first "id" column unchanged).
For example:
index  id   text1                                            text2
1      123  {"key": "living_space", "value": "01.04.2022"}   ...
2      124  {"key": "rooms", "value": "3"}                   ...
3      125  23                                               {"key": "rooms", ...
4      126  45                                               Apartment sold ...
I want to keep every element in the dataframe that contains the word "key" or "value" and replace everything else with NaN, so I would get a dataframe like:
index  id   text1                                            text2
1      123  {"key": "living_space", "value": "01.04.2022"}   ...
2      124  {"key": "rooms", "value": "3"}                   ...
3      125  NaN                                              {"key": "rooms", ...
4      126  NaN                                              NaN ...
I have tried the following code, but it just clears the whole dataframe:
l1 = ['key', 'value']
df.iloc[:,1:] = df.iloc[:,1:].applymap(lambda x: x if set(x.split()).intersection(l1) else '')
Thanks in advance.
CodePudding user response:
Consider the following approach to solve the problem. It consists of two parts: (1) the logic that decides whether to keep or erase a value is implemented in the function substring_filter, which simply checks whether the target string contains any word from words; (2) the actual filtering is performed with np.where, a very convenient helper function from NumPy.
import numpy as np
import pandas as pd


def substring_filter(target, words):
    # Keep the value if it contains any of the given words.
    for word in words:
        if word in target:
            return True
    return False


if __name__ == '__main__':
    df = pd.DataFrame({
        'A': [1, 2, 3, 4],
        'B': [True, False, False, True],
        'C': ['{"key": 1}', '{"value": 2}', 'text', 'abc']})
    words_to_search = ('key', 'value')
    df.loc[:, 'C'] = np.where(
        df.loc[:, 'C'].apply(lambda x: substring_filter(x, words_to_search)),
        df.loc[:, 'C'],
        None)
    print(df)
Result is:
A B C
0 1 True {"key": 1}
1 2 False {"value": 2}
2 3 False None
3 4 True None
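Since the question asks to filter every column except "id" at once, the same idea can also be expressed without an explicit Python loop, using str.contains with a regex alternation and DataFrame.where. The sketch below uses a small made-up dataframe shaped like the one in the question (the column names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe mirroring the question's shape.
df = pd.DataFrame({
    'id': [123, 124, 125, 126],
    'text1': ['{"key": "living_space", "value": "01.04.2022"}',
              '{"key": "rooms", "value": "3"}',
              '23', '45'],
    'text2': ['{"key": "x"}', '{"value": 2}',
              '{"key": "rooms"}', 'Apartment sold'],
})

words = ('key', 'value')
pattern = '|'.join(words)  # regex alternation: match "key" OR "value"

# Build a boolean mask per text column; astype(str) guards against
# non-string cells such as the integers in the question's data.
mask = df.iloc[:, 1:].apply(lambda col: col.astype(str).str.contains(pattern))

# Keep values where the mask is True, replace the rest with NaN.
df.iloc[:, 1:] = df.iloc[:, 1:].where(mask, np.nan)
print(df)
```

Note that str.contains treats the pattern as a regular expression, so words containing regex metacharacters would need escaping (e.g. with re.escape).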