I have a very large dataframe and I want to replace every element that does not contain a specific word with NaN (while keeping the first "id" column unchanged).
For example:
index  id   text1                                            text2
1      123  {"key": "living_space", "value": "01.04.2022"}   ...
2      124  {"key": "rooms", "value": "3"}                   ...
3      125  23                                               {"key": "rooms", ...
4      126  45                                               Apartment sold ...
I want to keep every element in the dataframe that contains the word "key" or "value" and replace everything else with NaN, so I would get a dataframe like:
index  id   text1                                            text2
1      123  {"key": "living_space", "value": "01.04.2022"}   ...
2      124  {"key": "rooms", "value": "3"}                   ...
3      125  NaN                                              {"key": "rooms", ...
4      126  NaN                                              NaN ...
I have tried the following code, but it just clears the whole dataframe:
l1 = ['key', 'value']
df.iloc[:,1:] = df.iloc[:,1:].applymap(lambda x: x if set(x.split()).intersection(l1) else '')
Thanks in advance.
CodePudding user response:
Consider the following approach to solve the problem. It consists of two parts: (1) the logic that decides whether to keep or erase a value is implemented in the function substring_filter, which simply checks whether the target string contains any word from words; (2) the actual filtering is performed with np.where, a very convenient helper function from NumPy.
import numpy as np
import pandas as pd


def substring_filter(target, words):
    # Keep the value if it contains any of the given words.
    for word in words:
        if word in target:
            return True
    return False


if __name__ == '__main__':
    df = pd.DataFrame({
        'A': [1, 2, 3, 4],
        'B': [True, False, False, True],
        'C': ['{"key": 1}', '{"value": 2}', 'text', 'abc']})
    words_to_search = ('key', 'value')
    df.loc[:, 'C'] = np.where(
        df.loc[:, 'C'].apply(lambda x: substring_filter(x, words_to_search)),
        df.loc[:, 'C'],
        None)
    print(df)
Result is:
A B C
0 1 True {"key": 1}
1 2 False {"value": 2}
2 3 False None
3 4 True None
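Since the question asks to filter every column except "id" at once, the same idea can also be expressed without an explicit Python loop, using str.contains with a regex alternation and DataFrame.where. The sketch below uses a small made-up dataframe shaped like the one in the question (the column names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe mirroring the question's shape.
df = pd.DataFrame({
    'id': [123, 124, 125, 126],
    'text1': ['{"key": "living_space", "value": "01.04.2022"}',
              '{"key": "rooms", "value": "3"}',
              '23', '45'],
    'text2': ['{"key": "x"}', '{"value": 2}',
              '{"key": "rooms"}', 'Apartment sold'],
})

words = ('key', 'value')
pattern = '|'.join(words)  # regex alternation: match "key" OR "value"

# Build a boolean mask per text column; astype(str) guards against
# non-string cells such as the integers in the question's data.
mask = df.iloc[:, 1:].apply(lambda col: col.astype(str).str.contains(pattern))

# Keep values where the mask is True, replace the rest with NaN.
df.iloc[:, 1:] = df.iloc[:, 1:].where(mask, np.nan)
print(df)
```

Note that str.contains treats the pattern as a regular expression, so words containing regex metacharacters would need escaping (e.g. with re.escape).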