I have a dataframe that I'd like to expand with a new column containing, for each row, the list of ids of all other rows whose string_value fully contains that row's string_value.
id string_value
1 The quick brown fox
2 The quick brown fox jumps
3 The quick brown fox jumps over
4 The quick brown fox jumps over the lazy dog
5 The slow
6 The slow brown fox
Desired output
id string_value new_columns
1 The quick brown fox [2, 3, 4]
2 The quick brown fox jumps [3, 4]
3 The quick brown fox jumps over [4]
4 The quick brown fox jumps over the lazy dog []
5 The slow [6]
6 The slow brown fox []
Thanks
CodePudding user response:
You can't easily vectorize this, but you can use a custom function:
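import pandas as pd
def accumulate(s):
    ref = None           # string seen on the previous iteration
    prev = s.index[0]    # id seen on the previous iteration
    out = {}
    for i, val in s.items():
        if ref and val.startswith(ref):
            # still in the same chain: the previous id is a prefix of this row
            tmp.append(prev)
        else:
            # chain broken: start a new list of prefix ids
            tmp = []
        ref = val
        prev = i
        out[i] = tmp.copy()
    # invert dictionary
    out2 = {}
    for v, l in out.items():
        for k in l:
            out2.setdefault(k, []).append(v)
    return pd.Series(out2)
# the strings must be sorted so that each prefix directly precedes its extensions
df['new_columns'] = df['id'].map(accumulate(df.set_index('id')['string_value'].sort_values()))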
output:
id string_value new_columns
0 1 The quick brown fox [2, 3, 4]
1 2 The quick brown fox jumps [3, 4]
2 3 The quick brown fox jumps over [4]
3 4 The quick brown fox jumps over the lazy dog NaN
4 5 The slow [6]
5 6 The slow brown fox NaN
Empty lists
To have empty lists in the output in place of NaN, change the "invert dictionary" code inside the function to:
    # invert dictionary
    out2 = {i: [] for i in s.index}
    for v, l in out.items():
        for k in l:
            out2[k].append(v)
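Putting the two snippets above together, a minimal self-contained sketch might look like the following. The sample frame is rebuilt from the question. Note that this approach compares each string only with its predecessor in sorted order, so it assumes the matches form a single chain of prefixes, as they do in the sample data.

```python
import pandas as pd

def accumulate(s):
    """Map each id to the ids of the rows that extend its string (chain of prefixes)."""
    ref = None   # string seen on the previous iteration
    prev = None  # id seen on the previous iteration
    tmp = []     # ids of the prefixes of the current row
    out = {}
    for i, val in s.items():
        if ref is not None and val.startswith(ref):
            tmp.append(prev)
        else:
            tmp = []
        ref = val
        prev = i
        out[i] = tmp.copy()
    # invert dictionary, starting from an empty list for every id
    out2 = {i: [] for i in s.index}
    for v, l in out.items():
        for k in l:
            out2[k].append(v)
    return pd.Series(out2)

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "string_value": [
        "The quick brown fox",
        "The quick brown fox jumps",
        "The quick brown fox jumps over",
        "The quick brown fox jumps over the lazy dog",
        "The slow",
        "The slow brown fox",
    ],
})

sorted_values = df.set_index("id")["string_value"].sort_values()
df["new_columns"] = df["id"].map(accumulate(sorted_values))
print(df)
```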
CodePudding user response:
Here's another custom function you can consider. Assuming df is this:
id string_value
0 1 The quick brown fox
1 2 The quick brown fox jumps
2 3 The quick brown fox jumps over
3 4 The quick brown fox jumps over the lazy dog
4 5 The slow
5 6 The slow brown fox
The custom function is:
def match_string(string_value):
    idx_list = []
    for idx, strg in zip(df['id'], df['string_value']):
        # skip the row itself (and any exact duplicates)
        if strg == string_value:
            continue
        # substring test: keep the id if this row's text contains string_value
        if string_value in strg:
            idx_list.append(idx)
    return idx_list
Then apply the function to the column:
df['new_columns'] = df['string_value'].apply(match_string)
print(df)
id string_value new_columns
0 1 The quick brown fox [2, 3, 4]
1 2 The quick brown fox jumps [3, 4]
2 3 The quick brown fox jumps over [4]
3 4 The quick brown fox jumps over the lazy dog []
4 5 The slow [6]
5 6 The slow brown fox []
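Note that this answer tests for a substring anywhere in the text, whereas the first answer only checks prefixes with startswith. The sketch below makes that difference explicit; match_ids and its prefix_only flag are hypothetical names chosen for illustration, and the two-row frame is made-up data.

```python
import pandas as pd

def match_ids(frame, prefix_only=False):
    """For each row, list the ids of the other rows whose text contains (or starts with) it."""
    pairs = list(zip(frame["id"], frame["string_value"]))
    result = []
    for _, needle in pairs:
        if prefix_only:
            hits = [i for i, text in pairs if text != needle and text.startswith(needle)]
        else:
            hits = [i for i, text in pairs if text != needle and needle in text]
        result.append(hits)
    return result

# Made-up rows: "brown fox" appears inside the second string, but not at its start
demo = pd.DataFrame({
    "id": [1, 2],
    "string_value": ["brown fox", "The quick brown fox"],
})

substring_matches = match_ids(demo)                 # substring semantics
prefix_matches = match_ids(demo, prefix_only=True)  # prefix semantics
print(substring_matches)
print(prefix_matches)
```

On the question's sample data both variants give the same lists, because every match there is also a prefix.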
CodePudding user response:
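Another way to do it using numpy:
import numpy as np
# Row i of the mask marks which strings contain string i
# (regex=False matches the text literally instead of as a regular expression)
mask = np.vstack([df["string_value"].str.contains(s, regex=False) for s in df["string_value"]])
# We don't want to match a string to itself,
# so mask out the diagonal
np.fill_diagonal(mask, False)
# Result: the ids selected by each row of the mask, as plain lists
ids = df["id"].to_numpy()
df["new_columns"] = [ids[m].tolist() for m in mask]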
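One thing to watch with str.contains is that it treats the pattern as a regular expression by default, so strings containing characters such as + or ( may not match themselves literally. The following is a small sketch with made-up values showing the effect of passing regex=False; containment_mask is a hypothetical helper name, not a pandas or numpy function.

```python
import numpy as np
import pandas as pd

# Made-up values containing a regex metacharacter ("+")
demo = pd.DataFrame({"id": [1, 2], "string_value": ["a+b", "a+b=c"]})

def containment_mask(values, regex):
    """Row i marks which entries of `values` contain entry i (diagonal excluded)."""
    mask = np.vstack([values.str.contains(s, regex=regex) for s in values])
    np.fill_diagonal(mask, False)
    return mask

ids = demo["id"].to_numpy()

# Pattern interpreted as a regular expression: "a+b" means "one or more a, then b"
as_regex = [ids[m].tolist() for m in containment_mask(demo["string_value"], regex=True)]
# Pattern interpreted literally
literal = [ids[m].tolist() for m in containment_mask(demo["string_value"], regex=False)]

print(as_regex)  # [[], []]
print(literal)   # [[2], []]
```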