Get a list of rows starting with the same value as the current row in a pandas DataFrame

I have a DataFrame that I'd like to extend with a new column holding, for each row, the list of ids of all other rows whose string_value fully contains that row's string_value as its start.

id  string_value
1   The quick brown fox 
2   The quick brown fox jumps  
3   The quick brown fox jumps over 
4   The quick brown fox jumps over the lazy dog
5   The slow 
6   The slow brown fox

Desired output

id  string_value                                new_columns
1   The quick brown fox                         [2, 3, 4]
2   The quick brown fox jumps                   [3, 4]
3   The quick brown fox jumps over              [4]
4   The quick brown fox jumps over the lazy dog []
5   The slow                                    [6]
6   The slow brown fox                          []

Thanks

CodePudding user response：

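You can't easily vectorize this, but you can use a custom function that walks the strings in sorted order, where a string always comes after the shorter strings it starts with: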

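import pandas as pd

def accumulate(s):
    # s: string_value Series indexed by id, sorted by value.
    # Only one chain of consecutive values is tracked: a value that does
    # not start with the value just before it resets the chain.
    ref = None          # previous string
    prev = s.index[0]   # id of the previous string
    tmp = []            # ids of the prefixes in the current chain
    out = {}            # id -> ids of the rows it starts with
    for i, val in s.items():
        if ref is not None and val.startswith(ref):
            tmp.append(prev)
        else:
            tmp = []
        ref = val
        prev = i
        out[i] = tmp.copy()

    # invert dictionary: id -> ids of the rows that start with it
    out2 = {}
    for v, l in out.items():
        for k in l:
            out2.setdefault(k, []).append(v)

    return pd.Series(out2)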

df['new_columns'] = df['id'].map(accumulate(df.set_index('id')['string_value'].sort_values()))

output:

   id                                 string_value new_columns
0   1                          The quick brown fox   [2, 3, 4]
1   2                    The quick brown fox jumps      [3, 4]
2   3               The quick brown fox jumps over         [4]
3   4  The quick brown fox jumps over the lazy dog         NaN
4   5                                     The slow         [6]
5   6                           The slow brown fox         NaN
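Note that the function above follows a single chain of consecutive values, so it loses matches when the data branches (say "The slow", "The slow brown fox" and "The slow red fox"). One option for that case is to keep a stack of the prefixes that are still open. This is only a minimal sketch under the same assumptions (an id-indexed Series of strings); prefix_followers is a hypothetical name, not a pandas function, and the frame is copied from the question:

```python
import pandas as pd

def prefix_followers(s):
    """For each id, list the ids whose string starts with that id's string.

    s: Series of strings indexed by id (hypothetical helper, not a pandas API).
    """
    out = {i: [] for i in s.index}
    stack = []  # (id, string) pairs that are prefixes of the current value
    # In sorted order, all strings sharing a prefix sit next to each other.
    for i, val in s.sort_values().items():
        # Drop prefixes that the current value no longer extends.
        while stack and not val.startswith(stack[-1][1]):
            stack.pop()
        # Everything left on the stack is a prefix of val.
        for k, _ in stack:
            out[k].append(i)
        stack.append((i, val))
    return pd.Series(out)

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "string_value": [
        "The quick brown fox",
        "The quick brown fox jumps",
        "The quick brown fox jumps over",
        "The quick brown fox jumps over the lazy dog",
        "The slow",
        "The slow brown fox",
    ],
})
df["new_columns"] = df["id"].map(prefix_followers(df.set_index("id")["string_value"]))
```

Because every id starts with an empty list, rows without matches get [] rather than NaN.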

Empty lists

To get empty lists in the output instead of NaN, change the "invert dictionary" code to:

    # invert dictionary
    out2 = {i: [] for i in s.index}
    for v,l in out.items():
        for k in l:
            out2[k].append(v)

CodePudding user response：

Here's another custom function you can consider. Assuming df is this:

   id                                 string_value
0   1                          The quick brown fox
1   2                    The quick brown fox jumps
2   3               The quick brown fox jumps over
3   4  The quick brown fox jumps over the lazy dog
4   5                                     The slow
5   6                           The slow brown fox
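For anyone who wants to run the snippets, here is one way the frame above could be built. The values are copied from the question; the integer id column and default index are assumptions based on the printed output:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "string_value": [
        "The quick brown fox",
        "The quick brown fox jumps",
        "The quick brown fox jumps over",
        "The quick brown fox jumps over the lazy dog",
        "The slow",
        "The slow brown fox",
    ],
})
```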

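The custom function is:

def match_string(string_value):
    # Collect the ids of the rows of the global df whose string contains string_value
    idx_list = []
    for idx, strg in zip(df['id'], df['string_value']):
        # Skip the row itself (and any other row with identical text)
        if strg == string_value:
            continue
        # `in` is a substring test, so a match anywhere in strg counts
        if string_value in strg:
            idx_list.append(idx)
    return idx_list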

Then apply it to the column (no lambda wrapper is needed):

df['new_columns'] = df['string_value'].apply(match_string)
print(df)

   id                                 string_value new_columns
0   1                          The quick brown fox   [2, 3, 4]
1   2                    The quick brown fox jumps      [3, 4]
2   3               The quick brown fox jumps over         [4]
3   4  The quick brown fox jumps over the lazy dog          []
4   5                                     The slow         [6]
5   6                           The slow brown fox          []
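Note that `in` matches the value anywhere inside the other string, and the equality check also skips other rows that hold the same text. If only rows that start with the value should count, as the title asks, a variant could exclude the row by id and use str.startswith. This is a minimal sketch; match_prefix is a hypothetical name, and the three-row frame is made up purely to show the difference:

```python
import pandas as pd

def match_prefix(df, string_value, own_id):
    """Return the ids of the other rows whose string_value starts with string_value."""
    return [
        idx
        for idx, strg in zip(df["id"], df["string_value"])
        if idx != own_id and strg.startswith(string_value)
    ]

# Illustrative data: row 3 contains "brown fox" but does not start with it
df = pd.DataFrame({
    "id": [1, 2, 3],
    "string_value": ["brown fox", "brown fox jumps", "The quick brown fox"],
})
df["new_columns"] = [
    match_prefix(df, s, i) for i, s in zip(df["id"], df["string_value"])
]
print(df["new_columns"].tolist())  # [[2], [], []]
```

With the `in` test, row 1 would also pick up id 3.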

CodePudding user response：

Another way to do it using numpy:

import numpy as np

# mask[i, j] is True when string j starts with string i
mask = np.vstack([
    df["string_value"].str.startswith(s, na=False).to_numpy(dtype=bool)
    for s in df["string_value"]
])

# We don't want to match a string to itself,
# so mask out the diagonal
np.fill_diagonal(mask, False)

# Result: for each row, the ids of the rows that start with its string
ids = df["id"].to_numpy()
df["new_columns"] = [ids[m].tolist() for m in mask]