Home > Enterprise >  pandas explode rows with all substring combinations
pandas explode rows with all substring combinations

Time:05-02

I'm trying to find all combinations of a substring found in a single column, and then explode the dataframe with all possible combinations of each word.

Example Dataframe

                     URL                Keyword
0  http://www.amazon.com  Amazon Lightning Sale
1   https://www.ebay.com        Shop eBay Today

Desired Output

                      URL                Keyword
0   http://www.amazon.com  Amazon Lightning Sale
1   http://www.amazon.com  Amazon Sale Lightning
2   http://www.amazon.com  Lightning Amazon Sale
3   http://www.amazon.com  Sale Amazon Lightning
4   http://www.amazon.com  Sale Lightning Amazon
5   http://www.amazon.com  Lightning Sale Amazon
6    https://www.ebay.com        Shop eBay Today
7    https://www.ebay.com        Shop Today eBay
8    https://www.ebay.com        eBay Shop Today
9    https://www.ebay.com        eBay Today Shop
10   https://www.ebay.com        Today eBay Shop
11   https://www.ebay.com        Today Shop eBay

Minimum Reproducable Example

import pandas as pd

# initialize data of lists.
data = {'URL': ['http://www.amazon.com', 'https://www.ebay.com'],
        'Keyword': ["Amazon Lightning Sale", "Shop eBay Today"]}

# Create DataFrame
df = pd.DataFrame(data)

# Print the output.
print(df)

I've tried the solution here: Pandas DataFrame Combinations and expand but it's not quite what I need.

CodePudding user response:

from itertools import permutations

df['Keyword'] = df['Keyword'].apply(lambda x: list(permutations(x.split())))
df.explode('Keyword', ignore_index=True)

First the itertools.permutations method applied to the Keyword column will create all possible combinations of keywords as a list.

Next you can use the pandas.DataFrame.explode function to create many items from the created lists.

If you really want a full string instead of a tuple of keywords, you can replace the list(...) part with a string join: [" ".join(t) for t in permutations(x.split())]

CodePudding user response:

  1. create an ID for each row

     #create an id:
     df['ID'] = range(df.shape[0])
    
                      URL                Keyword      ID
     0  http://www.amazon.com  Amazon Lightning Sale  0
     1   https://www.ebay.com        Shop eBay Today  1
    
  2. Create the combinations for each row as a new DataFrame, or dict or ...

    import re
    import itertools
    def create_combinations(id, kw):
    
        # split the keywords by word:
        words = re.split('\W ', kw)
    
        return pd.DataFrame([ {'ID':id, 'Combinations' : ' '.join(x) } for x in itertools.permutations(words)])
    
    
    # create the combinations
    data = []
    for id, kw in zip(df.ID, df.Keyword):
        data.append( create_combinations(id, kw) )
    
  3. Join

    # join the data
    pd.merge(df, pd.concat(data), on = ['ID'])
    
    
             URL     Keyword     ID  Combinations
        0    http://www.amazon.com   Amazon Lightning Sale   0   Amazon Lightning Sale
        1    http://www.amazon.com   Amazon Lightning Sale   0   Amazon Sale Lightning
        2    http://www.amazon.com   Amazon Lightning Sale   0   Lightning Amazon Sale
        3    http://www.amazon.com   Amazon Lightning Sale   0   Lightning Sale Amazon
        4    http://www.amazon.com   Amazon Lightning Sale   0   Sale Amazon Lightning
        5    http://www.amazon.com   Amazon Lightning Sale   0   Sale Lightning Amazon
        6    https://www.ebay.com    Shop eBay Today     1   Shop eBay Today
        7    https://www.ebay.com    Shop eBay Today     1   Shop Today eBay
        8    https://www.ebay.com    Shop eBay Today     1   eBay Shop Today
        9    https://www.ebay.com    Shop eBay Today     1   eBay Today Shop
        10   https://www.ebay.com    Shop eBay Today     1   Today Shop eBay
        11   https://www.ebay.com    Shop eBay Today     1   Today eBay Shop
    

Joost Döbken's answer is a bit more elegant

CodePudding user response:

Here is an alternative way without using itertools:

(df.assign(Keyword = df['Keyword'].str.split().map(lambda x: [[i,j,k] for i in x for j in x for k in x if len({i,j,k})==len(x)]))
 .explode('Keyword')
 .assign(Keyword = lambda x: x['Keyword'].str.join(' ')))

Output:

                     URL                Keyword
0  http://www.amazon.com  Amazon Lightning Sale
0  http://www.amazon.com  Amazon Sale Lightning
0  http://www.amazon.com  Lightning Amazon Sale
0  http://www.amazon.com  Lightning Sale Amazon
0  http://www.amazon.com  Sale Amazon Lightning
0  http://www.amazon.com  Sale Lightning Amazon
1   https://www.ebay.com        Shop eBay Today
1   https://www.ebay.com        Shop Today eBay
1   https://www.ebay.com        eBay Shop Today
1   https://www.ebay.com        eBay Today Shop
1   https://www.ebay.com        Today Shop eBay
1   https://www.ebay.com        Today eBay Shop
  • Related