How to generate all possible word combinations between columns of a CSV file?

Time:09-27

I would like to take a CSV file with each column being a category of words and generate all possible combinations among them. Here is the simplified CSV file I am using to test the script (but the CSV that will be used will be larger, with a dozen or more columns):

$ cat file.csv 
Color,Pet,Action
black,dog,barks
brown,cat,runs
white,bird,flies
,hamster,
red,,swims

As you can see, some columns have more words than others (there could be more "colors" than "pets", or more "pets" than "actions", for example).

Here's what I have so far:

import csv
import itertools

with open('file.csv', newline='') as csvfile:
    next(csvfile, None) #skip header row
    data = list(csv.reader(csvfile))

for combination in itertools.product(*data):
    print(combination)

And here's an excerpt of the output I am getting:

$ python3 combiner.py 
('black', 'brown', 'white', '', 'red')
('black', 'brown', 'white', '', '')
('black', 'brown', 'white', '', 'swims')
('black', 'brown', 'white', 'hamster', 'red')
('black', 'brown', 'white', 'hamster', '')
('black', 'brown', 'white', 'hamster', 'swims')
('black', 'brown', 'white', '', 'red')
('black', 'brown', 'white', '', '')
('black', 'brown', 'white', '', 'swims')
('black', 'brown', 'bird', '', 'red')
('black', 'brown', 'bird', '', '')
[...]

What I would like to accomplish:

  • never have multiple items from the same category (column) in the same output line
  • remove the parentheses, quotes and commas (I believe I can accomplish that by converting the array to a string before printing)
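
For the second point, joining the tuple with str.join is usually enough. A minimal sketch, with a hard-coded tuple standing in for one item produced by itertools.product:

```python
# One combination as itertools.product would yield it
# (hard-coded here purely for illustration).
combination = ("black", "dog", "barks")

# str.join puts a single space between the words, with no quotes,
# commas or parentheses.
line = " ".join(combination)
print(line)
```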

So, to give an example of the output I am trying to get:

black
black dog
black dog barks
black dog runs
black dog flies
black dog swims
black cat
black cat barks
black cat runs
black cat flies
black cat swims
brown
brown dog
brown dog barks
[...]
black hamster
black hamster flies
[...]
red fish runs
[...]

If anyone has a suggestion on the most efficient way to accomplish this (or a specific library or approach to take), I would appreciate it greatly.

CodePudding user response:

The trick is to regroup the values by column (dropping empty cells) before passing them to itertools.product, so that each combination takes exactly one word from each category.
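
As a rough illustration of that regrouping, the sketch below transposes the rows with zip and filters out the empty cells. The rows are copied from the sample file.csv in the question and hard-coded so the snippet runs on its own. It assumes every row has the same number of fields, since zip stops at the shortest row.

```python
import itertools

# Rows as csv.reader would return them once the header is skipped
# (copied from the sample file.csv in the question).
rows = [
    ["black", "dog", "barks"],
    ["brown", "cat", "runs"],
    ["white", "bird", "flies"],
    ["", "hamster", ""],
    ["red", "", "swims"],
]

# zip(*rows) transposes rows into columns; the inner comprehension
# drops the empty cells of the shorter columns.
columns = [[value for value in column if value] for column in zip(*rows)]

# Each combination now takes exactly one word from each column.
combinations = list(itertools.product(*columns))
print(len(combinations))
print(" ".join(combinations[0]))
```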

To also print partial rows such as "black" and "black dog", which stop before the last column, store the first combination as a list and compare each later combination against it column by column. Whenever a column changes, update the list and print the words up to and including that column.

The solution below generalizes to any number of columns.

import csv
import itertools

with open("file.csv", "r", newline="", encoding="utf-8") as csvfile:
    reader = csv.reader(csvfile)
    header_row = next(reader)
    columns = [[] for _ in header_row]
    for row in reader:
        for i, value in enumerate(row):
            if value:
                columns[i].append(value)

product_iter = itertools.product(*columns)
current_combination = list(next(product_iter))
for i in range(len(current_combination)):
    print(" ".join(current_combination[:i + 1]))

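for combination in product_iter:
    changed = False
    for i in range(len(combination)):
        # Once one column changes, every longer prefix is new too, even if
        # the later words happen to match (e.g. a column with a single value).
        if changed or combination[i] != current_combination[i]:
            changed = True
            current_combination[i] = combination[i]
            print(" ".join(current_combination[:i + 1]))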

Output:

black
black dog
black dog barks
black dog runs
black dog flies
black dog swims
black cat
black cat barks
...
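
An alternative that avoids comparing against the previous combination is to build the prefixes directly with a small recursive generator. This is only a sketch: prefix_combinations is a hypothetical helper, not part of itertools, and the columns are hard-coded instead of read from the CSV. It walks the columns depth-first, so the lines come out in the same order as above, and it also copes with a column that holds a single word.

```python
def prefix_combinations(columns, prefix=()):
    """Yield every prefix depth-first: a word from the first column, then
    that word followed by each word of the next column, and so on."""
    if not columns:
        return
    for word in columns[0]:
        current = prefix + (word,)
        yield current
        # Extend the current prefix with the remaining columns.
        yield from prefix_combinations(columns[1:], current)


# Small hard-coded example; the middle column has a single value.
columns = [
    ["black", "brown"],
    ["dog"],
    ["barks", "runs"],
]

lines = [" ".join(combo) for combo in prefix_combinations(columns)]
for line in lines:
    print(line)
```

Because it is a generator, the prefixes are produced one at a time, which may matter when the real file has a dozen or more columns.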