I have a real estate data CSV file. There is a lot of repeating information in the rows, like in the example below:
Row 1:
Su baldais, Skalbimo mašina, **Viryklė**, **Indaplovė**, Vonia
Row 2:
Virtuvės komplektas, **Viryklė**, **Indaplovė**, Dušo kabina, Rekuperacinė sistema
As you can see, a lot of the data repeats (I marked it with stars). Is there a way to get only the unique values from all the rows with Python?
CodePudding user response:
It's not entirely clear what you want, so I will include two scenarios:
Your data as example.csv in the cwd:
Su baldais,Skalbimo mašina,Viryklė,Indaplovė,Vonia
Virtuvės komplektas,Viryklė,Indaplovė,Dušo kabina,Rekuperacinė sistema
Scenario 1
You want every value that appears in the csv, but do not want any value more than once. A perfect use case for a set, which will only store each value once.
#!/usr/bin/env python3
import csv

unique_values = set()

with open("example.csv") as handle:
    reader = csv.reader(handle)
    for row in reader:
        unique_values.update(row)

print(", ".join(unique_values))
Result:
Skalbimo mašina, Dušo kabina, Rekuperacinė sistema, Su baldais, Indaplovė, Virtuvės komplektas, Viryklė, Vonia
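Note that sets are unordered, so the order of the values above is arbitrary and may differ between runs. If you want the first-seen order preserved, a minimal variation (a sketch, not part of the code above) uses a dict via dict.fromkeys, which drops duplicates while keeping insertion order:
#!/usr/bin/env python3
import csv

unique_values = {}  # dict used as an ordered set

with open("example.csv") as handle:
    reader = csv.reader(handle)
    for row in reader:
        # dict.fromkeys keeps first-seen order and ignores repeated keys
        unique_values.update(dict.fromkeys(row))

print(", ".join(unique_values))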
Scenario 2
You want only the unique values from the csv, discarding any values that appear more than once.
#!/usr/bin/env python3
import csv

all_values = set()
to_delete = set()

with open("example.csv") as handle:
    reader = csv.reader(handle)
    for row in reader:
        for value in row:
            if value in all_values:
                to_delete.add(value)
            else:
                all_values.add(value)

print(", ".join(all_values - to_delete))
Here I use two sets: the second set, to_delete, collects every value we see more than once. Running all_values - to_delete leaves only the values that appeared exactly once.
Result:
Dušo kabina, Su baldais, Virtuvės komplektas, Skalbimo mašina, Vonia, Rekuperacinė sistema
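As an alternative, Scenario 2 can also be written with collections.Counter from the standard library. This is just a sketch of the same idea, not the code above: count every value, then keep the ones that occur exactly once.
#!/usr/bin/env python3
import csv
from collections import Counter

counts = Counter()

with open("example.csv") as handle:
    reader = csv.reader(handle)
    for row in reader:
        # Counter.update counts each value in the row
        counts.update(row)

# keep only the values that occurred exactly once across all rows
print(", ".join(value for value, count in counts.items() if count == 1))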