I have a real estate data CSV file. There is a lot of repeating information in the rows, like in the example below:
Row 1:
Su baldais, Skalbimo mašina, **Viryklė**, **Indaplovė**, Vonia
Row 2:
Virtuvės komplektas, **Viryklė**, **Indaplovė**, Dušo kabina, Rekuperacinė sistema
As you can see, a lot of the data repeats (I marked it with stars). Is there a way to get only the unique values from all the rows with Python?
CodePudding user response:
It's not entirely clear what you want, so I will include two scenarios:
Your data as example.csv in the cwd:
Su baldais,Skalbimo mašina,Viryklė,Indaplovė,Vonia
Virtuvės komplektas,Viryklė,Indaplovė,Dušo kabina,Rekuperacinė sistema
Scenario 1
You want every value that appears in the csv, but do not want any value more than once. A perfect use case for a set, which will only store each value once.
#!/usr/bin/env python3
import csv

unique_values = set()

with open("example.csv") as handle:
    reader = csv.reader(handle)
    for row in reader:
        unique_values.update(row)

print(", ".join(unique_values))
Result:
Skalbimo mašina, Dušo kabina, Rekuperacinė sistema, Su baldais, Indaplovė, Virtuvės komplektas, Viryklė, Vonia
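Note that sets are unordered, so the order of the values above is arbitrary and may differ between runs. If you want the first-seen order preserved, a minimal variation (a sketch, not part of the code above) uses a dict via dict.fromkeys, which drops duplicates while keeping insertion order:
#!/usr/bin/env python3
import csv

unique_values = {}  # dict used as an ordered set

with open("example.csv") as handle:
    reader = csv.reader(handle)
    for row in reader:
        # dict.fromkeys keeps first-seen order and ignores repeated keys
        unique_values.update(dict.fromkeys(row))

print(", ".join(unique_values))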
Scenario 2
You want only the unique values from the csv, discarding any values that appear more than once.
#!/usr/bin/env python3
import csv

all_values = set()
to_delete = set()

with open("example.csv") as handle:
    reader = csv.reader(handle)
    for row in reader:
        for value in row:
            if value in all_values:
                to_delete.add(value)
            else:
                all_values.add(value)

print(", ".join(all_values - to_delete))
Here I use two sets: the second set, to_delete, collects every value we see more than once. Running all_values - to_delete leaves only the values that appeared exactly once.
Result:
Dušo kabina, Su baldais, Virtuvės komplektas, Skalbimo mašina, Vonia, Rekuperacinė sistema
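As an alternative, Scenario 2 can also be written with collections.Counter from the standard library. This is just a sketch of the same idea, not the code above: count every value, then keep the ones that occur exactly once.
#!/usr/bin/env python3
import csv
from collections import Counter

counts = Counter()

with open("example.csv") as handle:
    reader = csv.reader(handle)
    for row in reader:
        # Counter.update counts each value in the row
        counts.update(row)

# keep only the values that occurred exactly once across all rows
print(", ".join(value for value, count in counts.items() if count == 1))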