How to get rid of duplicate rows in python .csv

I am trying to make a sorting system.

I have the following values in a .csv file

Dan,20,30,15
Dan,15,20,20
Dan,17,11,10
Alex,10,10,10
Alex,11,20,30

For each name, only the last row (the name together with its values) should remain, and the earlier rows should be deleted. For example, only the following two rows should be written back to the .csv file, with everything else deleted:

Dan,17,11,10
Alex,11,20,30

It sounds so much easier than it actually is and I seriously need help with this sorting algorithm.

CodePudding user response：

You can collect your rows in a dictionary keyed by a chosen column (the name, in your case) and then write its contents back to the file.

import csv

unique_rows = {}

with open("data.csv", "r", newline="") as in_file:
    for row in csv.reader(in_file):
        unique_rows[row[0]] = row  # where 0 is the index of your key column

with open("data.csv", "w", newline="") as out_file:
    writer = csv.writer(out_file)
    writer.writerows(unique_rows.values())

On each duplicate, the later row overwrites the one already stored in the dictionary. If the key is not yet in the dictionary, the row is simply stored.
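
To see the overwrite behaviour without touching a file, here is a minimal sketch that runs the same dictionary logic on an in-memory copy of the sample data. The helper name keep_last_by_key and the use of io.StringIO in place of data.csv are assumptions made for illustration.

```python
import csv
import io

# Sample data from the question, held in memory instead of data.csv
SAMPLE = "Dan,20,30,15\nDan,15,20,20\nDan,17,11,10\nAlex,10,10,10\nAlex,11,20,30\n"

def keep_last_by_key(rows, key_index=0):
    """Return the last row seen for each key, in first-seen key order."""
    unique_rows = {}
    for row in rows:
        if not row:  # csv.reader yields [] for blank lines; skip them
            continue
        unique_rows[row[key_index]] = row  # later rows replace earlier ones
    return list(unique_rows.values())

rows = keep_last_by_key(csv.reader(io.StringIO(SAMPLE)))

# Write the result to an in-memory buffer to show what the file would contain
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
print(out.getvalue(), end="")
```

Dictionaries keep insertion order in Python 3.7 and later, so the names come out in the order they first appear (Dan, then Alex).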

CodePudding user response：

I think as @mozway suggested, you're looking for the groupby.last method. Here it is applied to your example:

import pandas as pd

df = pd.DataFrame(
    [
        ['Dan', 20, 30, 15],
        ['Dan', 15, 20, 20],
        ['Dan', 17, 11, 10],
        ['Alex', 10, 10, 10],
        ['Alex', 11, 20, 30],
    ],
    columns=['Name', 'A', 'B', 'C']
)
print(df.groupby('Name').last())

       A   B   C
Name            
Alex  11  20  30
Dan   17  11  10
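
Note that groupby sorts the group keys by default, which is why Alex is printed before Dan. If the original row order matters, one alternative is DataFrame.drop_duplicates with keep='last'. This is a sketch of a different technique from groupby.last, and the column names and in-memory CSV text are assumptions for illustration.

```python
import io
import pandas as pd

# Sample data from the question; the file has no header row,
# so the column names here are made up for the example
csv_text = "Dan,20,30,15\nDan,15,20,20\nDan,17,11,10\nAlex,10,10,10\nAlex,11,20,30\n"
df = pd.read_csv(io.StringIO(csv_text), header=None, names=["Name", "A", "B", "C"])

# Keep only the last row for each name; the surviving rows stay in file order
last_rows = df.drop_duplicates(subset="Name", keep="last")

# Serialize without index or header, matching the original file layout
print(last_rows.to_csv(index=False, header=False), end="")
```

Passing a file path to to_csv instead of printing would write the result to disk.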

CodePudding user response：

You need to read all the data and save it as a list of dicts. Then rebuild the list into a dict with the name as the key, so that every time the same name comes up, its value is reassigned.

example

name,v1,v2,v3,
Dan,20,30,50,
Dan,24,2,75,
Dan,25,78,23,
Alex,12,22,98,
Alex,33,12,32,

code:

import csv

# Read every row as a dict keyed by the header names
with open('csvFile.csv', newline='') as csv_file:
    data = list(csv.DictReader(csv_file, skipinitialspace=True))

# Later rows with the same name overwrite earlier ones
new_data = {}
for item in data:
    new_data[item['name']] = list(item.values())

final_data = list(new_data.values())

print(final_data)
# output (the trailing comma on each line produces the empty last field)
[['Dan', '25', '78', '23', ''], ['Alex', '33', '12', '32', '']]
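
The snippet above only prints the result. Writing it back as CSV could look like the following sketch, which uses csv.DictWriter. The in-memory buffers standing in for files are assumptions for illustration, and the sample here omits the trailing commas so that no empty column appears.

```python
import csv
import io

# In-memory stand-in for csvFile.csv (assumed header: name,v1,v2,v3)
source = io.StringIO(
    "name,v1,v2,v3\n"
    "Dan,20,30,50\n"
    "Dan,24,2,75\n"
    "Dan,25,78,23\n"
    "Alex,12,22,98\n"
    "Alex,33,12,32\n"
)

reader = csv.DictReader(source, skipinitialspace=True)

# Later rows with the same name replace earlier ones
last_by_name = {}
for row in reader:
    last_by_name[row["name"]] = row

# Write the header and the surviving rows to a second buffer
target = io.StringIO()
writer = csv.DictWriter(target, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(last_by_name.values())

print(target.getvalue(), end="")
```

With real files, the two buffers would be replaced by open(..., newline='') calls for reading and writing.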

CodePudding user response：

I assume that you already have the content of the CSV in a list. There is no need for external libraries: just use enumerate() to get the index, then use that index to look at the next row. Note that this approach only works when rows with the same name are next to each other.

This should do what you want:

def get_uniques():
    csv_reading = ['Dan,20,30,15', 'Dan,15,20,20', 'Dan,17,11,10', 'Alex,10,10,10', 'Alex,11,20,30']
    final_result = []
    for index, row in enumerate(csv_reading):
        name = row.split(',')[0]  # get the actual name e.g. 'Dan'
        if index < len(csv_reading) - 1:  # needed to avoid index errors
            next_name = csv_reading[index + 1].split(',')[0]  # name in the next row
            if name != next_name:  # the next row belongs to a different name
                final_result.append(row)
        else:
            final_result.append(row)  # always append the last row
    return final_result
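
The same idea, keeping the last row of each run of consecutive rows that share a name, can also be sketched with itertools.groupby from the standard library. This is a different technique from the enumerate() loop above, and like that loop it assumes rows with the same name are adjacent. The function name last_of_each_run is made up for the example.

```python
from itertools import groupby

def last_of_each_run(lines):
    """Keep the last line of every run of consecutive lines with the same name."""
    result = []
    # groupby only groups neighbouring items, so equal names must be adjacent
    for _name, run in groupby(lines, key=lambda line: line.split(",")[0]):
        *_, last = run  # exhaust the run and keep its final line
        result.append(last)
    return result

csv_reading = ['Dan,20,30,15', 'Dan,15,20,20', 'Dan,17,11,10', 'Alex,10,10,10', 'Alex,11,20,30']
print(last_of_each_run(csv_reading))
```

If the same name can reappear later in the file, a dictionary keyed by name is the safer choice, because it does not depend on row order.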

CodePudding user response：

The code below outputs a list of lists. In each inner list, the first element is the key and the second is the list of remaining values. It keeps the latest row for each key:

[["Dan", ["17", "11", "10"]], ["Alex", ["11", "20", "30"]]]

import json

# Open the file in read mode and split its content into lines
with open("file", "r") as file:
    lst = file.read().splitlines()

dic = {}

# Use the first element of each line as the key
# The remaining elements are the values for the key
for line in lst:
    if not line:  # skip blank lines
        continue
    line = line.split(",")
    key, value = line[0], line[1:]
    dic[key] = value

# Convert dict into a list of (key, values) pairs
pairs = list(dic.items())
# Serialize the list as a JSON string (tuples become nested lists)
result = json.dumps(pairs)

print(result)

CodePudding user response：

Just convert the column that must not contain duplicates to a dict and back to a list with this command:

 list(dict.fromkeys(objectid))
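
One caveat worth checking: dict.fromkeys only deduplicates the values of a single column, so on its own it does not tell you which full row to keep. A small sketch, assuming objectid is a plain list holding the name column (the contents here are made up for illustration):

```python
# Hypothetical column values; in the question this would be the name column
objectid = ["Dan", "Dan", "Dan", "Alex", "Alex"]

# dict keys are unique and keep insertion order, so duplicates collapse
unique_names = list(dict.fromkeys(objectid))
print(unique_names)

# To find which row index is the last one for each name,
# map every name to its index; later indexes overwrite earlier ones
last_index = {name: i for i, name in enumerate(objectid)}
print(last_index)
```

The indexes in last_index can then be used to pick the matching full rows from the original data.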