Grouping data from csv file

I have data in the following format in a CSV file:

col1       col2    col3    col4
----       ----    ----    ----
 a          xy      11      66
 a          xz      11      66
 b          zz      99      33
 b          za      99      33

I would like to add them to a list in the following format (the order of the inner elements does not matter):

 list1 = [["a",["xy","xz"],11,66],["b",["zz","za"],99,33]]

I've tried so far:

with open(assignment_data_set) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')

    for row in csv_reader:
        protein[row[0]].append(row[8])
        protein[row[0]].append(row[5])
        protein[row[0]].append(row[1])

Result:

['66', 'xy', '11', '66', 'xz', '11']

Column 2 is not grouped, and there are multiple entries for column 3.

Also tried:

with open(assignment_data_set) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')

    for row in csv_reader:
        protein[row[0]].append(row[8])
    
    for row in csv_reader:
        protein[row[0]].append(row[5])
        protein[row[0]].append(row[1])

Result:

['66', '66']

This only gives column 3, duplicated.

Also tried:

with open(assignment_data_set) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')

    for row in csv_reader:
        protein[row[0]] = [row[0],row[8],row[1]]
        protein[row[0]].append(row[5])

Result:

['a', '66', '11', 'xy']

The 'xz' entry from col2 in row 2 is missing; it should be ['a', '66', '11', ['xy','xz']]

How can I add the grouped 'col2' values to the 'col1' row and also add 'col3' and 'col4' to the same row?

CodePudding user response:

I see two issues with the code you've shared, and they both have to do with the protein data structure you've created.

For each new protein, you need to initialize that protein's name in your protein dict. The initialized value will be a list with an empty list and the two integers from col3 and col4:
```
protein[col1] = [[], col3, col4]
```
Then, in a separate step, append col2 to that inner list, which is the first element of the value stored for col1:
```
protein[col1][0].append(col2)
```

Here's how that looks so far:

import csv

protein_builder = {}  # {key: [[str], int, int]}

with open("input.csv", newline="") as f:
    reader = csv.reader(f)

    for row in reader:
        col1 = row[0]
        col2 = row[1]
        col3 = int(row[2])  # convert to int
        col4 = int(row[3])

        if col1 not in protein_builder:
            protein_builder[col1] = [[], col3, col4]

        protein_builder[col1][0].append(col2)

print(protein_builder)

I ran that on your sample data and got:

{'a': [['xy', 'xz'], 11, 66], 'b': [['zz', 'za'], 99, 33]}

This isn't the final data structure you're looking for. We need one more step to take each protein name (the key in the dict) and join with its value into a single list, then append that to a list of all protein name/values:

protein_final = []

for k, v in protein_builder.items():
    final_row = [k] + v  # v is [[str], int, int]
    protein_final.append(final_row)

print(protein_final)

I ran that and got:

[['a', ['xy', 'xz'], 11, 66], ['b', ['zz', 'za'], 99, 33]]
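
If you'd rather skip the intermediate dict-to-list conversion, the two steps above can be folded into one pass. This is only a sketch, not the one right way to do it: `group_rows` and `SAMPLE` are names I made up for illustration, and it assumes a comma-separated file with exactly four columns and no header row. It uses `dict.setdefault` to create each group's entry the first time its col1 value is seen, and, like the code above, it keeps col3 and col4 from the first row of each group.

```python
import csv
import io

# Hypothetical in-memory stand-in for the CSV file; assumes comma-separated
# values, exactly four columns per row, and no header row.
SAMPLE = "a,xy,11,66\na,xz,11,66\nb,zz,99,33\nb,za,99,33\n"


def group_rows(lines):
    """Group col2 values by col1, keeping col3/col4 from the first row seen."""
    grouped = {}  # {col1: [col1, [col2, ...], col3, col4]}
    for col1, col2, col3, col4 in csv.reader(lines):
        # setdefault stores the new entry only the first time col1 is seen;
        # on later rows it returns the entry that is already there.
        entry = grouped.setdefault(col1, [col1, [], int(col3), int(col4)])
        entry[1].append(col2)
    # Dicts preserve insertion order (Python 3.7+), so groups keep file order.
    return list(grouped.values())


list1 = group_rows(io.StringIO(SAMPLE))
print(list1)
# [['a', ['xy', 'xz'], 11, 66], ['b', ['zz', 'za'], 99, 33]]
```

To read from a real file, pass the open file object instead of the `io.StringIO` wrapper, e.g. `group_rows(open("input.csv", newline=""))`, ideally inside a `with` block.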