Restructure TSV to list of list of dicts

Time:08-16

A simplified look at my data right at parse:

[
{'id':'group1'},
{'id':'member1', 'parentId':'group1', 'size':51},
{'id':'member2', 'parentId':'group1', 'size':16},
{'id':'group2'},
{'id':'member1', 'parentId':'group2', 'size':21},
...
]

The desired output should be like this:

data =

[
[
{'id':'group1'},
{'id':'member1', 'parentId':'group1', 'size':51},
{'id':'member2', 'parentId':'group1', 'size':16}
],
[
{'id':'group2'},
{'id':'member1', 'parentId':'group2', 'size':21},
]

]

The difficulty is that the groups vary in length: some might have 10 members, some might have 3, so it is unclear where each sublist should begin and end. The rows are also not uniform: group rows have only an 'id' entry and no 'parentId' or 'size' entries.

master_data = []
for i in range(len(tsv_data)):
    temp = {}
    for j in range(?????):
        ???

How can Python handle arranging vanilla .tsv data into a list of lists as seen above?

I thought a sensible first step was to tally something simple before tackling the whole data set, so I attempted to count all occurrences of group1, based on this discussion:

group_counts = {}
for member in data:
    group = member.get('group1')
    try:
        group_counts[group] += 1
    except KeyError:
        group_counts[group] = 1

However, this returned:

'list' object has no attribute 'get'

This leads me to believe that counting text occurrences may not be the solution after all.
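
The error message is consistent with `data` already being the nested list of lists shown above: iterating over it yields lists, and a list has no `.get` method. Below is a minimal sketch of the count, assuming it runs on the flat list of dicts from the parse step (called `flat_rows` here, a made-up name) and that members are tallied by their 'parentId' value rather than by a 'group1' key:

```python
from collections import Counter

# Flat rows as they look right after parsing (sample taken from the question).
flat_rows = [
    {'id': 'group1'},
    {'id': 'member1', 'parentId': 'group1', 'size': 51},
    {'id': 'member2', 'parentId': 'group1', 'size': 16},
    {'id': 'group2'},
    {'id': 'member1', 'parentId': 'group2', 'size': 21},
]

# Count members per group; rows without 'parentId' (the group rows) are skipped.
group_counts = Counter(row['parentId'] for row in flat_rows if 'parentId' in row)

print(group_counts)  # Counter({'group1': 2, 'group2': 1})
```

`Counter` replaces the try/except bookkeeping, since a missing key starts at zero.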

Answer:

You could first collect all the groups into a new data structure, then add each item to its group:

data = [
    {
        'id': 'group1'
    }, {
        'id': 'member1',
        'parentId': 'group1',
        'size': 51
    }, {
        'id': 'member2',
        'parentId': 'group1',
        'size': 16
    }, {
        'id': 'group2'
    }, {
        'id': 'member1',
        'parentId': 'group2',
        'size': 21
    }, {
        'id': 'member3',
        'parentId': 'group1',
        'size': 16
    }
]

result = {}  # Use a dict for easier grouping.

# extract all groups
for dct in data:
    if 'group' in dct['id']:
        result[dct['id']] = [dct]

# extract all items and add to groups
for dct in data:
    if 'parentId' in dct:
        result[dct['parentId']].append(dct)

nestedListResult = list(result.values())

Out:

[
    [
        {
            'id': 'group1'
        }, {
            'id': 'member1',
            'parentId': 'group1',
            'size': 51
        }, {
            'id': 'member2',
            'parentId': 'group1',
            'size': 16
        }, {
            'id': 'member3',
            'parentId': 'group1',
            'size': 16
        }
    ], [{
        'id': 'group2'
    }, {
        'id': 'member1',
        'parentId': 'group2',
        'size': 21
    }]
]
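
If the rows always arrive in the order shown in the question, with each group row immediately followed by its members, a single pass may be enough. This is a minimal sketch under that assumption. It treats any row without a 'parentId' key as the start of a new group, and the function name `group_rows` is made up for illustration:

```python
def group_rows(rows):
    """Split a flat list of dicts into one list per group.

    Assumes a row without a 'parentId' key is a group row and that
    its members follow it directly in the input.
    """
    groups = []
    for row in rows:
        if 'parentId' not in row or not groups:
            # A group row (or a stray member before any group) opens a new sublist.
            groups.append([row])
        else:
            # A member row joins the most recently opened group.
            groups[-1].append(row)
    return groups


rows = [
    {'id': 'group1'},
    {'id': 'member1', 'parentId': 'group1', 'size': 51},
    {'id': 'member2', 'parentId': 'group1', 'size': 16},
    {'id': 'group2'},
    {'id': 'member1', 'parentId': 'group2', 'size': 21},
]
nested = group_rows(rows)
```

Unlike the dict-based version, this sketch does not look at the value of 'parentId', so a member that appears after a different group's row (such as the out-of-order member3 in the sample data) would land in the wrong sublist. The two-pass approach is the safer choice when the order is not guaranteed.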