a better way of cleaning lists in python-CodePudding

I have some text extracted from beautiful soup, unfortunately all of the text is in one and I'm saving it down as a list.

example:

souplist = [
['   Date', '', '  Fri 30th Apr 2021', '', ' 60084096-1', 'Type', '', '  Staff Travel (Rail)', '', 'Description', '', '', '  Stratford International', 'Amount', '', '', '', '', '   £25.10 Paid  '],

['   Date', '', '  Tue 27th Apr 2021', '', ' 60084096-3', 'Type', '', '  Office Costs (Stationery & printing)', '', 'Description', '', '', '  AMAZON.CO.UK [***]', 'Amount', '', '', '', '', '   £42.98 Paid  '],
['   Date', '', '  Tue 1st Dec 2020', '', ' 90012371-0', 'Type', '', '  Office Costs (Rent)', '', '  Amount', '', '', '', '', '   £3,500.00 Paid  '],
['   Date', '', '  Wed 14th Oct 2020', '', ' 60064831-1', 'Type', '', '  Office Costs (Software & applications)', '', 'Description', '', '', '  MAILCHIMP', 'MISC', 'Amount', '', '', '', '', '   £38.13 Paid  ']
]

I wanted to create this into a data frame with columns, date, id, type, description, amount.

I tried to do a for loop something like:

claims= {'id':[],'date':[],'type':[],'description':[],'amount':[]}

for i in range(len(souplist)):
      
    claims['id'].append(newclaims[i][4])
    claims['date'].append(newclaims[i][2])
    claims['type'].append(newclaims[i][7])
    claims['description'].append(newclaims[i][12])
    claims['amount'].append(newclaims[i][18])

however in the larger dataset the position in the list changes and not all of the lists are the same length. I'm not really sure how to clean the list. please could you provide a better way?

CodePudding user response：

TL;DR use this:

souplist = [[val.strip() for val in lst if val] for lst in souplist]
claims = {'id':[],'date':[],'type':[],'description':[],'amount':[]}

for lst in souplist:
    for key in claims:
        try:
            claims[key].append(lst[lst.index(key.title()) 1])
        except ValueError:
            if key == 'id':
                claims[key].append(lst[2])
            else:
                claims[key].append(None)

Bonus points: If you're going to use pandas (an awesome library!) you probably want to convert to Datetime your date column, index on your id column, and convert your amount to a number:

df = pd.DataFrame(claims).set_index('id')
df.date = pd.to_datetime(df.date)
df.amount = df.amount.str.replace(r'£|( Paid)|\.|,', '', regex=True).astype(int)
# treat your amount as an integer to get exact maths; then /100 whenever printing

souplist = [[val.strip() for val in lst if val] for lst in souplist]

will remove any empty strings ('') strip whitespace off your non-empty strings, in each list contained in souplist.

It seems some of your intended fields are labelled; e.g. 'Date' is the element immediately prior to the value you want. You can use this to your advantage:

claims = {'id':[],'date':[],'type':[],'description':[],'amount':[]}

for lst in souplist:
    for key in claims:
        try:
            # using key.title() to translate e.g. 'amount' -> 'Amount'
            # get the field from lst immediately after the label matching key.title()
            claims[key].append(lst[lst.index(key.title()) 1])
        except ValueError:
            if key == 'id':
                # would be better to change code that generates `souplist`
                # so this doesn't break
                claims[key].append(lst[2])
            else:
                # the key doesn't exist as a label in the soup list
                # append None so that all claims key-lists are same length
                claims[key].append(None)

This gives:

>>> claims
{'id':          ['60084096-1',
                 '60084096-3',
                 '90012371-0',
                 '60064831-1',
                ],
 'date':        ['Fri 30th Apr 2021',
                 'Tue 27th Apr 2021',
                 'Tue 1st Dec 2020',
                 'Wed 14th Oct 2020',
                ],
 'type':        ['Staff Travel (Rail)',
                 'Office Costs (Stationery & printing)',
                 'Office Costs (Rent)',
                 'Office Costs (Software & applications)',
                ],
 'description': ['Stratford International',
                 'AMAZON.CO.UK [***]',
                 None,
                 'MAILCHIMP',
                ],
 'amount':      ['£25.10 Paid',
                 '£42.98 Paid',
                 '£3,500.00 Paid',
                 '£38.13 Paid',
                ],
}