How to transform this into a dataframe and save it as a csv?-CodePudding

This data was earlier provided as a .txt file. I converted it to .csv format and tried to sort it into the wanted form, but failed. I am trying to find ways to convert this data structure (as displayed below):

bakeryA
77300 Baker Street
bun: [10,20,30,10]
donut: [20,10,40,0]
bread: [0,10,15,10]
bakery B
78100 Cerabut St
data not available
bakery C
80300 Sulkeh St
bun: [29,50,20,30]
donut: [10,10,30,10]
bread: [10,15,10,20]

into this data frame:

Name	Address	type	salt	sugar	water	flour
Bakery A	77300 Baker Street	bun	10	20	30	10
Bakery A	77300 Baker Street	donut	20	10	40	0
Bakery A	77300 Baker Street	bread	0	10	15	10
Bakery B	78100 Cerabut St	Nan	Nan	Nan	Nan	Nan
Bakery C	80300 Sulkeh St	bun	29	50	20	30
Bakery C	80300 Sulkeh St	donut	10	10	30	10
Bakery C	80300 Sulkeh St	bread	10	15	10	20

Thank you!

CodePudding user response：

This has very little to do with pandas and more with parsing non-structured source into structured data. Try this:

from ast import literal_eval
from enum import IntEnum

class LineType(IntEnum):
    BakeryName = 1
    Address = 2
    Ingredients = 3

data = []
with open('data.txt') as fp:
    line_type = LineType.BakeryName
    for line in fp:
        line = line.strip()

        if line_type == LineType.BakeryName:
            name = line # the current line contains the Bakery Name
            line_type = LineType.Address # the next line is the Bakery Address
        elif line_type == LineType.Address:
            address = line # the current line contains the Bakery Address
            line_type = LineType.Ingredients # the next line contains the Ingredients
        elif line_type == LineType.Ingredients and line == 'data not available':
            data.append({
                'Name': name,
                'Address': address
            }) # no Ingredients info available
            line_type = LineType.BakeryName # next line is Bakery Name
        elif line_type == LineType.Ingredients:
            # if the line does not follow the ingredient's format, we
            # overstepped into the Bakery Name line. Then the next line
            # is Bakery Address
            try:
                bakery_type, ingredients = line.split(':')
                ingredients = literal_eval(ingredients.strip())
                data.append({
                    'Name': name,
                    'Address': address,
                    'type': bakery_type,
                    'salt': ingredients[0],
                    'sugar': ingredients[1],
                    'water': ingredients[2],
                    'flour': ingredients[3],
                })
            except:
                name = line
                line_type = LineType.Address

df = pd.DataFrame(data)

This assume that your data file is in the format shown. A slight deviation (blank lines, for example) will throw it off.