Convert unstructured CSV with filename rows-CodePudding

I'm working with a system that outputs non-standard CSV files. Row 1 always contains the filename, followed by an attribute for the table in row 2 (which sometimes include a comma), table headers in row 3, and then a varying number of data rows. After the data rows, there are always two blank lines and the pattern repeats (the headers are always the same within a file). Here is a small example:

Example Report
Geography:Boston, MA
Time,Product,Unit Sales
Week Ending 03-06-22,ITEM DESCRIPTION A,275
Week Ending 03-13-22,ITEM DESCRIPTION A,297
Week Ending 03-20-22,ITEM DESCRIPTION A,261


Example Report
Geography:New York, NY
Time,Product,Unit Sales
Week Ending 03-06-22,ITEM DESCRIPTION A,393
Week Ending 03-13-22,ITEM DESCRIPTION A,477
Week Ending 03-20-22,ITEM DESCRIPTION A,412


Example Report
Geography:Philadelphia, PA
Time,Product,Unit Sales
Week Ending 03-06-22,ITEM DESCRIPTION A,195
Week Ending 03-13-22,ITEM DESCRIPTION A,233
Week Ending 03-20-22,ITEM DESCRIPTION A,198

Ultimately, I want to discard the filename & extra header rows and output a standard CSV with the attribute as the first column. This is what the example above should look like:

Geography,Time,Product,Unit Sales
"Boston, MA",Week Ending 03-06-22,ITEM DESCRIPTION A,275
"Boston, MA",Week Ending 03-13-22,ITEM DESCRIPTION A,297
"Boston, MA",Week Ending 03-20-22,ITEM DESCRIPTION A,261
"New York, NY",Week Ending 03-06-22,ITEM DESCRIPTION A,393
"New York, NY",Week Ending 03-13-22,ITEM DESCRIPTION A,477
"New York, NY",Week Ending 03-20-22,ITEM DESCRIPTION A,412
"Philadelphia, PA",Week Ending 03-06-22,ITEM DESCRIPTION A,195
"Philadelphia, PA",Week Ending 03-13-22,ITEM DESCRIPTION A,233
"Philadelphia, PA",Week Ending 03-20-22,ITEM DESCRIPTION A,198

I'm used to manipulating standard CSV files in python, but this one is stumping me with the unstructured data mixed in.

CodePudding user response：

working solution that iterates over such CSV:

def read_strange_csv(filename):
    header_used = False
    with open(filename) as f:
        while True:
            line_filename = next(f).rstrip()
            line_attribute = next(f).rstrip()
            geography = line_attribute.split(':')[1]
            line_header = next(f).rstrip()
            if not header_used:
                yield f'Geography,{line_header}'
                header_used = True

            for line in f:
                line = line.rstrip()
                if not line:
                    break
                yield f'"{geography}",{line}'

            try:
                next(f)  # empty line
            except StopIteration:
                return


for row in read_strange_csv('example.csv'):
    print(row)

it prints out below line which you can save directly to a file if you need:

Geography,Time,Product,Unit Sales
"Boston, MA",Week Ending 03-06-22,ITEM DESCRIPTION A,275
"Boston, MA",Week Ending 03-13-22,ITEM DESCRIPTION A,297
"Boston, MA",Week Ending 03-20-22,ITEM DESCRIPTION A,261
"New York, NY",Week Ending 03-06-22,ITEM DESCRIPTION A,393
"New York, NY",Week Ending 03-13-22,ITEM DESCRIPTION A,477
"New York, NY",Week Ending 03-20-22,ITEM DESCRIPTION A,412
"Philadelphia, PA",Week Ending 03-06-22,ITEM DESCRIPTION A,195
"Philadelphia, PA",Week Ending 03-13-22,ITEM DESCRIPTION A,233
"Philadelphia, PA",Week Ending 03-20-22,ITEM DESCRIPTION A,198

CodePudding user response：

the pandas.read_csv has a parameter skip_blank_lines=True by default. For the other things, I'll process them in pandas.

df1 = pd.read_csv('filename', skiprows=2, skipfooter=16)
df1[Geography']='Boston'

df2 = pd.read_csv('filename', skiprows=9, skipfooter=8)
df2[Geography']='Boston'

df3 = pd.read_csv('filename', skiprows=2, skipfooter=1) #adjust those, they might have errors
df3[Geography']='Boston'

df=pd.concat(df1,df2,df3)

I know it's hard to do this on multiple data, but that's the best solution that I can think. Good Luck with solving your problem!