Home > other >  Convert unstructured CSV with filename rows
Convert unstructured CSV with filename rows

Time:04-23

I'm working with a system that outputs non-standard CSV files. Row 1 always contains the filename, followed by an attribute for the table in row 2 (which sometimes include a comma), table headers in row 3, and then a varying number of data rows. After the data rows, there are always two blank lines and the pattern repeats (the headers are always the same within a file). Here is a small example:

Example Report
Geography:Boston, MA
Time,Product,Unit Sales
Week Ending 03-06-22,ITEM DESCRIPTION A,275
Week Ending 03-13-22,ITEM DESCRIPTION A,297
Week Ending 03-20-22,ITEM DESCRIPTION A,261


Example Report
Geography:New York, NY
Time,Product,Unit Sales
Week Ending 03-06-22,ITEM DESCRIPTION A,393
Week Ending 03-13-22,ITEM DESCRIPTION A,477
Week Ending 03-20-22,ITEM DESCRIPTION A,412


Example Report
Geography:Philadelphia, PA
Time,Product,Unit Sales
Week Ending 03-06-22,ITEM DESCRIPTION A,195
Week Ending 03-13-22,ITEM DESCRIPTION A,233
Week Ending 03-20-22,ITEM DESCRIPTION A,198

Ultimately, I want to discard the filename & extra header rows and output a standard CSV with the attribute as the first column. This is what the example above should look like:

Geography,Time,Product,Unit Sales
"Boston, MA",Week Ending 03-06-22,ITEM DESCRIPTION A,275
"Boston, MA",Week Ending 03-13-22,ITEM DESCRIPTION A,297
"Boston, MA",Week Ending 03-20-22,ITEM DESCRIPTION A,261
"New York, NY",Week Ending 03-06-22,ITEM DESCRIPTION A,393
"New York, NY",Week Ending 03-13-22,ITEM DESCRIPTION A,477
"New York, NY",Week Ending 03-20-22,ITEM DESCRIPTION A,412
"Philadelphia, PA",Week Ending 03-06-22,ITEM DESCRIPTION A,195
"Philadelphia, PA",Week Ending 03-13-22,ITEM DESCRIPTION A,233
"Philadelphia, PA",Week Ending 03-20-22,ITEM DESCRIPTION A,198

I'm used to manipulating standard CSV files in python, but this one is stumping me with the unstructured data mixed in.

CodePudding user response:

working solution that iterates over such CSV:

def read_strange_csv(filename):
    header_used = False
    with open(filename) as f:
        while True:
            line_filename = next(f).rstrip()
            line_attribute = next(f).rstrip()
            geography = line_attribute.split(':')[1]
            line_header = next(f).rstrip()
            if not header_used:
                yield f'Geography,{line_header}'
                header_used = True

            for line in f:
                line = line.rstrip()
                if not line:
                    break
                yield f'"{geography}",{line}'

            try:
                next(f)  # empty line
            except StopIteration:
                return


for row in read_strange_csv('example.csv'):
    print(row)

it prints out below line which you can save directly to a file if you need:

Geography,Time,Product,Unit Sales
"Boston, MA",Week Ending 03-06-22,ITEM DESCRIPTION A,275
"Boston, MA",Week Ending 03-13-22,ITEM DESCRIPTION A,297
"Boston, MA",Week Ending 03-20-22,ITEM DESCRIPTION A,261
"New York, NY",Week Ending 03-06-22,ITEM DESCRIPTION A,393
"New York, NY",Week Ending 03-13-22,ITEM DESCRIPTION A,477
"New York, NY",Week Ending 03-20-22,ITEM DESCRIPTION A,412
"Philadelphia, PA",Week Ending 03-06-22,ITEM DESCRIPTION A,195
"Philadelphia, PA",Week Ending 03-13-22,ITEM DESCRIPTION A,233
"Philadelphia, PA",Week Ending 03-20-22,ITEM DESCRIPTION A,198

CodePudding user response:

the pandas.read_csv has a parameter skip_blank_lines=True by default. For the other things, I'll process them in pandas.

df1 = pd.read_csv('filename', skiprows=2, skipfooter=16)
df1[Geography']='Boston'

df2 = pd.read_csv('filename', skiprows=9, skipfooter=8)
df2[Geography']='Boston'

df3 = pd.read_csv('filename', skiprows=2, skipfooter=1) #adjust those, they might have errors
df3[Geography']='Boston'

df=pd.concat(df1,df2,df3)

I know it's hard to do this on multiple data, but that's the best solution that I can think. Good Luck with solving your problem!

  • Related