I'm working with a system that outputs non-standard CSV files. Row 1 always contains the filename, followed by an attribute for the table in row 2 (which sometimes include a comma), table headers in row 3, and then a varying number of data rows. After the data rows, there are always two blank lines and the pattern repeats (the headers are always the same within a file). Here is a small example:
Example Report
Geography:Boston, MA
Time,Product,Unit Sales
Week Ending 03-06-22,ITEM DESCRIPTION A,275
Week Ending 03-13-22,ITEM DESCRIPTION A,297
Week Ending 03-20-22,ITEM DESCRIPTION A,261
Example Report
Geography:New York, NY
Time,Product,Unit Sales
Week Ending 03-06-22,ITEM DESCRIPTION A,393
Week Ending 03-13-22,ITEM DESCRIPTION A,477
Week Ending 03-20-22,ITEM DESCRIPTION A,412
Example Report
Geography:Philadelphia, PA
Time,Product,Unit Sales
Week Ending 03-06-22,ITEM DESCRIPTION A,195
Week Ending 03-13-22,ITEM DESCRIPTION A,233
Week Ending 03-20-22,ITEM DESCRIPTION A,198
Ultimately, I want to discard the filename & extra header rows and output a standard CSV with the attribute as the first column. This is what the example above should look like:
Geography,Time,Product,Unit Sales
"Boston, MA",Week Ending 03-06-22,ITEM DESCRIPTION A,275
"Boston, MA",Week Ending 03-13-22,ITEM DESCRIPTION A,297
"Boston, MA",Week Ending 03-20-22,ITEM DESCRIPTION A,261
"New York, NY",Week Ending 03-06-22,ITEM DESCRIPTION A,393
"New York, NY",Week Ending 03-13-22,ITEM DESCRIPTION A,477
"New York, NY",Week Ending 03-20-22,ITEM DESCRIPTION A,412
"Philadelphia, PA",Week Ending 03-06-22,ITEM DESCRIPTION A,195
"Philadelphia, PA",Week Ending 03-13-22,ITEM DESCRIPTION A,233
"Philadelphia, PA",Week Ending 03-20-22,ITEM DESCRIPTION A,198
I'm used to manipulating standard CSV files in python, but this one is stumping me with the unstructured data mixed in.
CodePudding user response:
working solution that iterates over such CSV:
def read_strange_csv(filename):
header_used = False
with open(filename) as f:
while True:
line_filename = next(f).rstrip()
line_attribute = next(f).rstrip()
geography = line_attribute.split(':')[1]
line_header = next(f).rstrip()
if not header_used:
yield f'Geography,{line_header}'
header_used = True
for line in f:
line = line.rstrip()
if not line:
break
yield f'"{geography}",{line}'
try:
next(f) # empty line
except StopIteration:
return
for row in read_strange_csv('example.csv'):
print(row)
it prints out below line which you can save directly to a file if you need:
Geography,Time,Product,Unit Sales
"Boston, MA",Week Ending 03-06-22,ITEM DESCRIPTION A,275
"Boston, MA",Week Ending 03-13-22,ITEM DESCRIPTION A,297
"Boston, MA",Week Ending 03-20-22,ITEM DESCRIPTION A,261
"New York, NY",Week Ending 03-06-22,ITEM DESCRIPTION A,393
"New York, NY",Week Ending 03-13-22,ITEM DESCRIPTION A,477
"New York, NY",Week Ending 03-20-22,ITEM DESCRIPTION A,412
"Philadelphia, PA",Week Ending 03-06-22,ITEM DESCRIPTION A,195
"Philadelphia, PA",Week Ending 03-13-22,ITEM DESCRIPTION A,233
"Philadelphia, PA",Week Ending 03-20-22,ITEM DESCRIPTION A,198
CodePudding user response:
the pandas.read_csv has a parameter skip_blank_lines=True by default. For the other things, I'll process them in pandas.
df1 = pd.read_csv('filename', skiprows=2, skipfooter=16)
df1[Geography']='Boston'
df2 = pd.read_csv('filename', skiprows=9, skipfooter=8)
df2[Geography']='Boston'
df3 = pd.read_csv('filename', skiprows=2, skipfooter=1) #adjust those, they might have errors
df3[Geography']='Boston'
df=pd.concat(df1,df2,df3)
I know it's hard to do this on multiple data, but that's the best solution that I can think. Good Luck with solving your problem!