I have a small parsing project I need to complete for work.
With a few suggestions I can finish it myself; I just need some ideas on how to approach it and how best to clean up my code.
My task is to take a CSV file (makes.csv) with lines like the ones below. FYI, I added the headers myself.
Input:
yearStart, yearEnd, make, model
2016, 2020, lamborghini, Aventador
2016, 2020, Chevrolet, Malibu
2016, 2019, Chevrolet, Cruze
2014, 2018, Mazda, 3
2016, 2018, Mazda, CX3
2012, 2018, Mazda, CX5
2014, 2014, Mazda, SPEED3
2013, 2018, Hyundai, Santa Fe
2015, 2015, Hyundai, Genesis
2013, 2014, Cadillac, ATS
2013, 2015, Cadillac, XTS
I need to parse this file to get data back in the format below
Chevrolet
Camaro (16-20)
Malibu (16-20)
Cadillac
ATS (12-14)
XTS (13-15)
Hyundai etc etc
Essentially, every make needs to be printed once, with its models listed below it along with their respective start and end years.
I need a memory jog or some pseudocode to figure out how to do this logically.
Currently I have
import csv

with open('makes.csv', 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file, skipinitialspace=True)
    with open('newoutput.csv', 'w') as new_file:
        fieldnames = ['yearStart', 'yearEnd', 'make', 'model']
        csv_writer = csv.DictWriter(new_file, fieldnames=fieldnames, extrasaction='ignore', delimiter='\t')
        csv_writer.writeheader()
        count = 0
        for line in csv_reader:
            count = 1
            del line['yearStart']
            del line['yearEnd']
            del line['model']
            csv_writer.writerow(line)
Using the csv file above I can get the below output
OUTPUT:
yearStart yearEnd make model
lamborghini
Chevrolet
Chevrolet
Mazda
Mazda
Mazda
Mazda
Hyundai
Hyundai
Cadillac
Cadillac
Jeep
Lincoln
Lincoln
Kia
So the question is: what is the best way to compare the strings so that each make is printed only once, with its models listed below it?
Do I need a data structure to compare the strings, or would a loop that counts how many times a make has been seen, and stops printing it after the first time, be enough?
Some thoughts: I looked through some regex documentation and tutorials. Would that be helpful at all here?
Is a data structure needed to compare the strings, and if so, which one do you recommend?
What else am I missing?
CodePudding user response:
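Read your file with pandas.read_csv, then iterate over the groups, then over the rows of each group. Because the header and values have a space after each comma, pass skipinitialspace=True so the column names come out as 'make', 'model', and so on.
Here is a minimal example that uses print, but you could write the lines to a file instead: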
import pandas as pd

# df = pd.read_csv(...)
for k, g in df.groupby('make'):
    print(k)
    for _, r in g.iterrows():
        print(f' {r["model"]} ({r["yearStart"]-2000:02d}-{r["yearEnd"]-2000:02d})')
output:
Cadillac
 ATS (13-14)
 XTS (13-15)
Chevrolet
 Malibu (16-20)
 Cruze (16-19)
Hyundai
 Santa Fe (13-18)
 Genesis (15-15)
Mazda
 3 (14-18)
 CX3 (16-18)
 CX5 (12-18)
 SPEED3 (14-14)
lamborghini
 Aventador (16-20)
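For anyone who wants to try the groupby approach end to end, here is a self-contained sketch. It is only an illustration under a few assumptions: the data is read from an in-memory string (SAMPLE_CSV, a few rows copied from the question) rather than from makes.csv, and the helper name format_makes is made up for this example. It collects the lines in a list instead of printing them directly, so they can be printed or written to a file afterwards.

```python
import io

import pandas as pd

# A few rows copied from the question. The spaces after the commas are
# why skipinitialspace=True is passed to read_csv below.
SAMPLE_CSV = """yearStart, yearEnd, make, model
2016, 2020, Chevrolet, Malibu
2016, 2019, Chevrolet, Cruze
2014, 2018, Mazda, 3
2013, 2014, Cadillac, ATS
"""


def format_makes(df):
    """Return one line per make, each followed by one indented line per model.

    format_makes is a hypothetical helper name used only in this sketch.
    """
    lines = []
    for make, group in df.groupby("make"):
        lines.append(make)
        for row in group.itertuples(index=False):
            # % 100 keeps the last two digits of each year.
            start = int(row.yearStart) % 100
            end = int(row.yearEnd) % 100
            lines.append(f" {row.model} ({start:02d}-{end:02d})")
    return lines


df = pd.read_csv(io.StringIO(SAMPLE_CSV), skipinitialspace=True)
report = format_makes(df)
print("\n".join(report))
```

To save the result instead of printing it, the joined string can be written to a file opened in 'w' mode. Note that groupby sorts the makes by default, which is why a lowercase make such as lamborghini ends up after the capitalized ones.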
CodePudding user response:
I suggest taking the input from the CSV file and transforming it into a data structure that more closely resembles your desired output. In this case a list of dictionaries might be appropriate. If you are unfamiliar with lists and dictionaries, now is a good time to learn about them.
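One possible reading of this suggestion, using only the standard library, is a dictionary keyed by make whose values are lists of row dictionaries. The sketch below assumes that shape. The names SAMPLE_CSV, group_by_make and format_report are invented for the example, and the data is read from an in-memory string (a few rows from the question) so the snippet runs on its own; with the real file you would pass the object returned by open('makes.csv') instead.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the question, standing in for makes.csv.
SAMPLE_CSV = """yearStart, yearEnd, make, model
2016, 2020, Chevrolet, Malibu
2016, 2019, Chevrolet, Cruze
2014, 2018, Mazda, 3
2013, 2014, Cadillac, ATS
"""


def group_by_make(csv_file):
    """Map each make to the list of its row dictionaries, in file order."""
    models_by_make = defaultdict(list)
    # skipinitialspace=True drops the space that follows each comma.
    for row in csv.DictReader(csv_file, skipinitialspace=True):
        models_by_make[row["make"]].append(row)
    return models_by_make


def format_report(models_by_make):
    """Return the make lines, each followed by its indented model lines."""
    lines = []
    for make, rows in models_by_make.items():
        lines.append(make)
        for row in rows:
            # The years are strings here, so keep their last two characters.
            start = row["yearStart"][-2:]
            end = row["yearEnd"][-2:]
            lines.append(f" {row['model']} ({start}-{end})")
    return lines


grouped = group_by_make(io.StringIO(SAMPLE_CSV))
report_lines = format_report(grouped)
print("\n".join(report_lines))
```

Because each make is a dictionary key, it is stored only once no matter how many rows mention it, so no string comparison, counting loop, or regex is needed. The makes come out in the order they first appear in the file; wrapping the loop in sorted(models_by_make.items()) would order them alphabetically instead.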