What method is best for accomplishing this Parsing task?(Python)

Time:03-03

I have a small parsing project I need to complete for work.

With a few suggestions I can finish it myself; I just need ideas on the approach and on how best to clean up what I have.

My task is to take a CSV file (makes.csv) with lines like the ones below. FYI, I added the header row myself.

Input:

yearStart, yearEnd, make, model
2016, 2020, lamborghini, Aventador
2016, 2020, Chevrolet, Malibu
2016, 2019, Chevrolet, Cruze
2014, 2018, Mazda, 3
2016, 2018, Mazda, CX3
2012, 2018, Mazda, CX5
2014, 2014, Mazda, SPEED3
2013, 2018, Hyundai, Santa Fe
2015, 2015, Hyundai, Genesis
2013, 2014, Cadillac, ATS
2013, 2015, Cadillac, XTS

I need to parse this file to get data back in the format below

Chevrolet
  Camaro (16-20)
  Malibu (16-20)

Cadillac
  ATS    (12-14)
  XTS    (13-15)
Hyundai etc etc 

Essentially, each make should be printed once, with its models listed below it along with their respective start and end years.

I need a memory jog or some pseudocode to work out the logic.

Currently I have

import csv

with open('makes.csv','r') as csv_file:
    csv_reader = csv.DictReader(csv_file, skipinitialspace=True)

    with open('newoutput.csv', 'w') as new_file:
        fieldnames = ['yearStart','yearEnd','make','model']

        csv_writer=csv.DictWriter(new_file, fieldnames=fieldnames, extrasaction='ignore',delimiter='\t')

        csv_writer.writeheader()
        count = 0
        
        for line in csv_reader:
            count += 1
            del line['yearStart']
            del line['yearEnd']
            del line['model']
            csv_writer.writerow(line)   

Using the csv file above I can get the below output

OUTPUT:

yearStart   yearEnd make    model

        lamborghini 

        Chevrolet   

        Chevrolet   

        Mazda   

        Mazda   

        Mazda   

        Mazda   

        Hyundai 

        Hyundai 

        Cadillac    

        Cadillac    

        Jeep    

        Lincoln 

        Lincoln 

        Kia 

So the question is: what is the best way to compare the strings so that each make is printed only once, with its models listed below it?

Do I need a data structure to compare the strings, or would a loop that tracks which makes have already been seen, and stops printing a make once it has appeared, be enough?

Some thoughts: I looked through some regex documentation and tutorials. Would that be helpful at all here?

Is a data structure needed to compare the strings, and if so, which one do you recommend?

What else am I missing?

CodePudding user response:

Read your file with pandas.read_csv, then iterate over the groups, then over the rows of each group.

Here is a dummy example that uses print, but you can write the lines to a file instead:

import pandas as pd

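# skipinitialspace=True strips the space after each comma in makes.csv
df = pd.read_csv('makes.csv', skipinitialspace=True)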

for k,g in df.groupby('make'):
    print(k)
    for _,r in g.iterrows():
        print(f'  {r["model"]} ({r["yearStart"]-2000:02d}-{r["yearEnd"]-2000:02d})')

output:

Cadillac
  ATS (13-14)
  XTS (13-15)
Chevrolet
  Malibu (16-20)
  Cruze (16-19)
Hyundai
  Santa Fe (13-18)
  Genesis (15-15)
Mazda
  3 (14-18)
  CX3 (16-18)
  CX5 (12-18)
  SPEED3 (14-14)
lamborghini
  Aventador (16-20)

CodePudding user response:

I suggest reading the input from the CSV file and transforming it into a data structure that more closely resembles your desired output. In this case a list of dictionaries might be appropriate. If you are unfamiliar with lists and dictionaries, now is a good time to learn about them.
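
As a rough sketch of that idea using only the standard library: csv.DictReader already yields each row as a dictionary, and the rows can then be collected under their make in a dictionary of lists. This is one possible reading of the suggestion, not the only one. The helper names group_by_make and format_report are made up for illustration, the sample rows are copied from the question, and sorting the makes case-insensitively is an assumption about the desired order.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the question. For the real file, pass
# open('makes.csv', newline='') to group_by_make instead of this StringIO.
SAMPLE = """yearStart, yearEnd, make, model
2016, 2020, lamborghini, Aventador
2016, 2020, Chevrolet, Malibu
2016, 2019, Chevrolet, Cruze
2013, 2014, Cadillac, ATS
2013, 2015, Cadillac, XTS
"""


def group_by_make(csv_file):
    """Return {make: [row, ...]}; each row is a dict produced by DictReader."""
    grouped = defaultdict(list)
    # skipinitialspace=True drops the space that follows each comma
    for row in csv.DictReader(csv_file, skipinitialspace=True):
        grouped[row['make']].append(row)
    return grouped


def format_report(grouped):
    """Build the output lines: each make once, its models indented below it."""
    lines = []
    # Assumption: makes are listed alphabetically, ignoring case
    for make in sorted(grouped, key=str.lower):
        lines.append(make)
        for row in grouped[make]:
            start = int(row['yearStart']) % 100  # 2016 -> 16
            end = int(row['yearEnd']) % 100
            lines.append(f"  {row['model']} ({start:02d}-{end:02d})")
    return lines


report = format_report(group_by_make(io.StringIO(SAMPLE)))
print('\n'.join(report))
```

Because the grouping happens before anything is printed, no string comparison or "already seen" counter is needed: each make is a dictionary key and so appears exactly once. To save the result, the same lines can be written to a file with '\n'.join(report).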
