Home > Blockchain >  How to tackle csv files in Python without Pandas
How to tackle csv files in Python without Pandas

Time:09-14

I've been given a homework task to get data from a csv file without using Pandas. The info in the csv file contains headers such as...

  1. work year:
  2. experience level: EN Entry-level / Junior MI Mid-level / Inter- mediate SE Senior-level / Expert EX Executive-level / Director
  3. employment type: PT Part-time FT Full-time CT Contract FL Freelance
  4. job title:
  5. salary:
  6. salary currency:
  7. salaryinusd: The salary in USD
  8. employee residence: Employee’s primary country of residence
  9. remote ratio:

One of the questions is:

For each experience level, compute the average salary (over 3 years (2020/21/22)) for each job title?

The only way I've managed to do this is to iterate through the csv and add a load of 'if' statements according to the experience level and job title, but this is taking me forever.

Any ideas of how to tackle this differently? Not using any libraries/modules.

Example of my code:


with open('/Users/xxx/Desktop/ds_salaries.csv', 'r') as f:
    csv_reader = f.readlines()
    for row in csv_reader[1:]:
        new_row = row.split(',')
        experience_level = new_row[2]
        job_title = new_row[4]
        salary_in_usd = new_row[7]

        if experience_level == 'EN' and job_title == 'AI Scientist':
            en_ai_scientist  = int(salary_in_usd)
            count_en_ai_scientist  = 1

avg_en_ai_scientist = en_ai_scientist / count_en_ai_scientist

print(avg_en_ai_scientist)

Data:

enter image description here

CodePudding user response:

When working out an example like this, I find it helpful to ask, "What data structure would make this easy question to answer?"

For example, the question asks

For each experience level, compute the average salary (over 3 years (2020/21/22)) for each job title?

To me, this implies that I want a dictionary keyed by a tuple of experience level and job title, with the salaries of every person who matches. Something like this:

data = {
    ("EN", "AI Scientist"): [1000, 2000, 3000],
    ("SE", "AI Scientist"): [2000, 3000, 4000],
}

The next question is: how do I get my data into that format? I would read the data in with csv.DictReader, and add each salary number into the structure.

data = {}
with open('input.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        experience_level = row['first_name']
        job_title = row['last_name']
        key = experience_level, job_title
        if key not in data:
            # provide default value if no key exists
            # look at collections.defaultdict if you want to see a better way to do this
            data[key] = []
        data[key].append(row['salary_in_usd'])

Now that you have your data organized, you can compute average salaries:

for (experience_level, job_title), salary_data in data:
    print(experience_level, job_title, sum(salary_data)/len(salary_data))
  • Related