I've been given a homework task to get data from a csv file without using Pandas. The info in the csv file contains headers such as...
- work year:
- experience level: EN Entry-level / Junior MI Mid-level / Inter- mediate SE Senior-level / Expert EX Executive-level / Director
- employment type: PT Part-time FT Full-time CT Contract FL Freelance
- job title:
- salary:
- salary currency:
- salaryinusd: The salary in USD
- employee residence: Employee’s primary country of residence
- remote ratio:
One of the questions is:
For each experience level, compute the average salary (over 3 years (2020/21/22)) for each job title?
The only way I've managed to do this is to iterate through the csv and add a load of 'if' statements according to the experience level and job title, but this is taking me forever.
Any ideas of how to tackle this differently? Not using any libraries/modules.
Example of my code:
with open('/Users/xxx/Desktop/ds_salaries.csv', 'r') as f:
csv_reader = f.readlines()
for row in csv_reader[1:]:
new_row = row.split(',')
experience_level = new_row[2]
job_title = new_row[4]
salary_in_usd = new_row[7]
if experience_level == 'EN' and job_title == 'AI Scientist':
en_ai_scientist = int(salary_in_usd)
count_en_ai_scientist = 1
avg_en_ai_scientist = en_ai_scientist / count_en_ai_scientist
print(avg_en_ai_scientist)
Data:
CodePudding user response:
When working out an example like this, I find it helpful to ask, "What data structure would make this easy question to answer?"
For example, the question asks
For each experience level, compute the average salary (over 3 years (2020/21/22)) for each job title?
To me, this implies that I want a dictionary keyed by a tuple of experience level and job title, with the salaries of every person who matches. Something like this:
data = {
("EN", "AI Scientist"): [1000, 2000, 3000],
("SE", "AI Scientist"): [2000, 3000, 4000],
}
The next question is: how do I get my data into that format? I would read the data in with csv.DictReader, and add each salary number into the structure.
data = {}
with open('input.csv', newline='') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
experience_level = row['first_name']
job_title = row['last_name']
key = experience_level, job_title
if key not in data:
# provide default value if no key exists
# look at collections.defaultdict if you want to see a better way to do this
data[key] = []
data[key].append(row['salary_in_usd'])
Now that you have your data organized, you can compute average salaries:
for (experience_level, job_title), salary_data in data:
print(experience_level, job_title, sum(salary_data)/len(salary_data))