I have a csv file which contains data like that
Sample csv
Name | Start | End |
---|---|---|
John | 12:00 | 13:00 |
John | 12:10 | 13:00 |
John | 12:20 | 13:20 |
Tom | 12:00 | 13:10 |
John | 13:50 | 14:00 |
Jerry | 14:00 | 14:30 |
Alice | 15:00 | 16:00 |
Jerry | 11:00 | 15:00 |
I need to find the average time taken by each people in python. How do i do that?
Sample output
Avg time taken by different people are :
John (60 50 60 10)/4 min Tom (70)/1 min Jerry (30 240)/2 min Alice (60)/1 min
I tried parsing the csv file by python csv
import datetime
import csv
with open('people.csv', 'r') as file:
reader = csv.DictReader(file)
for row in reader:
print(row['Start'],row['End'])
But i am unable to parse the column with the particular row name belongs to Jerry and find the difference in their time.
- Also Need to find which Person took maximum time
Any help will be appreciated.
CodePudding user response:
A simplified code below. You can write it in fewer lines and further in fewer lines by using pandas.
import csv
from datetime import datetime
avg = {}
with open('people.csv', mode='r') as csv_file:
csv_reader = csv.DictReader(csv_file)
for row in csv_reader:
name = row["Name"]
start_time = datetime.strptime(row["Start"], "%H:%M")
end_time = datetime.strptime(row["End"], "%H:%M")
time_taken = (end_time - start_time).total_seconds() / 60.0
if name not in avg.keys():
avg[name] = [time_taken,1]
else:
prev_time_total = avg[name][0]
prev_count = avg[name][1]
new_time_total = prev_time_total time_taken
new_count = prev_count 1
avg[name] = [new_time_total,new_count]
for entry in avg:
print(entry,avg[entry][0]/avg[entry][1])
CodePudding user response:
without Pandas:
times = {}
with open('people.csv', 'r') as file:
reader = csv.DictReader(file)
for row in reader:
if row['name'] not in times.keys():
times[row['name']] = []
times[row['name']].append(row['End'] - row['Start'])
for person in times.keys():
print(person ": " str(sum(times[person]) / len(times[person])))
CodePudding user response:
If you are willing to use pandas for this, the code below can do the job -
import pandas as pd
df = pd.read_csv("data.csv")
df = df.apply(pd.to_datetime, errors = "ignore")
time_df = df.iloc[:, 1:].diff(axis = 1).drop(columns = "Start").rename(columns = {"End" : "Time Diff"})
time_df["Name"] = df["Name"]
time_df.groupby("Name").mean()
Output -
Time Diff | |
---|---|
Alice | 0 days 01:00:00 |
Jerry | 0 days 02:15:00 |
John | 0 days 00:45:00 |
Tom | 0 days 01:10:00 |
Code Explanation -
The 3rd line in the code reads the csv file you have and converts it into a pandas dataframe. A dataframe is just a table and can be manipulated in the same way.
The 4th line in the code coverts all the values in valid columns to datetime which can help you in finding the time difference. I have passed the parameter
errors = "ignore"
as the first column in the dataframe, the columnName
, has string values that cannot be converted to datetime. Passing in theerrors
parameter asignore
would allow me to retain all the original values of that column.The 5th line of code selects the columns from index 1 onwards and substracts them. Once that's done the
drop
function gets implemented and the redundant column with null values. Once that's done therename
function kicks in and renames theEnd
column toTime Diff
. All this is stored in a new variable by the nametime_df
.Because
time_df
doens't have the name column in it, the 6th line adds it.Once I have the required dataframe, I just group the data based on the
Name
column, meaning all the data belonging to particular person would be worked on separately. This is ideal for us as we want to find the mean time taken by every person. To do that we just apply themean
function on the grouped data and voila we have the desired output.
CodePudding user response:
Here's another method without using pandas
-
from datetime import datetime, timedelta
with open("data.csv", "r") as f:
f = csv.DictReader(f)
data = [row for row in f]
diffs = {list(row.values())[0]: [] for row in data}
for row in data:
vals = list(row.values())
diffs[vals[0]].append(datetime.strptime(vals[2], "%H:%M") - datetime.strptime(vals[1], "%H:%M"))
diffs_avg = [str(timedelta(seconds = sum(map(timedelta.total_seconds, times)) / len(times))) for times in diffs.values()]
dict(zip(diffs.keys(), diffs_avg))
Output -
{'Alice': '1:00:00', 'Jerry': '2:15:00', 'John': '0:45:00', 'Tom': '1:10:00'}