I have a CSV file that contains data like this.
Sample CSV:
Name | Start | End |
---|---|---|
John | 12:00 | 13:00 |
John | 12:10 | 13:00 |
John | 12:20 | 13:20 |
Tom | 12:00 | 13:10 |
John | 13:50 | 14:00 |
Jerry | 14:00 | 14:30 |
Alice | 15:00 | 16:00 |
Jerry | 11:00 | 15:00 |
I need to perform a merging operation such that:
Before merge
- John [12:00,13:00],[12:10,13:00],[12:20,13:20],[13:50,14:00]
- Jerry [14:00,14:30],[11:00,15:00]
- Tom [12:00,13:10]
- Alice [15:00,16:00]
After merge
- John [12:00,13:20],[13:50,14:00]
- Jerry [11:00,15:00]
- Tom [12:00,13:10]
- Alice [15:00,16:00]
I am trying to use Python's csv module for that:
import csv

# collect the [start, end] pairs for each name
intervals = {}
with open('log.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        name = row["Name"]
        # append every interval instead of keeping only the first one
        intervals.setdefault(name, []).append([row["Start"], row["End"]])

print(intervals)
for name, spans in intervals.items():
    print(name, spans)
I am not sure how to approach the merging part of the problem.
CodePudding user response:
With pandas it could look like this:

import pandas as pd

# read the file and sort so each name's intervals are in start order
df = (pd.read_csv('file_01.csv', parse_dates=['Start', 'End'])
        .sort_values(['Name', 'Start']))

# start a new group whenever an interval begins after everything seen
# so far has ended; cummax() guards against an interval that is fully
# contained in an earlier, longer one
df['split'] = (df.groupby('Name')
                 .apply(lambda x: (x.Start > x.End.cummax().shift().bfill()).cumsum())
                 .droplevel(0))

# within each (name, group) pair take the earliest start and latest end
res = (df.groupby(['Name', 'split'])
         .apply(lambda x: [x.Start.min().strftime('%H-%M'), x.End.max().strftime('%H-%M')])
         .groupby('Name')
         .apply(list)
         .to_dict())
>>> res
{'Alice': [['15-00', '16-00']],
 'Jerry': [['11-00', '15-00']],
 'John': [['12-00', '13-20'], ['13-50', '14-00']],
 'Tom': [['12-00', '13-10']]}
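If you would rather stay with the standard-library csv module from the question, the same sort-and-merge idea can be sketched without pandas. This is a minimal sketch, not part of the original answer: the file name 'log.csv' comes from the question, and comparing 'HH:MM' strings only works because the hours in the sample are zero-padded.

import csv
from collections import defaultdict

# collect the [start, end] pairs for each name
intervals = defaultdict(list)
with open('log.csv') as f:
    for row in csv.DictReader(f):
        intervals[row['Name']].append([row['Start'], row['End']])

# sort each name's intervals by start, then either extend the current
# span or open a new one when a gap appears
merged = {}
for name, spans in intervals.items():
    spans.sort()
    out = [spans[0][:]]
    for start, end in spans[1:]:
        if start <= out[-1][1]:                # overlaps the current span: extend it
            out[-1][1] = max(out[-1][1], end)
        else:                                  # gap: start a new span
            out.append([start, end])
    merged[name] = out

print(merged)
# e.g. {'John': [['12:00', '13:20'], ['13:50', '14:00']], 'Tom': [['12:00', '13:10']], ...}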
CodePudding user response:
pandas is especially well suited for this type of data manipulation. Also, holding your data as a dataframe is going to pay off later if you decide to move on to more complicated analysis.
You can use groupby() to group by a column ('Name' in this case) and then calculate things like max or min values. groupby is like filter in most spreadsheet programs, except that it has a name that actually describes what it's doing (as you would expect from both Python and spreadsheets).
And because you can convert the time strings into an actual time format, you can be sure that max and min are calculated on times rather than on raw strings.
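A quick illustration of why that matters (my example, not from the original answer): unpadded time strings compare lexicographically, so min/max on the raw strings can give the wrong answer, while parsed times compare chronologically.

import pandas as pd

times = ['9:00', '10:00']
print(max(times))                   # '9:00'  -- string comparison, wrong
print(pd.to_datetime(times).max())  # 10:00 (on today's date) -- correct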
import pandas as pd

# Create a dataframe from a csv file
sample = pd.read_csv('Sample.csv', parse_dates=['Start', 'End'])

# Keep only the time component of the parsed dates
sample['Start'] = sample['Start'].dt.time
sample['End'] = sample['End'].dt.time
>>> sample
index | Name | Start | End |
---|---|---|---|
0 | John | 12:00:00 | 13:00:00 |
1 | John | 12:10:00 | 13:00:00 |
2 | John | 12:20:00 | 13:20:00 |
3 | Tom | 12:00:00 | 13:10:00 |
4 | John | 13:50:00 | 14:00:00 |
5 | Jerry | 14:00:00 | 14:30:00 |
6 | Alice | 15:00:00 | 16:00:00 |
7 | Jerry | 11:00:00 | 15:00:00 |
# Create a new dataframe
new_df = pd.DataFrame()

# Create new_df's 'Start' and 'End' columns by grouping the original
# dataframe by 'Name' and taking the min and max of each group
new_df['Start'] = sample.groupby('Name')['Start'].min()
new_df['End'] = sample.groupby('Name')['End'].max()
>>> new_df
Name | Start | End |
---|---|---|
Alice | 15:00:00 | 16:00:00 |
Jerry | 11:00:00 | 15:00:00 |
John | 12:00:00 | 14:00:00 |
Tom | 12:00:00 | 13:10:00 |
You can also create the new dataframe using the names as the index: new_df = pd.DataFrame(index=sample['Name'].unique()), which seems to preserve the original order (at least when I tried it).
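For completeness, a minimal sketch of that variant, reusing the sample dataframe from above (my illustration): unique() returns the names in order of first appearance, and pandas aligns the groupby results on the 'Name' index when assigning the columns.

import pandas as pd

# names as the index, in the order they first appear in the file
new_df = pd.DataFrame(index=sample['Name'].unique())
new_df['Start'] = sample.groupby('Name')['Start'].min()
new_df['End'] = sample.groupby('Name')['End'].max()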