I have a CSV file that contains data like this.
Sample CSV:
Name | Start | End |
---|---|---|
John | 12:00 | 13:00 |
John | 12:10 | 13:00 |
John | 12:20 | 13:20 |
Tom | 12:00 | 13:10 |
John | 13:50 | 14:00 |
Jerry | 14:00 | 14:30 |
Alice | 15:00 | 16:00 |
Jerry | 11:00 | 15:00 |
I need to perform a merging operation such that:
Before merge
- John [12:00,13:00],[12:10,13:00],[12:20,13:20],[13:50,14:00]
- Jerry [14:00,14:30],[11:00,15:00]
- Tom [12:00,13:10]
- Alice [15:00,16:00]
After merge
- John [12:00,13:20],[13:50,14:00]
- Jerry [11:00,15:00]
- Tom [12:00,13:10]
- Alice [15:00,16:00]
I am trying to use Python's csv module for that:
import csv

# collect the [start, end] pairs for each name
intervals = {}
with open('log.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        name = row["Name"]
        # append every interval instead of keeping only the first one
        intervals.setdefault(name, []).append([row["Start"], row["End"]])

print(intervals)
for name, spans in intervals.items():
    print(name, spans)
I am not sure how to approach the merging part of the problem.
CodePudding user response:
With pandas it could look like this:

import pandas as pd

# read the file and sort so each name's intervals are in start order
df = (pd.read_csv('file_01.csv', parse_dates=['Start', 'End'])
        .sort_values(['Name', 'Start']))

# start a new group whenever an interval begins after everything seen
# so far has ended; cummax() guards against an interval that is fully
# contained in an earlier, longer one
df['split'] = (df.groupby('Name')
                 .apply(lambda x: (x.Start > x.End.cummax().shift().bfill()).cumsum())
                 .droplevel(0))

# within each (name, group) pair take the earliest start and latest end
res = (df.groupby(['Name', 'split'])
         .apply(lambda x: [x.Start.min().strftime('%H-%M'), x.End.max().strftime('%H-%M')])
         .groupby('Name')
         .apply(list)
         .to_dict())
>>> res
{'Alice': [['15-00', '16-00']],
 'Jerry': [['11-00', '15-00']],
 'John': [['12-00', '13-20'], ['13-50', '14-00']],
 'Tom': [['12-00', '13-10']]}
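If you would rather stay with the standard-library csv module from the question, the same sort-and-merge idea can be sketched without pandas. This is a minimal sketch, not part of the original answer: the file name 'log.csv' comes from the question, and comparing 'HH:MM' strings only works because the hours in the sample are zero-padded.

import csv
from collections import defaultdict

# collect the [start, end] pairs for each name
intervals = defaultdict(list)
with open('log.csv') as f:
    for row in csv.DictReader(f):
        intervals[row['Name']].append([row['Start'], row['End']])

# sort each name's intervals by start, then either extend the current
# span or open a new one when a gap appears
merged = {}
for name, spans in intervals.items():
    spans.sort()
    out = [spans[0][:]]
    for start, end in spans[1:]:
        if start <= out[-1][1]:                # overlaps the current span: extend it
            out[-1][1] = max(out[-1][1], end)
        else:                                  # gap: start a new span
            out.append([start, end])
    merged[name] = out

print(merged)
# e.g. {'John': [['12:00', '13:20'], ['13:50', '14:00']], 'Tom': [['12:00', '13:10']], ...}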
CodePudding user response:
pandas is especially well suited for this type of data manipulation. Also, holding your data as a dataframe is going to pay off later if you decide to move on to more complicated analysis.
You can use groupby() to group by a column ('Name' in this case) and then calculate things like max or min values. groupby is like filter in most spreadsheet programs, except that it has a name that actually describes what it's doing (as you would expect from both Python and spreadsheets).
And because you can convert the time strings into an actual time format, you can be sure that max and min are calculated on times rather than on raw strings.
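A quick illustration of why that matters (my example, not from the original answer): unpadded time strings compare lexicographically, so min/max on the raw strings can give the wrong answer, while parsed times compare chronologically.

import pandas as pd

times = ['9:00', '10:00']
print(max(times))                   # '9:00'  -- string comparison, wrong
print(pd.to_datetime(times).max())  # 10:00 (on today's date) -- correct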
import pandas as pd

# Create a dataframe from a csv file
sample = pd.read_csv('Sample.csv', parse_dates=['Start', 'End'])

# Keep only the time component of the parsed dates
sample['Start'] = sample['Start'].dt.time
sample['End'] = sample['End'].dt.time
>>> sample
index | Name | Start | End |
---|---|---|---|
0 | John | 12:00:00 | 13:00:00 |
1 | John | 12:10:00 | 13:00:00 |
2 | John | 12:20:00 | 13:20:00 |
3 | Tom | 12:00:00 | 13:10:00 |
4 | John | 13:50:00 | 14:00:00 |
5 | Jerry | 14:00:00 | 14:30:00 |
6 | Alice | 15:00:00 | 16:00:00 |
7 | Jerry | 11:00:00 | 15:00:00 |
# Create a new dataframe
new_df = pd.DataFrame()

# Create new_df's 'Start' and 'End' columns by grouping the original
# dataframe by 'Name' and taking the min and max of each group
new_df['Start'] = sample.groupby('Name')['Start'].min()
new_df['End'] = sample.groupby('Name')['End'].max()
>>> new_df
Name | Start | End |
---|---|---|
Alice | 15:00:00 | 16:00:00 |
Jerry | 11:00:00 | 15:00:00 |
John | 12:00:00 | 14:00:00 |
Tom | 12:00:00 | 13:10:00 |
You can also create the new dataframe using the names as the index: new_df = pd.DataFrame(index=sample['Name'].unique()), which seems to preserve the original order (at least when I tried it).
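For completeness, a minimal sketch of that variant, reusing the sample dataframe from above (my illustration): unique() returns the names in order of first appearance, and pandas aligns the groupby results on the 'Name' index when assigning the columns.

import pandas as pd

# names as the index, in the order they first appear in the file
new_df = pd.DataFrame(index=sample['Name'].unique())
new_df['Start'] = sample.groupby('Name')['Start'].min()
new_df['End'] = sample.groupby('Name')['End'].max()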