Merging rows in python csv


I have a csv file that contains data like this:

Sample csv

Name  Start End
John  12:00 13:00
John  12:10 13:00
John  12:20 13:20
Tom   12:00 13:10
John  13:50 14:00
Jerry 14:00 14:30
Alice 15:00 16:00
Jerry 11:00 15:00

I need to perform a merging operation such that:

Before merge

  • John [12:00,13:00],[12:10,13:00],[12:20,13:20],[13:50,14:00]
  • Jerry [14:00,14:30],[11:00,15:00]
  • Tom [12:00,13:10]
  • Alice [15:00,16:00]

After merge

  • John [12:00,13:20],[13:50,14:00]
  • Jerry [11:00,15:00]
  • Tom [12:00,13:10]
  • Alice [15:00,16:00]
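
In other words, intervals that overlap for the same name should be collapsed into one. A rough sketch of the rule I have in mind, working on an already sorted list of [start, end] pairs for a single person (just the merging step, not the csv handling; string comparison is fine here because the times are zero-padded HH:MM), would be:

def merge_intervals(intervals):
    # intervals: [start, end] pairs for one name, sorted by start time
    merged = [intervals[0][:]]
    for start, end in intervals[1:]:
        last = merged[-1]
        if start <= last[1]:
            # overlaps (or touches) the previous interval: extend it
            last[1] = max(last[1], end)
        else:
            # gap before this interval: keep it as a new one
            merged.append([start, end])
    return merged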

I am trying to use Python's csv module for that:

import csv

intervals = {}
with open('log.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        name = row["Name"]
        # collect every [Start, End] pair for this name
        intervals.setdefault(name, []).append([row["Start"], row["End"]])

for name, pairs in intervals.items():
    print(name, pairs)

I am not sure how to put the reading and the merging together, or whether this is the right approach at all.

CodePudding user response:

With pandas it could look like this:

import pandas as pd

# read the log and sort so each name's intervals are in chronological order
df = (pd.read_csv('file_01.csv',parse_dates=['Start','End']).
      sort_values(['Name','Start']))

# start a new group whenever an interval begins after the previous one ended
df['split'] = (df.groupby('Name').
               apply(lambda x: (x.Start>x.End.shift().bfill()).cumsum()).
               droplevel(0))

# collapse each group of overlapping intervals to its earliest start / latest end
res = (df.groupby(['Name','split']).
       apply(lambda x: [x.Start.min().strftime('%H-%M'),x.End.max().strftime('%H-%M')]).
       groupby('Name').
       apply(list).
       to_dict())

>>> res
{'Alice': [['15-00', '16-00']],
 'Jerry': [['11-00', '15-00']],
 'John': [['12-00', '13-20'], ['13-50', '14-00']],
 'Tom': [['12-00', '13-10']]}
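
If you need the merged result back in csv form rather than as a dict, one way would be to flatten res again with the standard csv module (merged.csv is just an example output name):

import csv

with open('merged.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Start', 'End'])
    for name, intervals in res.items():
        for start, end in intervals:
            writer.writerow([name, start, end])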

CodePudding user response:

pandas is especially well suited for this type of data manipulation. Also, holding your data as a DataFrame is going to pay off in the future if you decide to do more complicated analysis.

You can use groupby() to group by a column, 'Name' in this case, and then calculate things like the max or min value per group. groupby is a bit like the filter feature in most spreadsheet programs, except that it has a name that actually describes what it's doing.

And because you can convert the time strings into a proper time format, you can be sure that the max and min are calculated on actual times rather than plain strings.

import pandas as pd

# Create a dataframe from the csv file
sample = pd.read_csv('Sample.csv', parse_dates=['Start', 'End'])

# Adjust the format of the dates (keep only the time part)
sample['Start'] = sample['Start'].dt.time
sample['End'] = sample['End'].dt.time

>>> sample
    Name     Start       End
0   John  12:00:00  13:00:00
1   John  12:10:00  13:00:00
2   John  12:20:00  13:20:00
3    Tom  12:00:00  13:10:00
4   John  13:50:00  14:00:00
5  Jerry  14:00:00  14:30:00
6  Alice  15:00:00  16:00:00
7  Jerry  11:00:00  15:00:00

# Create new dataframe
new_df = pd.DataFrame()

# Create new_df's 'Start' and 'End' columns,
# grouping by 'Names' in the original dataframe
# and select max and min of each one
new_df['Start'] = sample.groupby('Name')['Start'].min()
new_df['End'] = sample.groupby('Name')['End'].max()

>>> new_df
          Start       End
Name
Alice  15:00:00  16:00:00
Jerry  11:00:00  15:00:00
John   12:00:00  14:00:00
Tom    12:00:00  13:10:00

You can also create the new dataframe using the names as the index: new_df = pd.DataFrame(index=sample['Name'].unique()), which seems to preserve the original order of the names (at least when I tried it), as sketched below.
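
A sketch of that variant, reusing the sample dataframe from above (same column names as in the csv; just an illustration):

new_df = pd.DataFrame(index=sample['Name'].unique())
new_df.index.name = 'Name'

# the groupby results are indexed by name, so they align with the index above
new_df['Start'] = sample.groupby('Name')['Start'].min()
new_df['End'] = sample.groupby('Name')['End'].max()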
