Iterating with a pair of rows when working with DataFrame

Time:04-16

To iterate through each line of a DataFrame I use .iterrows():

list_soccer = pd.DataFrame({
    'EventName': [obj_event.event.name for obj_event in matches],
    'IDEvent': [obj_event.event.id for obj_event in matches],
    'LocalEvent': [obj_event.event.venue for obj_event in matches],
    'CodeCountry': [obj_event.event.country_code for obj_event in matches],
    'TimeZone': [obj_event.event.time_zone for obj_event in matches],
    'OpenDate': [obj_event.event.open_date for obj_event in matches],
    'Total_Market': [obj_event.market_count for obj_event in matches],
    'Local_Date': [obj_evento.event.open_date.replace(tzinfo=datetime.timezone.utc).astimezone(tz=None) 
                            for obj_evento in matches]
    })

for_iterate = list_soccer.reset_index()
for_iterate = for_iterate[for_iterate['EventName'].str.contains(" v ")]
data_for_compare = (datetime.datetime.utcnow()).strftime("%Y-%m-%d %H:%M")
for_iterate = for_iterate[for_iterate['OpenDate'] >= data_for_compare]
    
for index, example_dataframe in for_iterate.iterrows():
    multiprocessing.Process(target=add_hour, args=(example_dataframe,))

As I need to double the speed of this iteration (by running two processes at the same time), I'm looking for a way to handle two rows at a time.

If it were a regular list (I give the example below only to show what I need; I understand that a list and a DataFrame are not the same thing), I could do it like this:

a_list = ['a','b','c','d']
a_pairs = [a_list[i:i+2] for i in range(0, len(a_list)-1, 2)]
# a_pairs = [['a','b'],['c','d']]
for a, b in a_pairs:
    multiprocessing.Process(target=add_hour, args=(a,))
    multiprocessing.Process(target=add_hour, args=(b,))

How should I proceed with DataFrame to work with two rows at the same time?

In this question, I found two answers but they deliver options that repeat values inside the DataFrame:
Pandas iterate over DataFrame row pairs

What I have not been able to build is a version in which rows are not repeated: use rows 0 and 1, then 2 and 3, then 4 and 5. Someone may say this question is a duplicate, but my need is different, and I was not able to adapt those answers to it.

CodePudding user response:

You should be able to split the DataFrame in two, using indexing similar to what you would use on a list.

Then you can iterate over both halves at once, which gives you two rows at a time, in order (0 and 1, then 2 and 3, and so on):

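df_a = for_iterate.iloc[::2] # Rows at even positions (0, 2, 4, ...)
df_b = for_iterate.iloc[1::2] # Rows at odd positions (1, 3, 5, ...)

for (_, example_dataframe_a), (_, example_dataframe_b) in zip(df_a.iterrows(), df_b.iterrows()):
    process_a = multiprocessing.Process(target=add_hour, args=(example_dataframe_a,))
    process_b = multiprocessing.Process(target=add_hour, args=(example_dataframe_b,))
    # Process() only creates the object; start() launches it
    process_a.start()
    process_b.start()
    # Wait for both to finish before dispatching the next pair
    process_a.join()
    process_b.join()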

(Although it's unclear to me why you need to spawn a process for each row of the dataframe, rather than two processes, one for each half of for_iterate).
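
One thing to watch with the zip() approach: zip() stops at the shorter input, so if for_iterate has an odd number of rows, the last row is never processed. Below is a minimal sketch of one way around that, using itertools.zip_longest. The helper name iter_row_pairs and the toy DataFrame are made up for illustration; they are not from the question.

```python
from itertools import zip_longest

import pandas as pd


def iter_row_pairs(df):
    """Yield (row_a, row_b) for rows 0-1, 2-3, ...; row_b is None for a leftover row."""
    evens = (row for _, row in df.iloc[::2].iterrows())
    odds = (row for _, row in df.iloc[1::2].iterrows())
    # zip_longest pads the shorter side with None instead of dropping the last row
    yield from zip_longest(evens, odds)


# Toy data standing in for for_iterate (values are made up for illustration)
demo = pd.DataFrame({"EventName": ["A v B", "C v D", "E v F"]})
pairs = [
    (a["EventName"], None if b is None else b["EventName"])
    for a, b in iter_row_pairs(demo)
]
```

Inside the loop you would then need to check for None before starting the second process.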

Alternatively:

You could try using multiprocessing.Pool.map() to perform two requests at once. Unlike the above approach, a new request would be made as soon as a previous one completes (so it wouldn't wait for both to finish before dispatching the next two), and only two processes would be needed which could be re-used:

from multiprocessing import Pool

def add_hour_wrapper(data):
    # iterrows() yields (index, row) tuples; we only want the row
    _, row = data
    return add_hour(row)

pool = Pool(2) # 2 processes

# Pass the wrapper, not add_hour itself, so each worker receives just the row
pool.map(add_hour_wrapper, for_iterate.iterrows())
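
For reference, here is a self-contained sketch of the Pool idea that can be dropped into a script. The add_hour body is a hypothetical stand-in (the real function is not shown in the question), and process_rows is a name I made up. The functions are defined at module level so the worker processes can import them.

```python
from multiprocessing import Pool

import pandas as pd


def add_hour(row):
    # Hypothetical stand-in for the question's add_hour; the real body is not shown
    return f"processed {row['EventName']}"


def add_hour_wrapper(item):
    # iterrows() yields (index, row) tuples; only the row is passed on
    _, row = item
    return add_hour(row)


def process_rows(df, processes=2):
    # The context manager shuts the workers down once map() has returned
    with Pool(processes) as pool:
        return pool.map(add_hour_wrapper, df.iterrows())


# Typical entry point (the guard matters on platforms that spawn workers):
# if __name__ == "__main__":
#     results = process_rows(for_iterate)
```

pool.map() returns the results in the same order as the rows, so they can be matched back to for_iterate if needed.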