I'm trying to avoid using iterrows() in pandas and achieve a more performant solution. This is the code I have, where I loop through a DataFrame and for each record I need to add three more:
import pandas as pd
fruit_data = pd.DataFrame({
    'fruit': ['apple','orange','pear','orange'],
    'color': ['red','orange','green','green'],
    'weight': [5,6,3,4]
})
array = []
for index, row in fruit_data.iterrows():
    # first record for this row, sequence 0
    row2 = { 'fruit_2': row['fruit'], 'sequence': 0}
    array.append(row2)
    # two more records, sequence 1 and 2
    for i in range(2):
        row2 = { 'fruit_2': row['fruit'], 'sequence': i + 1}
        array.append(row2)
print(array)
My real DataFrame has millions of records. Is there a way to optimize this code and NOT use iterrows() or for loops?
CodePudding user response:
You could use repeat to repeat each fruit 3 times; then groupby + cumcount to assign sequence numbers; finally to_dict for the final output:
tmp = fruit_data['fruit'].repeat(3).reset_index(name='fruit_2')
tmp['sequence'] = tmp.groupby('index').cumcount()
out = tmp.drop(columns='index').to_dict('records')
Output:
[{'fruit_2': 'apple', 'sequence': 0},
{'fruit_2': 'apple', 'sequence': 1},
{'fruit_2': 'apple', 'sequence': 2},
{'fruit_2': 'orange', 'sequence': 0},
{'fruit_2': 'orange', 'sequence': 1},
{'fruit_2': 'orange', 'sequence': 2},
{'fruit_2': 'pear', 'sequence': 0},
{'fruit_2': 'pear', 'sequence': 1},
{'fruit_2': 'pear', 'sequence': 2},
{'fruit_2': 'orange', 'sequence': 0},
{'fruit_2': 'orange', 'sequence': 1},
{'fruit_2': 'orange', 'sequence': 2}]
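As a sanity check, the repeat / groupby / cumcount approach can be compared against the loop from the question on the sample data. This is a minimal sketch, not a benchmark. It assumes the DataFrame has the default unnamed, unique index: with duplicate index labels, groupby('index') would merge rows and the sequence numbers would run past 2, and with a named index the reset column would not be called 'index'.

```python
import pandas as pd

fruit_data = pd.DataFrame({
    'fruit': ['apple', 'orange', 'pear', 'orange'],
    'color': ['red', 'orange', 'green', 'green'],
    'weight': [5, 6, 3, 4],
})

# Reference result: the same records the question's loop produces.
expected = []
for _, row in fruit_data.iterrows():
    for seq in range(3):
        expected.append({'fruit_2': row['fruit'], 'sequence': seq})

# Vectorized version: repeat each value 3 times, then number the copies
# within each original index label (assumes a unique, unnamed index).
tmp = fruit_data['fruit'].repeat(3).reset_index(name='fruit_2')
tmp['sequence'] = tmp.groupby('index').cumcount()
out = tmp.drop(columns='index').to_dict('records')

# Compare the two lists of records.
print(out == expected)
```

If the index is not unique, calling fruit_data.reset_index(drop=True) first should restore the assumption.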
CodePudding user response:
Try this out:
import numpy as np
array = (
    fruit_data['fruit']
    .repeat(3)
    .to_frame(name='fruit_2')
    .set_index(np.tile(np.arange(3), len(fruit_data['fruit'])))
    .reset_index()
    .rename({'index':'sequence'},axis=1)
    [['fruit_2', 'sequence']]
    .to_dict('records')
)
Output:
>>> array
[{'fruit_2': 'apple', 'sequence': 0},
{'fruit_2': 'apple', 'sequence': 1},
{'fruit_2': 'apple', 'sequence': 2},
{'fruit_2': 'orange', 'sequence': 0},
{'fruit_2': 'orange', 'sequence': 1},
{'fruit_2': 'orange', 'sequence': 2},
{'fruit_2': 'pear', 'sequence': 0},
{'fruit_2': 'pear', 'sequence': 1},
{'fruit_2': 'pear', 'sequence': 2},
{'fruit_2': 'orange', 'sequence': 0},
{'fruit_2': 'orange', 'sequence': 1},
{'fruit_2': 'orange', 'sequence': 2}]
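The same idea can also be written with the two columns built directly from NumPy arrays, which avoids the set_index / reset_index round trip and does not depend on the DataFrame's index at all. The sketch below is one possible variant, not either answerer's code; expand_rows and its n parameter are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd


def expand_rows(df, n=3):
    """Hypothetical helper: repeat each 'fruit' value n times and
    attach a sequence number 0..n-1 to the copies."""
    return pd.DataFrame({
        # apple, apple, apple, orange, orange, orange, ...
        'fruit_2': np.repeat(df['fruit'].to_numpy(), n),
        # 0, 1, 2, 0, 1, 2, ... once per original row
        'sequence': np.tile(np.arange(n), len(df)),
    })


fruit_data = pd.DataFrame({
    'fruit': ['apple', 'orange', 'pear', 'orange'],
    'color': ['red', 'orange', 'green', 'green'],
    'weight': [5, 6, 3, 4],
})

expanded = expand_rows(fruit_data)
print(expanded)

# Convert to a list of dicts only if the downstream code needs it.
records = expanded.to_dict('records')
```

With millions of rows, to_dict('records') creates one Python dict per row, so that final conversion may well take longer than the expansion itself. If the later steps can work on the DataFrame directly, keeping expanded as it is would likely be cheaper.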