I've got a dataframe column which represents the order in which fruit was bought at a supermarket. The dataframe looks something like this:
mydict ={
'customer': ['Jack', 'Danny', 'Alex'],
'fruit_bought': ['apple#orange#apple', 'orange#apple', 'apple#banana#banana'],
}
df = pd.DataFrame(mydict)
customer | fruit_bought
-----------------------------
Jack | apple#orange#apple
Danny | orange#apple
Alex | apple#banana#banana
What I'd like to do is reduce the strings into the combination of unique fruit that the customer bought, which would look like this:
customer | fruit_bought
---------------------
Jack | apple#orange
Danny | apple#orange
Alex | apple#banana
I'm sure I can put together a long-winded apply
function to help with this, but I'm looking at 200,000 rows of data so I'd rather avoid using apply
here in favour of a vectorized approach. Can anyone please help me with this?
CodePudding user response:
You can use map
>>> df = pd.DataFrame(mydict)
>>> df
customer fruit_bought
0 Jack apple#orange#apple
1 Danny orange#apple
2 Alex apple#banana#banana
>>> df['Unique'] = df.fruit_bought.str.split('#').map(set).str.join('#')
>>> df
customer fruit_bought Unique
0 Jack apple#orange#apple apple#orange
1 Danny orange#apple apple#orange
2 Alex apple#banana#banana apple#banana
CodePudding user response:
Use map
or list comprehension:
df['fruit_bought'] = df['fruit_bought'].map(lambda x: '#'.join(set(x.split('#'))))
Or:
df['fruit_bought'] = ['#'.join(set(x.split('#'))) for x in df['fruit_bought']]
Output is same like solution with str.split
, because in sets are not defined order:
mydict ={
'customer': ['Jack', 'Danny', 'Alex'],
'fruit_bought': ['apple#orange#apple', 'orange#apple', 'apple#banana#banana'],
}
df = pd.DataFrame(mydict)
df['Unique'] = df.fruit_bought.str.split('#').map(set).str.join('#')
df['fruit_bought1'] = df['fruit_bought'].map(lambda x: '#'.join(set(x.split('#'))))
df['fruit_bought2'] = ['#'.join(set(x.split('#'))) for x in df['fruit_bought']]
df['fruit_bought3'] = df['fruit_bought'].map(lambda x: '#'.join(sorted(set(x.split('#')))))
df['fruit_bought4'] = ['#'.join(sorted(set(x.split('#')))) for x in df['fruit_bought']]
print (df)
customer fruit_bought Unique fruit_bought1 fruit_bought2 \
0 Jack apple#orange#apple orange#apple orange#apple orange#apple
1 Danny orange#apple orange#apple orange#apple orange#apple
2 Alex apple#banana#banana banana#apple banana#apple banana#apple
fruit_bought3 fruit_bought4
0 apple#orange apple#orange
1 apple#orange apple#orange
2 apple#banana apple#banana
With pandas functions it is slowier obviously:
df['fruit_bought'] = df.assign(fruit_bought = df['fruit_bought'].str.split('#')).reset_index().explode('fruit_bought').drop_duplicates(['index','fruit_bought']).groupby('index')['fruit_bought'].agg('#'.join)
print (df)
customer fruit_bought
0 Jack apple#orange
1 Danny orange#apple
2 Alex apple#banana
#30k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [31]: %timeit df['Unique'] = df.fruit_bought.str.split('#').map(set).str.join('#')
63.9 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [32]: %timeit df['fruit_bought'] = df['fruit_bought'].map(lambda x: '#'.join(set(x.split('#'))))
34.7 ms ± 4.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [33]: %timeit df['fruit_bought'] = ['#'.join(set(x.split('#'))) for x in df['fruit_bought']]
35.6 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [34]: %timeit df['fruit_bought'] = df.assign(fruit_bought = df['fruit_bought'].str.split('#')).reset_index().explode('fruit_bought').drop_duplicates(['index','fruit_bought']).groupby('index')['fruit_bought'].agg('#'.join)
497 ms ± 7.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#300k rows
df = pd.concat([df] * 100000, ignore_index=True)
In [37]: %timeit df['Unique'] = df.fruit_bought.str.split('#').map(set).str.join('#')
549 ms ± 9.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [38]: %timeit df['fruit_bought'] = df['fruit_bought'].map(lambda x: '#'.join(set(x.split('#'))))
316 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [39]: %timeit df['fruit_bought'] = df['fruit_bought'].apply(lambda x: '#'.join(set(x.split('#'))))
325 ms ± 28.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [40]: %timeit df['fruit_bought'] = ['#'.join(set(x.split('#'))) for x in df['fruit_bought']]
316 ms ± 8.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [41]: %timeit df['fruit_bought'] = df.assign(fruit_bought = df['fruit_bought'].str.split('#')).reset_index().explode('fruit_bought').drop_duplicates(['index','fruit_bought']).groupby('index')['fruit_bought'].agg('#'.join)
5.25 s ± 326 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)