How to reduce dataframe column strings into core variations?

I've got a dataframe column which represents the order in which fruit was bought at a supermarket. The dataframe looks something like this:

import pandas as pd

mydict = {
    'customer': ['Jack', 'Danny', 'Alex'],
    'fruit_bought': ['apple#orange#apple', 'orange#apple', 'apple#banana#banana'],
}

df = pd.DataFrame(mydict)

customer | fruit_bought
-----------------------------
Jack     | apple#orange#apple
Danny    | orange#apple
Alex     | apple#banana#banana

What I'd like to do is reduce each string to the unique fruits the customer bought, which would look like this:

customer | fruit_bought
---------------------
Jack     | apple#orange
Danny    | apple#orange
Alex     | apple#banana

I'm sure I can put together a long-winded apply function to help with this, but I'm looking at 200,000 rows of data so I'd rather avoid using apply here in favour of a vectorized approach. Can anyone please help me with this?

CodePudding user response:

You can use map: split each string on '#', turn the resulting list into a set, and join back with '#':

>>> df = pd.DataFrame(mydict)
>>> df
  customer         fruit_bought
0     Jack   apple#orange#apple
1    Danny         orange#apple
2     Alex  apple#banana#banana
>>> df['Unique'] = df.fruit_bought.str.split('#').map(set).str.join('#')
>>> df
  customer         fruit_bought        Unique
0     Jack   apple#orange#apple  apple#orange
1    Danny         orange#apple  apple#orange
2     Alex  apple#banana#banana  apple#banana
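
One caveat: set iteration order is arbitrary, so on another run the joined string may come out as orange#apple instead of apple#orange. If you need a stable result, a minimal variation of the same chain sorts before joining:

>>> df['Unique'] = df.fruit_bought.str.split('#').map(lambda s: '#'.join(sorted(set(s))))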

CodePudding user response:

Use map or a list comprehension:

df['fruit_bought'] = df['fruit_bought'].map(lambda x: '#'.join(set(x.split('#'))))

Or:

df['fruit_bought'] = ['#'.join(set(x.split('#'))) for x in df['fruit_bought']]

The output is the same as the str.split solution. Note that sets have no defined order, so the order of fruits within each joined string may vary; add sorted if you need a deterministic order:

mydict = {
    'customer': ['Jack', 'Danny', 'Alex'],
    'fruit_bought': ['apple#orange#apple', 'orange#apple', 'apple#banana#banana'],
}

df = pd.DataFrame(mydict)


df['Unique'] = df.fruit_bought.str.split('#').map(set).str.join('#')

df['fruit_bought1'] = df['fruit_bought'].map(lambda x: '#'.join(set(x.split('#'))))
df['fruit_bought2'] = ['#'.join(set(x.split('#'))) for x in df['fruit_bought']]

df['fruit_bought3'] = df['fruit_bought'].map(lambda x: '#'.join(sorted(set(x.split('#')))))
df['fruit_bought4'] = ['#'.join(sorted(set(x.split('#')))) for x in df['fruit_bought']]
print (df)
  customer         fruit_bought        Unique fruit_bought1 fruit_bought2  \
0     Jack   apple#orange#apple  orange#apple  orange#apple  orange#apple   
1    Danny         orange#apple  orange#apple  orange#apple  orange#apple   
2     Alex  apple#banana#banana  banana#apple  banana#apple  banana#apple   

  fruit_bought3 fruit_bought4  
0  apple#orange  apple#orange  
1  apple#orange  apple#orange  
2  apple#banana  apple#banana  

Using only pandas functions (explode + groupby) is, unsurprisingly, slower:

df['fruit_bought'] = (
    df.assign(fruit_bought=df['fruit_bought'].str.split('#'))
      .reset_index()
      .explode('fruit_bought')
      .drop_duplicates(['index', 'fruit_bought'])
      .groupby('index')['fruit_bought']
      .agg('#'.join)
)
print (df)
  customer  fruit_bought
0     Jack  apple#orange
1    Danny  orange#apple
2     Alex  apple#banana

#30k rows
df = pd.concat([df] * 10000, ignore_index=True)


In [31]: %timeit df['Unique'] = df.fruit_bought.str.split('#').map(set).str.join('#')
63.9 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [32]: %timeit df['fruit_bought'] = df['fruit_bought'].map(lambda x: '#'.join(set(x.split('#'))))
34.7 ms ± 4.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [33]: %timeit df['fruit_bought'] = ['#'.join(set(x.split('#'))) for x in df['fruit_bought']]
35.6 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [34]: %timeit df['fruit_bought'] = df.assign(fruit_bought = df['fruit_bought'].str.split('#')).reset_index().explode('fruit_bought').drop_duplicates(['index','fruit_bought']).groupby('index')['fruit_bought'].agg('#'.join)
497 ms ± 7.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#300k rows (multiplying the original 3-row frame)
df = pd.concat([df] * 100000, ignore_index=True)

In [37]: %timeit df['Unique'] = df.fruit_bought.str.split('#').map(set).str.join('#')
549 ms ± 9.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [38]: %timeit df['fruit_bought'] = df['fruit_bought'].map(lambda x: '#'.join(set(x.split('#'))))
316 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [39]: %timeit df['fruit_bought'] = df['fruit_bought'].apply(lambda x: '#'.join(set(x.split('#'))))
325 ms ± 28.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [40]: %timeit df['fruit_bought'] = ['#'.join(set(x.split('#'))) for x in df['fruit_bought']]
316 ms ± 8.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [41]: %timeit df['fruit_bought'] = df.assign(fruit_bought = df['fruit_bought'].str.split('#')).reset_index().explode('fruit_bought').drop_duplicates(['index','fruit_bought']).groupby('index')['fruit_bought'].agg('#'.join)
5.25 s ± 326 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
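
A caveat when reproducing these timings: the %timeit lines that assign back to fruit_bought overwrite the column with already-deduplicated strings after the first pass, so later iterations run on shorter input. A small sketch (raw is a hypothetical name) that keeps the original column aside and times against it instead:

raw = df['fruit_bought'].copy()   # untouched input, reused by every run

%timeit ['#'.join(set(x.split('#'))) for x in raw]        # list comprehension
%timeit raw.map(lambda x: '#'.join(set(x.split('#'))))    # Series.map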