Replace values with randomly selected values in a pandas DataFrame

Python 3.6, pandas 1.1.5 on Windows 10.

I am trying to optimize the code below for better performance on a large dataset.
Purpose: if a cell contains several values separated by spaces, randomly select a single one of them.

For example, from:

    col1    col2    col3
0   a       a b c   a c
1   a b     c       a
2   a b c   b       b

to:

    col1    col2    col3
0   a       b       c
1   a       c       a
2   b       b       b

So far:

import itertools

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a b', 'a b c'],
                   'col2': ['a b c', 'c', 'b'],
                   'col3': ['a c', 'a', 'b']})

# flatten all cells into a single list, row by row
vals = list(itertools.chain.from_iterable(df.values))
vals_ = []

# randomly select a single value from each data point
for v in vals:
    v = v.split(' ')
    a = np.random.choice(len(v), 1)[0]
    v = v[a]
    vals_.append(v) 

gf = pd.DataFrame(np.array(vals_).reshape(df.shape),
                  index=df.index,
                  columns=df.columns)

This is not fast on a large dataset. Any lead will be appreciated.

CodePudding user response：

You can define a function and apply it to every cell of the DataFrame via applymap.

The function could be implemented as

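def rndVal(x: str):
    # a single-character cell cannot contain a separator, so return it as is
    if len(x) > 1:
        x = x.split(' ')
        # pick a random position in the list of values
        a = np.random.choice(len(x), 1)[0]
        return x[a]
    else:
        return x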

and is applied with

df.applymap(rndVal)

which returns a DataFrame of the same shape with one randomly chosen value in each cell.

Regarding performance: on a DataFrame with 300,000 rows, your attempt takes 18.6 s, while the applymap solution takes only 8.4 s.
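A possible further tweak, offered as a sketch rather than a measured result: np.random.choice tends to carry noticeable per-call overhead on tiny inputs, so the standard library's random.choice may be cheaper when it is called once per cell. The helper name pick_one is made up for this example, and the code assumes every cell is a non-empty string. It is worth timing on your own data before relying on it.

```python
import random

import pandas as pd


def pick_one(cell: str) -> str:
    """Return one randomly chosen whitespace-separated value from cell.

    Assumes cell is a non-empty string (hypothetical helper, not from the post).
    """
    return random.choice(cell.split())


df = pd.DataFrame({
    "col1": ["a", "a b", "a b c"],
    "col2": ["a b c", "c", "b"],
    "col3": ["a c", "a", "b"],
})

# DataFrame.applymap was renamed to DataFrame.map in pandas 2.1,
# so use whichever one the installed version provides.
elementwise = df.map if hasattr(df, "map") else df.applymap
result = elementwise(pick_one)
print(result)
```

The output differs from run to run; call random.seed(...) first if you need repeatable picks.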

CodePudding user response：

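A fast pandas approach

Stack the DataFrame into a Series, split the strings and explode them into one row per value, group by both levels of the resulting MultiIndex, draw a sample of size 1 per group, then unstack to restore the original shape. The output below is one possible result, since the draw is random.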

(
    df.stack().str.split().explode()
      .groupby(level=[0, 1]).sample(1).unstack()
)

  col1 col2 col3
0    a    a    c
1    b    c    a
2    a    b    b
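To make the chain above easier to follow, here is the same idea split into named intermediate steps. The variable names, the random_state value, and the final reindex (a guard against any reordering by unstack) are additions for this sketch. Note that groupby(...).sample needs pandas 1.1 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a b", "a b c"],
    "col2": ["a b c", "c", "b"],
    "col3": ["a c", "a", "b"],
})

# 1. One row per original cell, indexed by (row label, column name).
stacked = df.stack()

# 2. Each cell becomes a list of its whitespace-separated values.
tokens = stacked.str.split()

# 3. One row per value; the (row label, column name) index repeats.
exploded = tokens.explode()

# 4. Draw exactly one value per original cell.
#    random_state is only set to make this sketch repeatable.
picked = exploded.groupby(level=[0, 1]).sample(n=1, random_state=0)

# 5. Back to a rows x columns layout, in the original order.
result = picked.unstack().reindex(index=df.index, columns=df.columns)
print(result)
```

Dropping random_state gives a fresh random draw on every call.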