Create new column from existing one with more values in it

I have a column with the following values:

import pandas as pd

d = {'id': [1, 2, 3, 4, 5],
     'value': [['Red', 'Blue', 'Yellow'],
               ['Blue', 'Yellow', 'Orange'],
               ['Green', 'Purple', 'Yellow', 'Red'],
               ['Violet', 'Blue', 'Green', 'Red', 'Brown'],
               ['Blue', 'Green']]}

df = pd.DataFrame(data = d)

And I want to break the column values (lists of strings) down into adjacent pairs, forming a new column or list like this:

d = {'value': [['Red', 'Blue'],
               ['Blue', 'Yellow'],
               ['Blue', 'Yellow'],
               ['Yellow', 'Orange'],
               ['Green', 'Purple'],
               ['Purple', 'Yellow'],
               ['Yellow', 'Red'],
               ['Violet', 'Blue'],
               ['Blue', 'Green'],
               ['Green', 'Red'],
               ['Red', 'Brown'],
               ['Blue', 'Green']]}

df = pd.DataFrame(data = d)

I do the splitting with apply(lambda x: ...), but it returns only one pair of values per row.

def splitter(row):
    for first, second in zip(row, row[1:]):
        return [first, second]

pairs_list = df_gr.status.apply(lambda x: splitter(x))

I know it can be done with an iterrows() loop, but I'd like to know a more efficient method.

CodePudding user response：

Use a list comprehension with a sliding-window function and pass the result to the DataFrame constructor:

from itertools import islice

#https://stackoverflow.com/a/6822773/2901002
def window(seq, n=2):
    """Returns a sliding window (of width n) over data from the iterable.

    s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...
    """
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        # drop the oldest element and append the new one
        result = result[1:] + (elem,)
        yield result
        
df = pd.DataFrame({'new': [list(y) for x in df['value'] for y in window(x)]})
print (df)
                 new
0        [Red, Blue]
1     [Blue, Yellow]
2     [Blue, Yellow]
3   [Yellow, Orange]
4    [Green, Purple]
5   [Purple, Yellow]
6      [Yellow, Red]
7     [Violet, Blue]
8      [Blue, Green]
9       [Green, Red]
10      [Red, Brown]
11     [Blue, Green]

Or, more simply, adapt another solution that slices directly (possible here because the values are nested lists):

window_size = 2

#https://stackoverflow.com/a/6822773/2901002
df = pd.DataFrame({'new': [x[i: i + window_size] for x in df['value']
                           for i in range(len(x) - window_size + 1)]})
print (df)
                 new
0        [Red, Blue]
1     [Blue, Yellow]
2     [Blue, Yellow]
3   [Yellow, Orange]
4    [Green, Purple]
5   [Purple, Yellow]
6      [Yellow, Red]
7     [Violet, Blue]
8      [Blue, Green]
9       [Green, Red]
10      [Red, Brown]
11     [Blue, Green]
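
For reference, the splitter in the question yields only one pair because its return statement sits inside the for loop, so the function exits on the first iteration. A minimal sketch of a variant that collects every adjacent pair, shown on plain lists rather than the DataFrame (the `rows` sample is a shortened, illustrative copy of the question's data):

```python
def splitter(row):
    # Build all adjacent pairs at once; returning inside the loop
    # would stop after the first pair.
    return [[first, second] for first, second in zip(row, row[1:])]

# Shortened, illustrative copy of the question's data.
rows = [['Red', 'Blue', 'Yellow'], ['Blue', 'Green']]

# Flatten the per-row pair lists into a single list of pairs.
pairs = [pair for row in rows for pair in splitter(row)]
print(pairs)
# [['Red', 'Blue'], ['Blue', 'Yellow'], ['Blue', 'Green']]
```

If the function is applied per row with apply, each row then holds a list of pairs, which would still need flattening (for example with Series.explode) to get one pair per row.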

CodePudding user response：

If you have Python 3.10 or later installed, you can use itertools.pairwise directly; otherwise, the function below, copied from the itertools documentation, suffices:

from itertools import chain, tee

def pairwise(iterable):
    # pairwise('ABCDEFG') --> AB BC CD DE EF FG
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)
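
One way to support both cases in the same script is to try the built-in import first and fall back to the recipe. This is only a sketch of that pattern, not something the answer itself prescribes; the sample list is one row taken from the question's data:

```python
try:
    # Available in the standard library from Python 3.10 onward.
    from itertools import pairwise
except ImportError:
    from itertools import tee

    def pairwise(iterable):
        # Fallback following the recipe in the itertools documentation.
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)

print(list(pairwise(['Green', 'Purple', 'Yellow', 'Red'])))
# [('Green', 'Purple'), ('Purple', 'Yellow'), ('Yellow', 'Red')]
```

Either branch yields tuples, so a later conversion to lists is still needed if lists are wanted.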

Dump the value column into a list, transform it, and recreate the DataFrame:

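out = df.value.tolist()           # list of lists, one per row
out = map(pairwise, out)          # adjacent pairs within each row
out = chain.from_iterable(out)    # flatten into a single stream of pairs
out = map(list, out)              # tuples -> lists
out = [[ent] for ent in out]      # wrap each pair so it fills a single column
pd.DataFrame(out, columns=['value'])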

               value
0        [Red, Blue]
1     [Blue, Yellow]
2     [Blue, Yellow]
3   [Yellow, Orange]
4    [Green, Purple]
5   [Purple, Yellow]
6      [Yellow, Red]
7     [Violet, Blue]
8      [Blue, Green]
9       [Green, Red]
10      [Red, Brown]
11     [Blue, Green]

More often than not, when you are wrangling native Python structures such as lists, you are likely to get better performance by handling them in plain Python rather than in pandas.