I have a column with the following values:
import pandas as pd

d = {'id': [1, 2, 3, 4, 5],
     'value': [['Red', 'Blue', 'Yellow'],
               ['Blue', 'Yellow', 'Orange'],
               ['Green', 'Purple', 'Yellow', 'Red'],
               ['Violet', 'Blue', 'Green', 'Red', 'Brown'],
               ['Blue', 'Green']]}
df = pd.DataFrame(data=d)
I want to break each value (a list of strings) into consecutive pairs, forming a new column or list like this:
d = {'value': [['Red', 'Blue'],
               ['Blue', 'Yellow'],
               ['Blue', 'Yellow'],
               ['Yellow', 'Orange'],
               ['Green', 'Purple'],
               ['Purple', 'Yellow'],
               ['Yellow', 'Red'],
               ['Violet', 'Blue'],
               ['Blue', 'Green'],
               ['Green', 'Red'],
               ['Red', 'Brown'],
               ['Blue', 'Green']]}
df = pd.DataFrame(data=d)
I do the splitting with apply(lambda x: ...), but it returns only one pair of values per row.
def splitter(row):
    for first, second in zip(row, row[1:]):
        return [first, second]

pairs_list = df_gr.status.apply(lambda x: splitter(x))
I know that it can be done with an iterrows() loop, but I'd like to know a more efficient method.
CodePudding user response:
Use a list comprehension with a sliding-window function and pass the result to the DataFrame constructor:
from itertools import islice

# https://stackoverflow.com/a/6822773/2901002
def window(seq, n=2):
    """Return a sliding window (of width n) over data from the iterable.

    s -> (s0, s1, ..., s[n-1]), (s1, s2, ..., sn), ...
    """
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        # drop the oldest element and append the newest one
        result = result[1:] + (elem,)
        yield result

df = pd.DataFrame({'new': [list(y) for x in df['value'] for y in window(x)]})
print(df)
new
0 [Red, Blue]
1 [Blue, Yellow]
2 [Blue, Yellow]
3 [Yellow, Orange]
4 [Green, Purple]
5 [Purple, Yellow]
6 [Yellow, Red]
7 [Violet, Blue]
8 [Blue, Green]
9 [Green, Red]
10 [Red, Brown]
11 [Blue, Green]
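Two properties of the window recipe are worth knowing before applying it to real data: a row shorter than the window yields nothing, and the width is adjustable through n. The sketch below repeats the recipe so that it runs on its own; the sample rows are hypothetical and chosen only to show those two cases.

```python
from itertools import islice


def window(seq, n=2):
    """Yield successive overlapping tuples of width n from seq."""
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result


# Hypothetical sample rows, used only to illustrate the edge cases.
short_row = ['Red']
long_row = ['Red', 'Blue', 'Yellow', 'Orange']

no_pairs = list(window(short_row))      # row shorter than the window
triples = list(window(long_row, n=3))   # wider window
```

Here no_pairs is an empty list, so a single-element row simply contributes no rows to the result, and triples holds two overlapping 3-tuples.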
Or, more simply, adapt another solution (this works because the values are nested lists, which can be sliced):
window_size = 2

# https://stackoverflow.com/a/6822773/2901002
# df here is the original frame that still has the 'value' column
df = pd.DataFrame({'new': [x[i: i + window_size] for x in df['value']
                           for i in range(len(x) - window_size + 1)]})
print(df)
new
0 [Red, Blue]
1 [Blue, Yellow]
2 [Blue, Yellow]
3 [Yellow, Orange]
4 [Green, Purple]
5 [Purple, Yellow]
6 [Yellow, Red]
7 [Violet, Blue]
8 [Blue, Green]
9 [Green, Red]
10 [Red, Brown]
11 [Blue, Green]
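As for why the splitter function in the question returns only one pair: the return statement sits inside the for loop, so the function exits on the first iteration. A minimal sketch of a fix, assuming plain Python lists as input (the name values is a stand-in for the column contents, not something from the original code), collects every pair before returning:

```python
def splitter(row):
    """Return every consecutive pair in row, not just the first one."""
    return [[first, second] for first, second in zip(row, row[1:])]


# Same rows as the 'value' column in the question.
values = [['Red', 'Blue', 'Yellow'],
          ['Blue', 'Yellow', 'Orange'],
          ['Green', 'Purple', 'Yellow', 'Red'],
          ['Violet', 'Blue', 'Green', 'Red', 'Brown'],
          ['Blue', 'Green']]

# Flatten the per-row pair lists into one flat list of pairs.
pairs_list = [pair for row in values for pair in splitter(row)]
```

If you prefer to stay in pandas, the same function should also work with df['value'].apply(splitter) followed by Series.explode(), which turns each pair into its own row.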
CodePudding user response:
If you have Python 3.10 or later, you can use itertools.pairwise directly; otherwise, the function below, copied from the itertools documentation, suffices:
from itertools import chain, tee

def pairwise(iterable):
    # pairwise('ABCDEFG') --> AB BC CD DE EF FG
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)
Dump the value column into a list, wrangle it, and recreate the DataFrame:
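out = df.value.tolist()
out = map(pairwise, out)          # consecutive pairs within each row
out = chain.from_iterable(out)    # flatten the pairs across rows
out = map(list, out)              # tuples -> lists
out = [[ent] for ent in out]      # wrap each pair so it fills a single cell
pd.DataFrame(out, columns=['value'])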
value
0 [Red, Blue]
1 [Blue, Yellow]
2 [Blue, Yellow]
3 [Yellow, Orange]
4 [Green, Purple]
5 [Purple, Yellow]
6 [Yellow, Red]
7 [Violet, Blue]
8 [Blue, Green]
9 [Green, Red]
10 [Red, Brown]
11 [Blue, Green]
More often than not, when you are wrangling native Python structures, you are likely to get better performance by working in vanilla Python rather than in pandas.
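For completeness, here is a minimal sketch that picks the built-in pairwise when it is available (Python 3.10 and later) and falls back to the recipe otherwise. The list values is a stand-in for df.value.tolist(), so the snippet runs without pandas.

```python
from itertools import chain, tee

try:
    from itertools import pairwise  # built in from Python 3.10
except ImportError:
    def pairwise(iterable):
        # Fallback recipe: pairwise('ABCD') --> AB BC CD
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)

# Stand-in for df.value.tolist(), using the data from the question.
values = [['Red', 'Blue', 'Yellow'],
          ['Blue', 'Yellow', 'Orange'],
          ['Green', 'Purple', 'Yellow', 'Red'],
          ['Violet', 'Blue', 'Green', 'Red', 'Brown'],
          ['Blue', 'Green']]

# Pair up neighbours within each row, then flatten across rows.
pairs = [list(p) for p in chain.from_iterable(map(pairwise, values))]
```

The resulting pairs list can then be handed to the DataFrame constructor, for example as pd.DataFrame({'value': pairs}).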