Current dataframe is as follows:
df = pd.read_csv('filename.csv', delimiter=',')
print(df)
idx uniqueID String
0 1 'hello'
1 1 'goodbye'
2 1 'yo'
4 2 'hello'
5 2 'sup'
6 3 'howdy'
7 3 'hello'
8 3 'goodbye'
Expected Output:
idx uniqueID String
0 1 'greeting'
2 1 'yo'
4 2 'hello'
5 2 'sup'
6 3 'howdy'
7 3 'greeting'
Effectively, if there is a consecutive pattern of 'hello' then 'goodbye' within the same uniqueID, they get merged to a single row and become a different string (which in this case I allocated to be 'greeting').
Question: How do I combine two rows to one, only for specific values, with an updated value I want?
I'm thinking use of .loc would be ideal here, but unsure how to go about it. Thank you!
CodePudding user response:
In your case do
idx = df.groupby('uniqueID')['String'].apply(lambda x : x.shift().eq('hello')&x.eq('goodbye'))
out = df.drop(idx.index[idx],axis=0)
out.loc[idx.index[idx]-1,'String'] = 'greeting'
out
Out[79]:
idx uniqueID String
0 0 1 greeting
2 2 1 yo
3 4 2 hello
4 5 2 sup
5 6 3 howdy
6 7 3 greeting
CodePudding user response:
here is one more way do it,
#using mask check if the consecutive values are hello and goodbye, and change the first value to greetings
df['String']=df['String'].mask(df['uniqueID'].eq(df['uniqueID'].shift(-1)) &
df['String'].str[1:-1].eq('hello') &
df['String'].shift(-1).str[1:-1].eq('goodbye'),'greetings')
df
# identify the row above the greetings and drop it
df.drop(df.loc[df['String'].shift().eq('greetings') ].index)
idx uniqueID String
0 0 1 greetings
2 2 1 'yo'
3 4 2 'hello'
4 5 2 'sup'
5 6 3 'howdy'
6 7 3 greetings