Home > Back-end >  Concatenate rows based on name with pandas dataframe
Concatenate rows based on name with pandas dataframe

Time:09-21

I am trying to concatenate rows of data from this source

https://raw.githubusercontent.com/dherman/wc-demo/master/data/shakespeare-plays.csv

There are lines from the same speaker but they're broken into different rows in the dataframe. I'm trying to concatenate those speaker blocks into one row instead of 2 rows.

Here's what I've tried but it doesn't work. I'm still learning pandas and python.

url = (r'https://raw.githubusercontent.com/dherman/wc-demo/master/data/shakespeare-plays.csv')
data = pd.read_csv(url, on_bad_lines='skip')
data.drop('I', inplace=True, axis=1)
data.drop('I.1', inplace=True, axis=1)
data.rename(columns={'In delivering my son from me, I bury a second husband.': 'Text', 'COUNTESS': 'Speaker'}, inplace=True)
data = data.groupby(['Speaker'])['Text'].apply(' '.join).reset_index()

CodePudding user response:

A little ugly, but you can group by consecutive values using a helper series based on shift() and cumsum(). Then aggregating in the group by:

df = pd.read_csv('https://raw.githubusercontent.com/dherman/wc-demo/master/data/shakespeare-plays.csv',on_bad_lines='skip', names=['act','scene','char','line'])
g = df['char'].ne(df['char'].shift()).cumsum().rename('speakernumber')
df = df.groupby(g).agg({'act':'first', 'scene':'first', 'char':'first', 'line': ' '.join}).reset_index()

I believe this will work across acts and scenes as well althought I didn't dig in deep enough to test.

  • Related