I am trying to concatenate rows of data from this source
https://raw.githubusercontent.com/dherman/wc-demo/master/data/shakespeare-plays.csv
There are lines from the same speaker but they're broken into different rows in the dataframe. I'm trying to concatenate those speaker blocks into one row instead of 2 rows.
Here's what I've tried but it doesn't work. I'm still learning pandas and python.
url = (r'https://raw.githubusercontent.com/dherman/wc-demo/master/data/shakespeare-plays.csv')
data = pd.read_csv(url, on_bad_lines='skip')
data.drop('I', inplace=True, axis=1)
data.drop('I.1', inplace=True, axis=1)
data.rename(columns={'In delivering my son from me, I bury a second husband.': 'Text', 'COUNTESS': 'Speaker'}, inplace=True)
data = data.groupby(['Speaker'])['Text'].apply(' '.join).reset_index()
CodePudding user response:
A little ugly, but you can group by consecutive values using a helper series based on shift()
and cumsum()
. Then aggregating in the group by:
df = pd.read_csv('https://raw.githubusercontent.com/dherman/wc-demo/master/data/shakespeare-plays.csv',on_bad_lines='skip', names=['act','scene','char','line'])
g = df['char'].ne(df['char'].shift()).cumsum().rename('speakernumber')
df = df.groupby(g).agg({'act':'first', 'scene':'first', 'char':'first', 'line': ' '.join}).reset_index()
I believe this will work across acts and scenes as well althought I didn't dig in deep enough to test.