Python: Splitting a Column into concatenated rows based on specific Values

Time:12-23

I am sure someone has asked a question like this before but my current attempts to search have not yielded a solution.

I have a column of text values, for example:

import pandas as pd
df2 = pd.DataFrame({'text':['a','bb','cc','4','m','...']})
print(df2)

  text
0    a
1   bb
2   cc
3    4
4    m
5  ...

The 'text' column contains strings, ints, floats, and NaN values.

I am trying to combine (with a space [' '] between each value) all the text values that fall between each number (int/float) in the text column, ignoring NaN values, and make each concatenated set a separate row.

What would be the most efficient way to accomplish this?

I thought I could read all the values into one string, strip the NaNs, then split it whenever a number is encountered, but this seems highly inefficient.

Thank you for your help!

edit:

desired sample output

     text
0    'a bb cc'
1    'm ...'

CodePudding user response:

You can convert the column to numeric with pd.to_numeric and test for non-missing values, which gives True for numeric rows. Then keep only the non-numeric rows by passing the inverted mask ~m to DataFrame.loc, group them by the cumulative sum of the mask (Series.cumsum), and aggregate each group with ' '.join:

# drop NaNs before applying the solution
df2 = df2.dropna(subset=['text'])

m = pd.to_numeric(df2['text'], errors='coerce').notna()

df = (df2.loc[~m, 'text']
         .groupby(m.cumsum())
         .agg(' '.join)
         .reset_index(drop=True)
         .to_frame())
print(df)
      text
0  a bb cc
1    m ...
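
To see why grouping by the cumulative sum works, here is a minimal walkthrough on the sample data: each numeric row flips the mask to True, so the running sum increments there and every stretch of non-numeric rows shares one group label.

```python
import pandas as pd

df2 = pd.DataFrame({'text': ['a', 'bb', 'cc', '4', 'm', '...']})

# True where the value parses as a number ('4' does, 'a' and '...' do not)
m = pd.to_numeric(df2['text'], errors='coerce').notna()
print(m.tolist())           # [False, False, False, True, False, False]

# the cumulative sum bumps at each numeric row, labelling the groups
print(m.cumsum().tolist())  # [0, 0, 0, 1, 1, 1]
```

Rows 0-2 get label 0 and rows 4-5 get label 1, so after filtering out the numeric row itself, ' '.join per label yields 'a bb cc' and 'm ...'.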

CodePudding user response:

I would avoid pandas for this operation altogether. Instead, use the third-party library more_itertools - namely, its split_at() function:

import more_itertools as mit

def test(x):
    # True if x parses as a number (including NaN)
    try:
        float(x)
        return True
    except (ValueError, TypeError):
        return False

result = [" ".join(x) for x in mit.split_at(df2['text'].dropna(), test)]
# ['a bb cc', 'm ...']
df3 = pd.DataFrame(result, columns=['text',])
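
One caveat worth noting (my own addition, not part of the original answer): split_at() emits an empty group between two consecutive separators, so adjacent numeric values in the column would produce empty-string rows. A sketch that filters those out:

```python
import pandas as pd
import more_itertools as mit

def is_number(x):
    # True if x parses as a number
    try:
        float(x)
        return True
    except (ValueError, TypeError):
        return False

# '1' and '2' are back-to-back separators, yielding an empty group
s = pd.Series(['a', 'bb', '1', '2', 'm', '...'])
groups = [" ".join(g) for g in mit.split_at(s.dropna(), is_number) if g]
print(groups)  # ['a bb', 'm ...']
```

Without the `if g` guard, the list would contain an extra '' between the two groups.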

P.S. On a dataframe of 13,000 rows with an average group length of 10, this solution is about twice as fast as the pandas solution proposed by jezrael (0.00087 sec vs 0.00156 sec). Not a huge difference, admittedly.
