I am sure someone has asked a question like this before but my current attempts to search have not yielded a solution.
I have a column of text values, for example:
import pandas as pd
df2 = pd.DataFrame({'text':['a','bb','cc','4','m','...']})
print(df2)
text
0 a
1 bb
2 cc
3 4
4 m
5 ...
The column in 'text' is comprised of strings, ints, floats, and nan type data.
I am trying to combine (with a space [' '] between each text value) all the text values in-between each number (int/float) in the text column, ignoring Nan values, making each concatenated set a separate row.
What would be the most efficient way to accomplish this?
I thought to possibly read all values into a string, strip the Nan's, then split this successively if a number is encountered, but this seems highly inefficient.
Thank you for your help!
edit:
desired sample output
text
0 'a bb cc'
1 'm ...'
CodePudding user response:
You can convert columns to numeric and test non missing values, so get True
s for numeric rows, then filter only non numeric in inverted mask by ~
in DataFrame.loc
and aggregate by cumulative sum with mask by Series.cumsum
with aggregate join
:
#for remove NaNs before solution
df2 = df2.dropna(subset=['text'])
m = pd.to_numeric(df2['text'], errors='coerce').notna()
df = df2.loc[~m, 'text'].groupby(m.cumsum()).agg(' '.join).reset_index(drop=True).to_frame()
print (df)
text
0 a bb cc
1 m ...
CodePudding user response:
I would avoid pandas for this operation altogether. Instead, use the library module more_itertools
- namely, the split_at()
function:
import more_itertools as mit
def test(x): # Test if X is a number of some sort or a nan
try: float(x); return True
except: return False
result = [" ".join(x) for x in mit.split_at(df2['text'].dropna(), test)]
# ['a bb cc', 'm ...']
df3 = pd.DataFrame(result, columns=['text',])
P.S. On a dataframe of 13,000 rows with an average group length of 10, this solution is 2 times faster than the pandas solution proposed by jezrael (0.00087 sec vs 0.00156 sec). Not a huge difference, indeed.