Split the column of dataframe into multiple columns according to the text length of column on Python-CodePudding

How can I separate a column of pandas dataframe into multiple columns based on the size of each text length? Assume that chunk size will be 3 and the sample dataframe is:

id body
1 abcdefgh
2 xyzk

For this case, I would like to get:

id body1 body2 body3
1  abc  def gh
2  xyz  k

I am assuming that I should be able to handle it with something like : df[['body1','body2', 'body3']] = df['body'].str.split(...

Any suggestions?

CodePudding user response：

You can do the following:

new_values = df['body'].str.findall('.{1,3}')
new_columns = [f'body{num}' for num in range(1, new_values.apply(len).max()  1)]
new_df = pd.DataFrame(data=new_values.tolist(), columns=new_columns)

You can also define your regex pattern based on the maximum number of characters you want on each column:

max_char_per_column = 3
regex_pattern = f".{{1,{max_char_per_column}}}"
new_values = df['body'].str.findall(regex_pattern)

If you don't want the None, feel free to .fillna("") your new_df.

See this answer for splitting a string with regex every nth character Split string every nth character?.

CodePudding user response：

First, define a split_chunk function

def split_chunk(txt, n=3):
    return [txt[i:i n] for i in range(0, len(txt), n)]

Then create a new dataframe from body using apply

>>> df2 = pd.DataFrame(df.body.apply(split_chunk).to_list())
>>> df2
     0    1     2
0  abc  def    gh
1  xyz    k  None

You can replace the None values, and rename the columns with the following

>>> df2 = df2.fillna("").rename(columns=lambda x: f"body{x 1}")
>>> df2
  body1 body2 body3
0   abc   def    gh
1   xyz     k

Finaly, restore the index

>>> df2.index = df.id
>>> df2
   body1 body2 body3
id
1    abc   def    gh
2    xyz     k

Shorter version

df = df.set_index("id")

df = pd.DataFrame(
         df.body.apply(split_chunk).to_list(), 
         index=df.index
     ).fillna("").rename(columns=lambda x: f"body{x 1}")

CodePudding user response：

Try this:

import pandas as pd
df = pd.DataFrame({"body": ["abcdefgh","xyzk"]})
df['body1'] = df['body'].astype(str).str[0:3]
df['body2'] = df['body'].astype(str).str[3:6]
df['body3'] = df['body'].astype(str).str[6:9]
df.drop('body',axis=1,inplace=True)
print(df)