Split a dataframe column into multiple columns according to text length in Python

Time:11-03

How can I split a column of a pandas dataframe into multiple columns, where each new column holds a fixed-length chunk of the text? Assume the chunk size is 3 and the sample dataframe is:

id body
1 abcdefgh
2 xyzk

For this case, I would like to get:

id body1 body2 body3
1  abc  def gh
2  xyz  k

I assume I should be able to handle it with something like: df[['body1', 'body2', 'body3']] = df['body'].str.split(...

Any suggestions?

CodePudding user response:

You can do the following:

new_values = df['body'].str.findall('.{1,3}')
new_columns = [f'body{num}' for num in range(1, new_values.apply(len).max() + 1)]
new_df = pd.DataFrame(data=new_values.tolist(), columns=new_columns)

You can also define your regex pattern based on the maximum number of characters you want on each column:

max_char_per_column = 3
regex_pattern = f".{{1,{max_char_per_column}}}"
new_values = df['body'].str.findall(regex_pattern)

Rows with fewer chunks are padded with None. If you don't want the None values, call new_df.fillna("") to replace them with empty strings.
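Putting the pieces above together, here is a minimal end-to-end sketch using the sample data from the question. The names chunk_size, chunks, n_cols and result are only illustrative, and attaching the new columns back to id with pd.concat is one possible choice, not something the answer prescribes.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"id": [1, 2], "body": ["abcdefgh", "xyzk"]})

chunk_size = 3  # maximum characters per new column

# Each row becomes a list of chunks of up to chunk_size characters
chunks = df["body"].str.findall(f".{{1,{chunk_size}}}")

# One column per chunk position; shorter rows are padded, then filled with ""
n_cols = chunks.str.len().max()
new_df = pd.DataFrame(
    chunks.tolist(),
    columns=[f"body{i}" for i in range(1, n_cols + 1)],
    index=df.index,
).fillna("")

# Put the new columns next to id, leaving out the original body column
result = pd.concat([df[["id"]], new_df], axis=1)
print(result)
```

Passing index=df.index keeps the new rows aligned with the original ones even if df does not use the default 0..n-1 index.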

For more on splitting a string every nth character with a regex, see the question "Split string every nth character?".
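One caveat with the regex approach: by default "." does not match a newline, so text containing line breaks loses those characters and the chunk boundaries shift. If that matters for your data, passing flags=re.DOTALL to str.findall should keep every character. A small sketch with a made-up string:

```python
import re
import pandas as pd

# Illustrative value containing a newline
s = pd.Series(["ab\ncdef"])

# Default: "." skips the newline, so it is dropped from the chunks
plain = s.str.findall(".{1,3}")

# With re.DOTALL, "." also matches "\n" and every character is kept
dotall = s.str.findall(".{1,3}", flags=re.DOTALL)

print(plain[0])
print(dotall[0])
```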

CodePudding user response:

First, define a split_chunk function

def split_chunk(txt, n=3):
    return [txt[i:i + n] for i in range(0, len(txt), n)]

Then create a new dataframe from body using apply

>>> df2 = pd.DataFrame(df.body.apply(split_chunk).to_list())
>>> df2
     0    1     2
0  abc  def    gh
1  xyz    k  None

You can replace the None values, and rename the columns with the following

>>> df2 = df2.fillna("").rename(columns=lambda x: f"body{x + 1}")
>>> df2
  body1 body2 body3
0   abc   def    gh
1   xyz     k

Finally, restore the index

>>> df2.index = df.id
>>> df2
   body1 body2 body3
id
1    abc   def    gh
2    xyz     k

Shorter version

df = df.set_index("id")

df = pd.DataFrame(
         df.body.apply(split_chunk).to_list(), 
         index=df.index
     ).fillna("").rename(columns=lambda x: f"body{x + 1}")

CodePudding user response:

Try this:

import pandas as pd
df = pd.DataFrame({"body": ["abcdefgh","xyzk"]})
df['body1'] = df['body'].astype(str).str[0:3]
df['body2'] = df['body'].astype(str).str[3:6]
df['body3'] = df['body'].astype(str).str[6:9]
df.drop('body',axis=1,inplace=True)
print(df)
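The snippet above hard-codes three columns, so it only fits strings of up to 9 characters. A possible generalization, sketched here under the assumption that you want as many columns as the longest string needs, computes the column count from the data and slices in a loop. The names chunk_size, n_cols and start are illustrative.

```python
import math
import pandas as pd

df = pd.DataFrame({"body": ["abcdefgh", "xyzk"]})

chunk_size = 3
body = df["body"].astype(str)

# Number of columns needed to hold the longest string
n_cols = math.ceil(body.str.len().max() / chunk_size)

for i in range(n_cols):
    start = i * chunk_size
    # Slicing past the end of a string yields "", not None
    df[f"body{i + 1}"] = body.str[start:start + chunk_size]

df = df.drop(columns="body")
print(df)
```

Because string slicing returns an empty string when the range is out of bounds, no fillna step is needed with this approach.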