How can I separate a column of pandas dataframe into multiple columns based on the size of each text length? Assume that chunk size will be 3 and the sample dataframe is:
id body
1 abcdefgh
2 xyzk
For this case, I would like to get:
id body1 body2 body3
1 abc def gh
2 xyz k
I am assuming that I should be able to handle it with something like : df[['body1','body2', 'body3']] = df['body'].str.split(...
Any suggestions?
CodePudding user response:
You can do the following:
new_values = df['body'].str.findall('.{1,3}')
new_columns = [f'body{num}' for num in range(1, new_values.apply(len).max() 1)]
new_df = pd.DataFrame(data=new_values.tolist(), columns=new_columns)
You can also define your regex pattern based on the maximum number of characters you want on each column:
max_char_per_column = 3
regex_pattern = f".{{1,{max_char_per_column}}}"
new_values = df['body'].str.findall(regex_pattern)
If you don't want the None, feel free to .fillna("")
your new_df
.
See this answer for splitting a string with regex every nth character Split string every nth character?.
CodePudding user response:
First, define a split_chunk
function
def split_chunk(txt, n=3):
return [txt[i:i n] for i in range(0, len(txt), n)]
Then create a new dataframe from body
using apply
>>> df2 = pd.DataFrame(df.body.apply(split_chunk).to_list())
>>> df2
0 1 2
0 abc def gh
1 xyz k None
You can replace the None values, and rename the columns with the following
>>> df2 = df2.fillna("").rename(columns=lambda x: f"body{x 1}")
>>> df2
body1 body2 body3
0 abc def gh
1 xyz k
Finaly, restore the index
>>> df2.index = df.id
>>> df2
body1 body2 body3
id
1 abc def gh
2 xyz k
Shorter version
df = df.set_index("id")
df = pd.DataFrame(
df.body.apply(split_chunk).to_list(),
index=df.index
).fillna("").rename(columns=lambda x: f"body{x 1}")
CodePudding user response:
Try this:
import pandas as pd
df = pd.DataFrame({"body": ["abcdefgh","xyzk"]})
df['body1'] = df['body'].astype(str).str[0:3]
df['body2'] = df['body'].astype(str).str[3:6]
df['body3'] = df['body'].astype(str).str[6:9]
df.drop('body',axis=1,inplace=True)
print(df)