I have a pandas DataFrame in which one of the columns holds a list of strings in each row, as shown below.
|column1 |
|:--------------------------------------------------------|
|['abc<t>def<t>ghi', 'jkl<t>mno<t>pqr'] |
|['abc<t>def<t>ghi', 'jkl<t>mno<t>pqr', 'def<t>pqr<t>jkl']|
|['ghi<t>jkl<t>pqr'] |
I need to split each string so that the column becomes a list of lists, with output like the table below.
|column2 |
|:-------------------------------------------------------------------|
|[['abc', 'def', 'ghi'], ['jkl', 'mno', 'pqr']] |
|[['abc', 'def', 'ghi'], ['jkl', 'mno', 'pqr'], ['def', 'pqr', 'jkl']]|
|[['ghi', 'jkl', 'pqr']] |
I have tried using split as shown below, but this returns NaN for every row:
dataset["column1"].str.split("<t>")
CodePudding user response:
Solution:
dataset['column1'].apply(lambda x: [i.split("<t>") for i in x])
Explanation:
apply(...) applies the lambda function to each element in the series dataset['column1']. The lambda function then calls .split("<t>") on each string in that element's list.
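To make the difference concrete, here is a minimal sketch. The sample frame is made up to mirror the question's data, and it assumes the cells really are Python lists rather than strings. The .str accessor only works on string cells, which is why the original attempt yields NaN, while apply hands each list to the lambda.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's column1
dataset = pd.DataFrame({
    "column1": [
        ["abc<t>def<t>ghi", "jkl<t>mno<t>pqr"],
        ["ghi<t>jkl<t>pqr"],
    ]
})

# Original attempt: each cell is a list, not a string, so every row becomes NaN
attempt = dataset["column1"].str.split("<t>")

# apply passes each list to the lambda, which splits every string inside it
dataset["column2"] = dataset["column1"].apply(
    lambda items: [s.split("<t>") for s in items]
)

print(attempt.isna().all())
print(dataset["column2"].tolist())
```

The result is assigned to a new column here; apply does not modify column1 in place.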
CodePudding user response:
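You can try the code below.
It splits each string on <t> and then converts each resulting list to a string with str(), which is what produces the apostrophes in the printed output.
The format function does both steps. Note that because of str(), each inner item is a string such as "['abc', 'def', 'ghi']" rather than a real list, so drop the str() call if you need actual nested lists.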
import pandas as pd
df = pd.DataFrame([[['abc<t>def<t>ghi', 'jkl<t>mno<t>pqr']], [['abc<t>def<t>ghi', 'jkl<t>mno<t>pqr', 'def<t>pqr<t>jkl']]], columns=['name'])
def format(value):
    # Split each string on '<t>', then convert each resulting list to its string form
    return [str(item) for item in [i.split('<t>') for i in value]]
df['new_name'] = df['name'].apply(lambda x: format(x) )
print(df)
Output:
name new_name
0 [abc<t>def<t>ghi, jkl<t>mno<t>pqr] [['abc', 'def', 'ghi'], ['jkl', 'mno', 'pqr']]
1 [abc<t>def<t>ghi, jkl<t>mno<t>pqr, def<t>pqr<t... [['abc', 'def', 'ghi'], ['jkl', 'mno', 'pqr'],...
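The printed output above can be misleading, because pandas displays strings inside a list without quotes. The sketch below, which uses a one-row frame made up for illustration, compares the str() version with a plain split so the element types can be checked directly.

```python
import pandas as pd

# Hypothetical one-row frame in the same shape as the answer's example
df = pd.DataFrame({"name": [["abc<t>def<t>ghi", "jkl<t>mno<t>pqr"]]})

# Version with str(): each inner item is the string form of a list
as_strings = df["name"].apply(lambda v: [str(i.split("<t>")) for i in v])

# Version without str(): each inner item is a real list
as_lists = df["name"].apply(lambda v: [i.split("<t>") for i in v])

first_str = as_strings.iloc[0][0]
first_list = as_lists.iloc[0][0]

print(type(first_str).__name__, repr(first_str))
print(type(first_list).__name__, first_list)
```

If later code needs to index into the inner items (for example, take the second token of each entry), the version without str() is likely the one you want.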