Say we have this df:
df = pd.DataFrame({
'string': ['blue',
'blue,red',
'red',
'purple',
'blue,red,green,yellow,magenta,purple',
'',
'yellow,purple']})
Say we want to split these strings at each comma and put them in new columns. As found here (How can I separate text into multiple values in a CSV file using Python?) and here (Split one column into multiple columns by multiple delimiters in Pandas), we can use str.split
so:
df[['blue', 'red', 'green', 'yellow', 'magenta', 'purple']] = df['string'].str.split(',',expand=True)
df
Result:
string blue red green yellow magenta purple
0 blue blue None None None None None
1 blue,red blue red None None None None
2 red red None None None None None
3 purple purple None None None None None
4 blue,red,green,yellow,magenta,purple blue red green yellow magenta purple
5 None None None None None
6 yellow,purple yellow purple None None None None
Indices 0, 1, and 4 work as I want, with the colors correctly categorised in the new columns.
For the other indices, colors are misclassified. Note the examples linked above do not solve my issue since they have no missing data points/levels. How do I fix this? (And also, add a 'None' for index 5, column 'blue'?)
Many thanks
CodePudding user response:
df1 = df.join(df['string'].str.get_dummies(','))
print (df1)
string blue green magenta purple red \
0 blue 1 0 0 0 0
1 blue,red 1 0 0 0 1
2 red 0 0 0 0 1
3 purple 0 0 0 1 0
4 blue,red,green,yellow,magenta,purple 1 1 1 1 1
5 0 0 0 0 0
6 yellow,purple 0 0 0 1 0
yellow
0 0
1 0
2 0
3 0
4 1
5 0
6 1
If need also values instead 1
and None
instead 0
add:
df1 = df['string'].str.get_dummies(',')
df = df.join(pd.DataFrame(np.where(df1, df1.columns.to_series(),None),
index=df1.index,
columns=df1.columns))
print (df)
string blue green magenta purple red \
0 blue blue None None None None
1 blue,red blue None None None red
2 red None None None None red
3 purple None None None purple None
4 blue,red,green,yellow,magenta,purple blue green magenta purple red
5 None None None None None
6 yellow,purple None None None purple None
yellow
0 None
1 None
2 None
3 None
4 yellow
5 None
6 yellow
CodePudding user response:
I would do it following way
import pandas as pd
df = pd.DataFrame({
'string': ['blue',
'blue,red',
'red',
'purple',
'blue,red,green,yellow,magenta,purple',
'',
'yellow,purple']})
def convert_to_dict(string):
return dict((i,i) for i in string.split(",") if i)
df2 = df['string'].apply(convert_to_dict).apply(pd.Series)
finaldf = pd.concat([df,df2],axis=1)
print(finaldf)
output
string blue red purple green yellow magenta
0 blue blue NaN NaN NaN NaN NaN
1 blue,red blue red NaN NaN NaN NaN
2 red NaN red NaN NaN NaN NaN
3 purple NaN NaN purple NaN NaN NaN
4 blue,red,green,yellow,magenta,purple blue red purple green yellow magenta
5 NaN NaN NaN NaN NaN NaN
6 yellow,purple NaN NaN purple NaN yellow NaN
Explanation: Important part is converting string to dict
so for each key-value pair key is equal value. Then I convert (single pandas.Series
holding dict
s) into pandas.DataFrame
and pandas.concat
it with original pandas.DataFrame
.
CodePudding user response:
import pandas as pd
from typing import List
from pandas import DataFrame
# your input
df = pd.DataFrame({
'string': ['blue',
'blue,red',
'red',
'purple',
'blue,red,green,yellow,magenta,purple',
'',
'yellow,purple']})
# define columns
input_col = "string"
encoded_col = "string_enriched"
# unique colors
unique_colors: List[str] = sorted([x for x in set(sum([x.split(",") for x in df[input_col].tolist()], [])) if x])
# create column with unique colors encoded
df[encoded_col] = df[input_col].apply(lambda x: (1 if w in x.split(",") else 0 for w in unique_colors))
# enrich encoded column swapping 0,1 to colors
df[encoded_col] = df.apply(lambda row: (w[1] if w[0] else "" for w in zip(row[encoded_col], unique_colors)), axis=1)
# explode column into multiple columns
dx = pd.DataFrame(df[encoded_col].to_list(), columns = unique_colors)
# concat two dataframes horizontaly
df_out: DataFrame = pd.concat([df[input_col], dx], axis=1)