I have the following dataframe named marketing where i would like to extract out source= from the values. Is there a way to create a general regex function so that i can apply on other columns as well to extract words after equal sign?
Data
source=book,social_media=facebook,ads=Facebook
source=book,ads=Facebook,customer=2
cost=2, customer=3
Im using python and i have tried the following:
df = pd.DataFrame()
def find_keywords(row_string):
tags = [x for x in row_string if x.startswith('source=')]
return tags
df['Data'] = marketing['Data'].apply(lambda row : find_keywords(row))
May i know whether there is a more efficient way to extract and place into columns:
source social_media ads customer costs
book facebook facebook - -
book - facebook 2 -
CodePudding user response:
You can split the column value of string type into dict then use pd.json_normalize
to convert dict to columns.
out = pd.json_normalize(marketing['Data'].apply(lambda x: dict([map(str.strip, i.split('=')) for i in x.split(',')]))).dropna(subset='source')
print(out)
source social_media ads customer cost
0 book facebook Facebook NaN NaN
1 book NaN Facebook 2 NaN
CodePudding user response:
Here's another option:
Sample dataframe marketing
is:
marketing = pd.DataFrame(
{"Data": ["source=book,social_media=facebook,ads=Facebook",
"source=book,ads=Facebook,customer=2",
"cost=2, customer=3"]}
)
Data
0 source=book,social_media=facebook,ads=Facebook
1 source=book,ads=Facebook,customer=2
2 cost=2, customer=3
Now this
result = (marketing["Data"].str.split(r"\s*,\s*").explode().str.strip()
.str.split(r"\s*=\s*", expand=True).pivot(columns=0))
does produce
1
0 ads cost customer social_media source
0 Facebook NaN NaN facebook book
1 Facebook NaN 2 NaN book
2 NaN 2 3 NaN NaN
which is almost what you're looking for, except for the extra column level and the column ordering. So the following modification
result = (marketing["Data"].str.split(r"\s*,\s*").explode().str.strip()
.str.split(r"\s*=\s*", expand=True).rename(columns={0: "columns"})
.pivot(columns="columns").droplevel(level=0, axis=1))
result = result[["source", "social_media", "ads", "customer", "cost"]]
should fix that:
columns source social_media ads customer cost
0 book facebook Facebook NaN NaN
1 book NaN Facebook 2 NaN
2 NaN NaN NaN 3 2