My question is similar to this one and this one but I can't get their solutions to work for my problem.
I have a dataframe that looks like this:
study_id fuzzy_market
0 study1 [Age: 18-67], [Country of Birth: Austria, Germany], [Country: Austria, Germany], [Language: German]
1 study2 [Country: Germany], [Management experience: Yes]
2 study3 [Country: United Kingdom], [Language: English]
3 study4 [Age: 18-67], [Country of Birth: Austria, Germany], [Country: Austria, Germany], [Language: German]
4 study5 [Age: 48-99]
I'd like it to look like this:
study_id | Age | Country of Birth | Country | Language | Management experience |
---|---|---|---|---|---|
study1 | 18-67 | Austria, Germany | Austria, Germany | German | None |
study2 | None | None | Germany | None | Yes |
study3 | None | None | United Kingdom | English | None |
study4 | 18-67 | Austria, Germany | Austria, Germany | German | None |
study5 | 48-99 | None | None | None | None |
So one row per study_id
, the text before each colon in the fuzzy_market
column as the column title, and the text after each colon as the data in the cell. Where there is no relevant data for a column, I'd like to fill it with None
. All the columns can be strings. I don't know how many columns there will be, so I need this to be dynamic.
Here's the setup and data:
import pandas as pd
import numpy as np
import re
np.random.seed(12345)
df = pd.DataFrame.from_dict({'study_id': {0: 'study1',
1: 'study2',
2: 'study3',
3: 'study4',
4: 'study5'},
'fuzzy_market': {0: '[Age: 18-67], [Country of Birth: Austria, Germany], [Country: Austria, Germany], [Language: German]',
1: '[Country: Germany], [Management experience: Yes]',
2: '[Country: United Kingdom], [Language: English]',
3: '[Age: 18-67], [Country of Birth: Austria, Germany], [Country: Austria, Germany], [Language: German]',
4: '[Age: 48-99]'}})
So far I have tried manipulating the strings in the fuzzy_markets
column, but I don't think this approach is correct.
# a function to strip the square brackets, as I'm not sure this is really a list in here
def remove_square_brackets(x):
return re.sub(r"[\[\]]", "", x)
# make a new dataframe where there are new columns for data after every comma
df2 = df.join(df['fuzzy_market'].apply(remove_square_brackets).str.split(',', expand=True))
# rename the columns arbitrarily - these will need to be the question titles eventually e.g. Age rather than A, Country of Birth rather than B etc.
df2.columns = ('study_id', 'fuzzy_market', 'A', 'B', 'C', 'D', 'E', 'F')
# try and split again
df3 = df2[['study_id','A', 'B']].join(df2['A'].str.split(":", expand=True).rename(columns={0:'A1', 1:'A2'})).join(df2['B'].str.split(":", expand=True).rename(columns={0:'B1', 1:'B2'}))
# this isn't quite there yet
df3
study_id A B A1 A2 B1 B2
0 study1 Age: 18-67 Country of Birth: Austria Age 18-67 Country of Birth Austria
1 study2 Country: Germany Management experience: Yes Country Germany Management experience Yes
2 study3 Country: United Kingdom Language: English Country United Kingdom Language English
3 study4 Age: 18-67 Country of Birth: Austria Age 18-67 Country of Birth Austria
4 study5 Age: 48-99 None Age 48-99 None None
Thanks for any help or hints!
CodePudding user response:
We can use findall
to extract all the matching key-value pairs from each row, then map
these pairs to to dict
and create a dataframe
p = df['fuzzy_market'].str.findall(r'([^:\[] ): ([^\]] )')
df[['study_id']].join(pd.DataFrame(map(dict, p)))
study_id Age Country of Birth Country Language Management experience
0 study1 18-67 Austria, Germany Austria, Germany German NaN
1 study2 NaN NaN Germany NaN Yes
2 study3 NaN NaN United Kingdom English NaN
3 study4 18-67 Austria, Germany Austria, Germany German NaN
4 study5 48-99 NaN NaN NaN NaN