I require some guidance on how to extract multi-values from one cell to match to another cell within the same row. This will be applied to the whole dataframe and a new dataframe is produced.
Source data
A sample of my source data is below. I can't change the source as it is available in such format and I will have it given to me on a monthly basis.
Data source:
data = {'C1': ['Alpha, Charlie, Bravo','Beta, Gamma','Foxtrot, Delta'],
'C2': ['Alpha, San Francisco, US ; Charlie, New York, US ;
Bravo, London, UK', 'Beta, Singapore, Singapore ;
Gamma, Tokyo, Japan','[Foxtrot; Delta], Seoul, South Korea'],}
df = pd.DataFrame(data)
C1 | C2 |
---|---|
Alpha, Charlie, Bravo | Alpha, San Francisco, US ; Charlie, New York, US ; Bravo, London, UK |
Beta, Gamma | Beta, Singapore, Singapore ; Gamma, Tokyo, Japan |
Foxtrot, Delta | [Foxtrot; Delta], Seoul, South Korea |
Intended outcome as new dataframe
I wish to achieve the following result in a new dataframe:
data2 = {'C1': ['Alpha', 'Charlie', 'Bravo','Beta', 'Gamma','Foxtrot', 'Delta'],
'C2': ['Alpha, San Francisco, US', 'Charlie, New York, US',
'Bravo, London, UK', 'Beta, Singapore, Singapore',
'Gamma, Tokyo, Japan','[Foxtrot; Delta], Seoul, South Korea',
'[Foxtrot; Delta], Seoul, South Korea'],}
new_df = pd.DataFrame(data2)
N1 | N2 |
---|---|
Alpha | Alpha, San Francisco, US |
Charlie | Charlie, New York, US |
Bravo | Bravo, London, UK |
Beta | Beta, Singapore, Singapore |
Gamma | Gamma, Tokyo, Japan |
Foxtrot | [Foxtrot; Delta], Seoul, South Korea |
Delta | [Foxtrot; Delta], Seoul, South Korea |
My actual result (failed attempt)
I have tried nested for loops where C1 will be the outer loop and C3 will be the inner loop. I get the following results:
data3 = {'C1': ['Alpha', 'Alpha', 'Alpha', 'Charlie', 'Charlie',
'Bravo','Beta', 'Beta', 'Gamma','Foxtrot', 'Delta'],
'C2': ['Alpha, San Francisco, US', 'Alpha, New York, US', 'Alpha, London, UK',
'Charlie, New York, US', 'Charlie, London, UK', 'Bravo, London, UK',
'Beta, Singapore, Singapore', 'Beta, Tokyo, Japan',
'Gamma, Tokyo, Japan',
[Foxtrot; Delta], Seoul, South Korea,
[Foxtrot; Delta], Seoul, South Korea],}
result_df = pd.DataFrame(data3)
N1 | N2 |
---|---|
Alpha | Alpha, San Francisco, US |
Alpha | Charlie, New York, US |
Alpha | Bravo, London, UK |
Charlie | Charlie, New York, US |
Charlie | Charlie, London, UK |
Bravo | Bravo, London, UK |
Beta | Beta, Singapore, Singapore |
Beta | Gamma, Tokyo, Japan |
Gamma | Gamma, Tokyo, Japan |
Foxtrot | [Foxtrot; Delta], Seoul, South Korea |
Delta | [Foxtrot; Delta], Seoul, South Korea |
The script that I used
The script I have is below.
new_list = []
for idx, vals in enumerate(df['C1']):
vals = vals.split('; ')
locs = (df['C2'].values[idx]).split(' ; ')
for val in vals:
for loc in locs:
new_list.append((idx, val, loc))
new_df = pd.DataFrame(data=new_list, columns=['N1', 'N2'])
Any help will really be much appreciated. Thanks.
CodePudding user response:
There are two quirks in your data.
- The presence of a semi-colon inside the square bracket in some of the rows, which makes splitting my
;
a bad idea - Unequal number of values to map the two columns
Try this:
import re
def replace_semi_colon(s):
return re.sub(r'\[(.*?)\]', lambda x: '[' x.group(1).replace(';', ':') ']', s)
# temporarily replace the semi-colon inside square brackets with colon
df['C2'] = df['C2'].apply(lambda x: replace_semi_colon(x))
# now split the values
df['C1'] = df['C1'].str.split(',')
df['C2'] = df['C2'].str.split(';')
# make the number of values in each column equal
df['C2'] = df.apply(lambda x : x['C2'] * len(x['C1']) if len(x['C1'])>len(x['C2']) else x['C2'] , axis =1)
df = df.explode(['C1','C2'])
# replace the ':' back to ';'
df['C2'] = df['C2'].apply(lambda x: x.replace(':',';'))
The result it gives:
C1 C2
0 Alpha Alpha, San Francisco, US
0 Charlie Charlie, New York, US
0 Bravo Bravo, London, UK
1 Beta Beta, Singapore, Singapore
1 Gamma Gamma, Tokyo, Japan
2 Foxtrot [Foxtrot; Delta], Seoul, South Korea
2 Delta [Foxtrot; Delta], Seoul, South Korea