Map values from 2 separate cells in a row using Python-CodePudding

I require some guidance on how to extract multi-values from one cell to match to another cell within the same row. This will be applied to the whole dataframe and a new dataframe is produced.

Source data

A sample of my source data is below. I can't change the source as it is available in such format and I will have it given to me on a monthly basis.

Data source: 
data = {'C1': ['Alpha, Charlie, Bravo','Beta, Gamma','Foxtrot, Delta'],
        'C2': ['Alpha, San Francisco, US ; Charlie, New York, US ; 
              Bravo, London, UK', 'Beta, Singapore, Singapore ; 
              Gamma, Tokyo, Japan','[Foxtrot; Delta], Seoul, South Korea'],}

df = pd.DataFrame(data)

C1	C2
Alpha, Charlie, Bravo	Alpha, San Francisco, US ; Charlie, New York, US ; Bravo, London, UK
Beta, Gamma	Beta, Singapore, Singapore ; Gamma, Tokyo, Japan
Foxtrot, Delta	[Foxtrot; Delta], Seoul, South Korea

Intended outcome as new dataframe

I wish to achieve the following result in a new dataframe:

data2 = {'C1': ['Alpha', 'Charlie', 'Bravo','Beta', 'Gamma','Foxtrot', 'Delta'],
        'C2': ['Alpha, San Francisco, US', 'Charlie, New York, US', 
               'Bravo, London, UK', 'Beta, Singapore, Singapore', 
               'Gamma, Tokyo, Japan','[Foxtrot; Delta], Seoul, South Korea',
               '[Foxtrot; Delta], Seoul, South Korea'],}

new_df = pd.DataFrame(data2)

N1	N2
Alpha	Alpha, San Francisco, US
Charlie	Charlie, New York, US
Bravo	Bravo, London, UK
Beta	Beta, Singapore, Singapore
Gamma	Gamma, Tokyo, Japan
Foxtrot	[Foxtrot; Delta], Seoul, South Korea
Delta	[Foxtrot; Delta], Seoul, South Korea

My actual result (failed attempt)

I have tried nested for loops where C1 will be the outer loop and C3 will be the inner loop. I get the following results:

data3 = {'C1': ['Alpha', 'Alpha', 'Alpha', 'Charlie', 'Charlie', 
                'Bravo','Beta', 'Beta', 'Gamma','Foxtrot', 'Delta'],
        'C2': ['Alpha, San Francisco, US', 'Alpha, New York, US', 'Alpha, London, UK', 
               'Charlie, New York, US', 'Charlie, London, UK', 'Bravo, London, UK',
               'Beta, Singapore, Singapore', 'Beta, Tokyo, Japan', 
               'Gamma, Tokyo, Japan',
               [Foxtrot; Delta], Seoul, South Korea,
               [Foxtrot; Delta], Seoul, South Korea],}

result_df = pd.DataFrame(data3)

N1	N2
Alpha	Alpha, San Francisco, US
Alpha	Charlie, New York, US
Alpha	Bravo, London, UK
Charlie	Charlie, New York, US
Charlie	Charlie, London, UK
Bravo	Bravo, London, UK
Beta	Beta, Singapore, Singapore
Beta	Gamma, Tokyo, Japan
Gamma	Gamma, Tokyo, Japan
Foxtrot	[Foxtrot; Delta], Seoul, South Korea
Delta	[Foxtrot; Delta], Seoul, South Korea

The script that I used

The script I have is below.

new_list = []
for idx, vals in enumerate(df['C1']):
    vals = vals.split('; ')
    locs = (df['C2'].values[idx]).split(' ; ')
    for val in vals:
        for loc in locs:
            new_list.append((idx, val, loc))

new_df = pd.DataFrame(data=new_list, columns=['N1', 'N2'])

Any help will really be much appreciated. Thanks.

CodePudding user response：

There are two quirks in your data.

The presence of a semi-colon inside the square bracket in some of the rows, which makes splitting my ; a bad idea
Unequal number of values to map the two columns

Try this:

import re
def replace_semi_colon(s):
    return re.sub(r'\[(.*?)\]', lambda x: '['   x.group(1).replace(';', ':')   ']', s)

# temporarily replace the semi-colon inside square brackets with colon
df['C2'] = df['C2'].apply(lambda x: replace_semi_colon(x))

# now split the values 
df['C1'] = df['C1'].str.split(',')
df['C2'] = df['C2'].str.split(';')

# make the number of values in each column equal
df['C2'] = df.apply(lambda x : x['C2'] * len(x['C1']) if len(x['C1'])>len(x['C2']) else x['C2'] , axis =1)
df = df.explode(['C1','C2'])

# replace the ':' back  to  ';'
df['C2'] = df['C2'].apply(lambda x: x.replace(':',';'))

The result it gives:


    C1      C2
0   Alpha   Alpha, San Francisco, US
0   Charlie Charlie, New York, US
0   Bravo   Bravo, London, UK
1   Beta    Beta, Singapore, Singapore
1   Gamma   Gamma, Tokyo, Japan
2   Foxtrot [Foxtrot; Delta], Seoul, South Korea
2   Delta   [Foxtrot; Delta], Seoul, South Korea