Home > database >  Map values from 2 separate cells in a row using Python
Map values from 2 separate cells in a row using Python

Time:10-18

I require some guidance on how to extract multi-values from one cell to match to another cell within the same row. This will be applied to the whole dataframe and a new dataframe is produced.

Source data

A sample of my source data is below. I can't change the source as it is available in such format and I will have it given to me on a monthly basis.

Data source: 
data = {'C1': ['Alpha, Charlie, Bravo','Beta, Gamma','Foxtrot, Delta'],
        'C2': ['Alpha, San Francisco, US ; Charlie, New York, US ; 
              Bravo, London, UK', 'Beta, Singapore, Singapore ; 
              Gamma, Tokyo, Japan','[Foxtrot; Delta], Seoul, South Korea'],}

df = pd.DataFrame(data)
C1 C2
Alpha, Charlie, Bravo Alpha, San Francisco, US ; Charlie, New York, US ; Bravo, London, UK
Beta, Gamma Beta, Singapore, Singapore ; Gamma, Tokyo, Japan
Foxtrot, Delta [Foxtrot; Delta], Seoul, South Korea

Intended outcome as new dataframe

I wish to achieve the following result in a new dataframe:

data2 = {'C1': ['Alpha', 'Charlie', 'Bravo','Beta', 'Gamma','Foxtrot', 'Delta'],
        'C2': ['Alpha, San Francisco, US', 'Charlie, New York, US', 
               'Bravo, London, UK', 'Beta, Singapore, Singapore', 
               'Gamma, Tokyo, Japan','[Foxtrot; Delta], Seoul, South Korea',
               '[Foxtrot; Delta], Seoul, South Korea'],}

new_df = pd.DataFrame(data2)
N1 N2
Alpha Alpha, San Francisco, US
Charlie Charlie, New York, US
Bravo Bravo, London, UK
Beta Beta, Singapore, Singapore
Gamma Gamma, Tokyo, Japan
Foxtrot [Foxtrot; Delta], Seoul, South Korea
Delta [Foxtrot; Delta], Seoul, South Korea

My actual result (failed attempt)

I have tried nested for loops where C1 will be the outer loop and C3 will be the inner loop. I get the following results:

data3 = {'C1': ['Alpha', 'Alpha', 'Alpha', 'Charlie', 'Charlie', 
                'Bravo','Beta', 'Beta', 'Gamma','Foxtrot', 'Delta'],
        'C2': ['Alpha, San Francisco, US', 'Alpha, New York, US', 'Alpha, London, UK', 
               'Charlie, New York, US', 'Charlie, London, UK', 'Bravo, London, UK',
               'Beta, Singapore, Singapore', 'Beta, Tokyo, Japan', 
               'Gamma, Tokyo, Japan',
               [Foxtrot; Delta], Seoul, South Korea,
               [Foxtrot; Delta], Seoul, South Korea],}

result_df = pd.DataFrame(data3)
N1 N2
Alpha Alpha, San Francisco, US
Alpha Charlie, New York, US
Alpha Bravo, London, UK
Charlie Charlie, New York, US
Charlie Charlie, London, UK
Bravo Bravo, London, UK
Beta Beta, Singapore, Singapore
Beta Gamma, Tokyo, Japan
Gamma Gamma, Tokyo, Japan
Foxtrot [Foxtrot; Delta], Seoul, South Korea
Delta [Foxtrot; Delta], Seoul, South Korea

The script that I used

The script I have is below.

new_list = []
for idx, vals in enumerate(df['C1']):
    vals = vals.split('; ')
    locs = (df['C2'].values[idx]).split(' ; ')
    for val in vals:
        for loc in locs:
            new_list.append((idx, val, loc))

new_df = pd.DataFrame(data=new_list, columns=['N1', 'N2'])

Any help will really be much appreciated. Thanks.

CodePudding user response:

There are two quirks in your data.

  1. The presence of a semi-colon inside the square bracket in some of the rows, which makes splitting my ; a bad idea
  2. Unequal number of values to map the two columns

Try this:

import re
def replace_semi_colon(s):
    return re.sub(r'\[(.*?)\]', lambda x: '['   x.group(1).replace(';', ':')   ']', s)

# temporarily replace the semi-colon inside square brackets with colon
df['C2'] = df['C2'].apply(lambda x: replace_semi_colon(x))

# now split the values 
df['C1'] = df['C1'].str.split(',')
df['C2'] = df['C2'].str.split(';')

# make the number of values in each column equal
df['C2'] = df.apply(lambda x : x['C2'] * len(x['C1']) if len(x['C1'])>len(x['C2']) else x['C2'] , axis =1)
df = df.explode(['C1','C2'])

# replace the ':' back  to  ';'
df['C2'] = df['C2'].apply(lambda x: x.replace(':',';'))

The result it gives:


    C1      C2
0   Alpha   Alpha, San Francisco, US
0   Charlie Charlie, New York, US
0   Bravo   Bravo, London, UK
1   Beta    Beta, Singapore, Singapore
1   Gamma   Gamma, Tokyo, Japan
2   Foxtrot [Foxtrot; Delta], Seoul, South Korea
2   Delta   [Foxtrot; Delta], Seoul, South Korea
  • Related