Replacing the values of one column based on other column-CodePudding

I have two .csv files.

'data.csv':

ID	start_sec	end_sec	labels
--PJHxphWEs	30.000	40.000	"/m/09x0r,/t/dd00088"
--ZhevVpy1s	50.000	60.000	"/m/012xff"

'labels.csv'(There are like 500 labels in this file, but to show you what it looks like):

index	labels	labels_en
0	/m/09x0r	"Speech"
1	/m/05zppz	"Male speech, man speaking"

I want to take the labels in 'data.csv', compare them against the labels in 'labels.csv' then save the corresponding labels_en to a new column in my 'data.csv'.

As you can see, there can be more than one label that needs to be compared from 'data.csv' labels, but they are both within one string as shown by the first row of the table "/m/09x0r,/t/dd00088" -- that is two different values to find a corresponding labels_en for.

There is also some difficulty with labels not being encased in "", so the type of the data in 'data.csv' labels is a different type to the data in 'labels.csv' labels.

How do I go about doing this? I've tried a few different things in python but haven't been having any joy. Can anyone suggest how to do this? Or at least point me in the right direction.

The code I've tried so far:

import pandas as pd

df2 = pd.read_csv('labels.csv')
df1 = pd.read_csv('data.csv')


merged_df = pd.merge(df2, df1, on='labels')

for index, row in merged_df.iterrows():
  labels = row['labels']
  string = row['labels_en']
  df2.loc[df2['labels'] == labels, 'labels'] = string

df2.to_csv('modified_csv.csv', index=False)

But I don't think it's even close yet, and the pd.merge() doesn't work because the types of labels are different.

CodePudding user response：

For your problem, the best way to replace df1's labels with df2's labels could be, by making dictionary of df2's labels & labels_en column, and use them in replacing.
Here's the code to perform this:

import pandas as pd

df2 = pd.read_csv('labels.csv')
df1 = pd.read_csv('data.csv')

df=df2.drop(["index"],axis=1) # Removing unnecessary Column
di=dict(df.values) # {"/m/09x0r":"Speech",....}

def replace(x):
    r=[]
    for a in x.split(","):
        b=di.get(a)
        if b==None:
            r.append(a)
        else:
            r.append(b)
    return ",".join(r)

# Just replacing in same column
df1["labels"]=df1["labels"].apply(replace) # You may make new column
print(df1)


# Or same but one-liner solution is
df1["labels"]=df1["labels"].apply(lambda x:",".join([di.get(a) if di.get(a)!=None else a for a in x.split(",")]))

print(df1)


"""
Output: 
            ID  start_sec  end_sec             labels
0  --PJHxphWEs       30.0     40.0  Speech,/t/dd00088
1  --ZhevVpy1s       50.0     60.0          /m/012xff
"""

What we are doing here is, first importing both data.csv & labels.csv file. Making dictionary using labels.csv file. And replacing labels of df1 which is separated by "," using dictionary.

CodePudding user response：

[Explode][1] your data_df

df_data = df_data.assign(labels2=df_data['labels'].str.split(',')).explode('labels2')
df_data
Out[279]: 
            ID  start_sec  end_sec               labels     labels2
0  --PJHxphWEs       30.0     40.0  /m/09x0r,/t/dd00088    /m/09x0r
0  --PJHxphWEs       30.0     40.0  /m/09x0r,/t/dd00088  /t/dd00088
1  --ZhevVpy1s       50.0     60.0            /m/012xff   /m/012xff

Merge it with df_labels and get labels_en

df_data['labels_en']=df_data.merge(df_labels,left_on='labels2',right_on='labels')['labels_en']
df_data[['ID','start_sec','end_sec','labels','labels_en']]
Out[284]: 
            ID  start_sec  end_sec               labels labels_en
0  --PJHxphWEs       30.0     40.0  /m/09x0r,/t/dd00088    Speech
0  --PJHxphWEs       30.0     40.0  /m/09x0r,/t/dd00088    Speech
1  --ZhevVpy1s       50.0     60.0            /m/012xff       NaN