I have two .csv
files.
'data.csv':
ID | start_sec | end_sec | labels |
---|---|---|---|
--PJHxphWEs | 30.000 | 40.000 | "/m/09x0r,/t/dd00088" |
--ZhevVpy1s | 50.000 | 60.000 | "/m/012xff" |
'labels.csv'(There are like 500 labels in this file, but to show you what it looks like):
index | labels | labels_en |
---|---|---|
0 | /m/09x0r | "Speech" |
1 | /m/05zppz | "Male speech, man speaking" |
I want to take the labels in 'data.csv', compare them against the labels in 'labels.csv' then save the corresponding labels_en
to a new column in my 'data.csv'.
As you can see, there can be more than one label that needs to be compared from 'data.csv' labels, but they are both within one string as shown by the first row of the table "/m/09x0r,/t/dd00088" -- that is two different values to find a corresponding labels_en for.
There is also some difficulty with labels not being encased in "", so the type of the data in 'data.csv' labels is a different type to the data in 'labels.csv' labels.
How do I go about doing this? I've tried a few different things in python but haven't been having any joy. Can anyone suggest how to do this? Or at least point me in the right direction.
The code I've tried so far:
import pandas as pd
df2 = pd.read_csv('labels.csv')
df1 = pd.read_csv('data.csv')
merged_df = pd.merge(df2, df1, on='labels')
for index, row in merged_df.iterrows():
labels = row['labels']
string = row['labels_en']
df2.loc[df2['labels'] == labels, 'labels'] = string
df2.to_csv('modified_csv.csv', index=False)
But I don't think it's even close yet, and the pd.merge()
doesn't work because the types of labels are different.
CodePudding user response:
For your problem, the best way to replace df1
's labels
with df2
's labels
could be, by making dictionary
of df2
's labels
& labels_en
column, and use them in replacing.
Here's the code to perform this:
import pandas as pd
df2 = pd.read_csv('labels.csv')
df1 = pd.read_csv('data.csv')
df=df2.drop(["index"],axis=1) # Removing unnecessary Column
di=dict(df.values) # {"/m/09x0r":"Speech",....}
def replace(x):
r=[]
for a in x.split(","):
b=di.get(a)
if b==None:
r.append(a)
else:
r.append(b)
return ",".join(r)
# Just replacing in same column
df1["labels"]=df1["labels"].apply(replace) # You may make new column
print(df1)
# Or same but one-liner solution is
df1["labels"]=df1["labels"].apply(lambda x:",".join([di.get(a) if di.get(a)!=None else a for a in x.split(",")]))
print(df1)
"""
Output:
ID start_sec end_sec labels
0 --PJHxphWEs 30.0 40.0 Speech,/t/dd00088
1 --ZhevVpy1s 50.0 60.0 /m/012xff
"""
What we are doing here is, first importing both data.csv & labels.csv file. Making dictionary using labels.csv
file. And replacing labels
of df1
which is separated by "," using dictionary.
CodePudding user response:
[Explode][1] your data_df
df_data = df_data.assign(labels2=df_data['labels'].str.split(',')).explode('labels2')
df_data
Out[279]:
ID start_sec end_sec labels labels2
0 --PJHxphWEs 30.0 40.0 /m/09x0r,/t/dd00088 /m/09x0r
0 --PJHxphWEs 30.0 40.0 /m/09x0r,/t/dd00088 /t/dd00088
1 --ZhevVpy1s 50.0 60.0 /m/012xff /m/012xff
Merge it with df_labels and get labels_en
df_data['labels_en']=df_data.merge(df_labels,left_on='labels2',right_on='labels')['labels_en']
df_data[['ID','start_sec','end_sec','labels','labels_en']]
Out[284]:
ID start_sec end_sec labels labels_en
0 --PJHxphWEs 30.0 40.0 /m/09x0r,/t/dd00088 Speech
0 --PJHxphWEs 30.0 40.0 /m/09x0r,/t/dd00088 Speech
1 --ZhevVpy1s 50.0 60.0 /m/012xff NaN