I have a pandas dataframe with a structure like:
df =
repl_str | normal_str |
---|---|
1_labelled | 1_text |
2_labelled | 2_text |
4_labelled | 4_text |
5_labelled | 5_text |
7_labelled | 7_text |
8_labelled | 8_text |
And a list of lists where some of the strings in df["normal_str"] are present, but not necessarily all, like:
A = [['1_text', '3_text', '4_text'], ['5_text'], ['6_text', '8_text']]
I want to create a new list of lists B, where each string in A that also appears in df["normal_str"] is replaced by the corresponding string from the "repl_str" column of df. Strings in A that are not present in df["normal_str"] should be left as is.
So in this case: B = [['1_labelled', '3_text', '4_labelled'], ['5_labelled'], ['6_text', '8_labelled']].
In the actual list of lists (rather than this mock example), the inner lists vary greatly in length. I have a working solution using a list comprehension, but it takes a long time to run:
[[[str_val for str_val in df['repl_str'].where(df['normal_str']==y).tolist() if str_val==str_val][0]
if [str_val for str_val in df['repl_str'].where(df['normal_str']==y).tolist() if str_val == str_val]
else y for y in x] for x in A]
Does anyone know a quicker way?
CodePudding user response:
If the values in the normal_str column are all unique, you can create a dictionary that maps the normal_str column to the repl_str column:
A = [['1_text', '3_text', '4_text'], ['5_text'], ['6_text', '8_text']]
d = df.set_index(['normal_str'])['repl_str'].to_dict()
B = [[d.get(text, text) for text in lst] for lst in A]
print(B)
[['1_labelled', '3_text', '4_labelled'], ['5_labelled'], ['6_text', '8_labelled']]
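The dictionary approach is faster because each lookup takes constant time on average, whereas the original comprehension scans the whole column for every element of A. If normal_str might contain duplicates, one option is to build the mapping by hand so that the first occurrence wins. The following is only a sketch under that assumption: the helper names build_mapping and replace_nested are hypothetical, the two plain lists stand in for df["normal_str"] and df["repl_str"] (the DataFrame columns could be passed in directly, since iterating a Series yields its values), and the extra "1_text" / "1_other" row is made up to show the duplicate case.

```python
def build_mapping(keys, values):
    """Map each key to the first value paired with it (first occurrence wins)."""
    mapping = {}
    for key, value in zip(keys, values):
        # setdefault only stores the value if the key is not present yet,
        # so later duplicates of a key are ignored.
        mapping.setdefault(key, value)
    return mapping


def replace_nested(nested, mapping):
    """Replace every string found in mapping; leave all other strings unchanged."""
    return [[mapping.get(item, item) for item in inner] for inner in nested]


# Stand-ins for df["normal_str"] and df["repl_str"].
# The last row duplicates "1_text" on purpose (illustrative data only).
normal_str = ["1_text", "2_text", "4_text", "5_text", "7_text", "8_text", "1_text"]
repl_str = ["1_labelled", "2_labelled", "4_labelled", "5_labelled",
            "7_labelled", "8_labelled", "1_other"]

A = [["1_text", "3_text", "4_text"], ["5_text"], ["6_text", "8_text"]]

mapping = build_mapping(normal_str, repl_str)
B = replace_nested(A, mapping)
print(B)
```

Whether the first or the last duplicate should win depends on the data; swapping setdefault for a plain assignment (mapping[key] = value) would keep the last one instead.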