I have a pandas DataFrame f_sp such that f_sp['tokenized_company_name'] is:
0 [berkshire, hathaway]
1 [icbc]
2 [saudi, arabian, oil, company, saudi, aramco]
3 [jpmorgan, chase]
4 [china, construction, bank]
Name: tokenized_company_name, dtype: object
and another pandas DataFrame tf_sp such that tf_sp['output_column'] is:
[0.7071067811865476, 0.7071067811865476]
[1.0]
[0.3779598156018814, 0.39838548612653973, 0.39838548612653973, 0.3285496573358837, 0.6570993146717674]
[0.7071067811865476, 0.7071067811865476]
[0.4225972188244829, 0.510750779645552, 0.7486956870005814]
I'm trying to sort each list of tokens in f_sp['tokenized_company_name'] by the corresponding list of values in tf_sp['output_column'] such that I get:
0 [berkshire, hathaway] # no difference
1 [icbc] # no difference
2 [aramco, arabian, oil, saudi, company] # re-ordered by decreasing value of tf_sp['output_column']
3 [chase, jpmorgan] # tied elements should be ordered alphabetically
4 [bank, construction, china] # re-ordered by decreasing value of tf_sp['output_column']
Here's what I've tried so far:
(f_sp.apply(lambda x: sorted(x['tokenized_company_name'],
key=lambda y: tf_sp.loc[x.name,'output_column'][x['tokenized_company_name'].index(y)],
reverse=True), axis=1))
But I get the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Input In [166], in <cell line: 1>()
----> 1 (f_sp.apply(lambda x: sorted(x['tokenized_company_name'],
2 key=lambda y: tf_sp.loc[x.name,'output_column'][x['tokenized_company_name'].index(y)],
3 reverse=True), axis=1))
File ~\.conda\envs\python37dev\lib\site-packages\pandas\core\frame.py:9555, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
9544 from pandas.core.apply import frame_apply
9546 op = frame_apply(
9547 self,
9548 func=func,
(...)
9553 kwargs=kwargs,
9554 )
-> 9555 return op.apply().__finalize__(self, method="apply")
File ~\.conda\envs\python37dev\lib\site-packages\pandas\core\apply.py:746, in FrameApply.apply(self)
743 elif self.raw:
744 return self.apply_raw()
--> 746 return self.apply_standard()
File ~\.conda\envs\python37dev\lib\site-packages\pandas\core\apply.py:873, in FrameApply.apply_standard(self)
872 def apply_standard(self):
--> 873 results, res_index = self.apply_series_generator()
875 # wrap results
876 return self.wrap_results(results, res_index)
File ~\.conda\envs\python37dev\lib\site-packages\pandas\core\apply.py:889, in FrameApply.apply_series_generator(self)
886 with option_context("mode.chained_assignment", None):
887 for i, v in enumerate(series_gen):
888 # ignore SettingWithCopy here in case the user mutates
--> 889 results[i] = self.f(v)
890 if isinstance(results[i], ABCSeries):
891 # If we have a view on v, we need to make a copy because
892 # series_generator will swap out the underlying data
893 results[i] = results[i].copy(deep=False)
Input In [166], in <lambda>(x)
----> 1 (f_sp.apply(lambda x: sorted(x['tokenized_company_name'],
2 key=lambda y: tf_sp.loc[x.name,'output_column'][x['tokenized_company_name'].index(y)],
3 reverse=True), axis=1))
Input In [166], in <lambda>.<locals>.<lambda>(y)
1 (f_sp.apply(lambda x: sorted(x['tokenized_company_name'],
----> 2 key=lambda y: tf_sp.loc[x.name,'output_column'][x['tokenized_company_name'].index(y)],
3 reverse=True), axis=1))
IndexError: list index out of range
Why is this happening? As far as I can tell, each list of tokens has the same number of elements as its list of scores.
CodePudding user response:
To sort the list of tokens in f_sp['tokenized_company_name'] by the corresponding value in tf_sp['output_column'], you can use the zip function to combine the two columns and then sort the resulting list of tuples by the second element of each tuple (the value from tf_sp['output_column']). You can then extract only the first element of each tuple (the token) to obtain the sorted list of tokens.
Here is an example of how you can achieve this using a lambda function with the apply method of f_sp:
f_sp['tokenized_company_name'] = f_sp.apply(
    lambda x: [t[0] for t in sorted(
        zip(x['tokenized_company_name'], tf_sp.loc[x.name, 'output_column']),
        key=lambda t: (-t[1], t[0]))],  # descending score, ties alphabetical
    axis=1)
This will sort the list of tokens in f_sp['tokenized_company_name'] by the corresponding value in tf_sp['output_column'] and store the sorted list back in f_sp['tokenized_company_name'].
Note that this solution assumes that, for each row, the list in f_sp['tokenized_company_name'] and the list in tf_sp['output_column'] have the same length. zip stops at the shorter of the two lists without raising an error, so any extra tokens would be dropped silently; if the lengths can differ, you need to handle that case explicitly.
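The length caveat seems to matter for the sample data in the question: row 2 has six tokens (saudi appears twice) but only five scores, so index('aramco') returns 5 and the lookup into the five-element score list fails with the IndexError. Below is a minimal sketch of one way around it, assuming the scores line up with the unique tokens in first-appearance order. That assumption is not stated in the question, but it reproduces the expected output shown there. The helper name sort_tokens is hypothetical, and plain lists stand in for the two columns.

```python
# Sample data copied from the question (plain lists standing in for the columns).
tokens_col = [
    ["berkshire", "hathaway"],
    ["icbc"],
    ["saudi", "arabian", "oil", "company", "saudi", "aramco"],
    ["jpmorgan", "chase"],
    ["china", "construction", "bank"],
]
scores_col = [
    [0.7071067811865476, 0.7071067811865476],
    [1.0],
    [0.3779598156018814, 0.39838548612653973, 0.39838548612653973,
     0.3285496573358837, 0.6570993146717674],
    [0.7071067811865476, 0.7071067811865476],
    [0.4225972188244829, 0.510750779645552, 0.7486956870005814],
]

# Rows whose token list and score list differ in length.
mismatched = [i for i, (t, s) in enumerate(zip(tokens_col, scores_col))
              if len(t) != len(s)]


def sort_tokens(tokens, scores):
    """Hypothetical helper: unique tokens by descending score, ties alphabetical."""
    # ASSUMPTION: scores correspond to unique tokens in first-seen order.
    unique = list(dict.fromkeys(tokens))  # drop repeats, keep first-seen order
    if len(unique) != len(scores):
        raise ValueError(f"{len(unique)} unique tokens but {len(scores)} scores")
    pairs = sorted(zip(unique, scores), key=lambda p: (-p[1], p[0]))
    return [token for token, _ in pairs]


result = [sort_tokens(t, s) for t, s in zip(tokens_col, scores_col)]
```

With the DataFrames from the question, the same helper could be applied as [sort_tokens(t, s) for t, s in zip(f_sp['tokenized_company_name'], tf_sp['output_column'])], provided both frames share the same row order. Raising on a length mismatch, rather than letting zip truncate, is a deliberate choice here so that misaligned rows surface early.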
CodePudding user response:
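To order each list of strings by a corresponding list of floats in pandas, note that sort_values cannot do this: it reorders the rows of a DataFrame by column values and does not sort the contents of individual cells. Instead, pair the two columns row by row and sort each pair of lists. Here is an example:
import pandas as pd
# create dataframe with string lists as data
df = pd.DataFrame({'strings': [['apple', 'banana', 'cherry'],
                               ['dog', 'cat', 'bird'],
                               ['red', 'green', 'blue']]})
# create dataframe with float lists as data
df_floats = pd.DataFrame({'floats': [[1.0, 2.0, 3.0],
                                     [4.0, 5.0, 6.0],
                                     [7.0, 8.0, 9.0]]})
# sort each string list by decreasing value of its float list
df['strings'] = [[s for s, _ in sorted(zip(strings, floats), key=lambda p: -p[1])]
                 for strings, floats in zip(df['strings'], df_floats['floats'])]
This replaces each list of strings with the same strings ordered by decreasing value of the corresponding floats, so the first row becomes ['cherry', 'banana', 'apple'].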