How to order list of lists of strings by another list of lists of floats in Pandas

Time:12-04

I have a Pandas dataframe f_sp such that f_sp['tokenized_company_name'] is:

0                            [berkshire, hathaway]
1                                           [icbc]
2    [saudi, arabian, oil, company, saudi, aramco]
3                                [jpmorgan, chase]
4                      [china, construction, bank]
Name: tokenized_company_name, dtype: object

and another Pandas dataframe tf_sp such that tf_sp['output_column'] is:

[0.7071067811865476, 0.7071067811865476]
[1.0]
[0.3779598156018814, 0.39838548612653973, 0.39838548612653973, 0.3285496573358837, 0.6570993146717674]
[0.7071067811865476, 0.7071067811865476]
[0.4225972188244829, 0.510750779645552, 0.7486956870005814]

I'm trying to sort each list of tokens in f_sp['tokenized_company_name'] by tf_sp['output_column'] such that I get:

0                            [berkshire, hathaway] # no difference
1                                           [icbc] # no difference
2           [aramco, arabian, oil, saudi, company] # re-ordered by decreasing value of tf_sp['output_column']
3                                [chase, jpmorgan] # tied elements should be ordered alphabetically
4                      [bank, construction, china] # re-ordered by decreasing value of tf_sp['output_column']

Here's what I've tried so far:

(f_sp.apply(lambda x: sorted(x['tokenized_company_name'], 
           key=lambda y: tf_sp.loc[x.name,'output_column'][x['tokenized_company_name'].index(y)], 
                                reverse=True), axis=1))

But I get the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Input In [166], in <cell line: 1>()
----> 1 (f_sp.apply(lambda x: sorted(x['tokenized_company_name'], 
      2            key=lambda y: tf_sp.loc[x.name,'output_column'][x['tokenized_company_name'].index(y)], 
      3                                 reverse=True), axis=1))

File ~\.conda\envs\python37dev\lib\site-packages\pandas\core\frame.py:9555, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
   9544 from pandas.core.apply import frame_apply
   9546 op = frame_apply(
   9547     self,
   9548     func=func,
   (...)
   9553     kwargs=kwargs,
   9554 )
-> 9555 return op.apply().__finalize__(self, method="apply")

File ~\.conda\envs\python37dev\lib\site-packages\pandas\core\apply.py:746, in FrameApply.apply(self)
    743 elif self.raw:
    744     return self.apply_raw()
--> 746 return self.apply_standard()

File ~\.conda\envs\python37dev\lib\site-packages\pandas\core\apply.py:873, in FrameApply.apply_standard(self)
    872 def apply_standard(self):
--> 873     results, res_index = self.apply_series_generator()
    875     # wrap results
    876     return self.wrap_results(results, res_index)

File ~\.conda\envs\python37dev\lib\site-packages\pandas\core\apply.py:889, in FrameApply.apply_series_generator(self)
    886 with option_context("mode.chained_assignment", None):
    887     for i, v in enumerate(series_gen):
    888         # ignore SettingWithCopy here in case the user mutates
--> 889         results[i] = self.f(v)
    890         if isinstance(results[i], ABCSeries):
    891             # If we have a view on v, we need to make a copy because
    892             #  series_generator will swap out the underlying data
    893             results[i] = results[i].copy(deep=False)

Input In [166], in <lambda>(x)
----> 1 (f_sp.apply(lambda x: sorted(x['tokenized_company_name'], 
      2            key=lambda y: tf_sp.loc[x.name,'output_column'][x['tokenized_company_name'].index(y)], 
      3                                 reverse=True), axis=1))

Input In [166], in <lambda>.<locals>.<lambda>(y)
      1 (f_sp.apply(lambda x: sorted(x['tokenized_company_name'], 
----> 2            key=lambda y: tf_sp.loc[x.name,'output_column'][x['tokenized_company_name'].index(y)], 
      3                                 reverse=True), axis=1))

IndexError: list index out of range

Why is this happening? Each list of lists has the same number of elements.

CodePudding user response:

To sort the list of tokens in f_sp['tokenized_company_name'] by the corresponding value in tf_sp['output_column'], you can use the zip function to combine the two columns and then sort the resulting list of tuples by the value of the second element in each tuple (which is the corresponding value from tf_sp['output_column']). You can then extract only the first element of each tuple (which is the token) to obtain the sorted list of tokens.

Here is an example of how you can achieve this using a lambda function with the apply method of f_sp:

f_sp['tokenized_company_name'] = f_sp.apply(
    lambda x: [t[0] for t in sorted(
        zip(x['tokenized_company_name'], tf_sp.loc[x.name, 'output_column']),
        key=lambda t: (-t[1], t[0]))],  # decreasing weight, ties in alphabetical order
    axis=1)

This will sort the list of tokens in f_sp['tokenized_company_name'] by the corresponding value in tf_sp['output_column'] and store the sorted list back in f_sp['tokenized_company_name'].
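As a quick check of this approach, here is a minimal sketch that rebuilds rows 0, 3 and 4 from the question (the rows where the token and weight lists have equal length) and sorts each token list by decreasing weight, with ties broken alphabetically. The frame and column names follow the question; the helper name order_tokens is hypothetical, and the two columns are paired by position, which assumes both frames list their rows in the same order.

```python
import pandas as pd

# Rows 0, 3 and 4 from the question, keeping their original index labels.
f_sp = pd.DataFrame(
    {"tokenized_company_name": [
        ["berkshire", "hathaway"],
        ["jpmorgan", "chase"],
        ["china", "construction", "bank"],
    ]},
    index=[0, 3, 4],
)
tf_sp = pd.DataFrame(
    {"output_column": [
        [0.7071067811865476, 0.7071067811865476],
        [0.7071067811865476, 0.7071067811865476],
        [0.4225972188244829, 0.510750779645552, 0.7486956870005814],
    ]},
    index=[0, 3, 4],
)


def order_tokens(tokens, weights):
    """Return tokens sorted by decreasing weight; equal weights sort alphabetically."""
    pairs = sorted(zip(tokens, weights), key=lambda p: (-p[1], p[0]))
    return [token for token, _ in pairs]


# Pair the two columns row by row (by position) and sort each token list.
result = pd.Series(
    [order_tokens(t, w)
     for t, w in zip(f_sp["tokenized_company_name"], tf_sp["output_column"])],
    index=f_sp.index,
)
print(result.tolist())
# [['berkshire', 'hathaway'], ['chase', 'jpmorgan'], ['bank', 'construction', 'china']]
```

Negating the weight in the key, rather than passing reverse=True, is what lets the alphabetical tie-break run in ascending order while the weights run in descending order.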

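Note that this solution assumes that the token list and the weight list have the same length in every row. In the sample data they do not: row 2 has six tokens ("saudi" appears twice) but only five weights, so looking up a weight by token position runs past the end of the weight list, which is what raised the IndexError. zip stops at the shorter of its inputs, so it will not raise here, but any tokens without a matching weight are silently dropped. If that is not what you want, you need to handle rows whose lists differ in length separately.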
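One way to spot rows whose two lists differ in length before sorting is to compare the list lengths row by row. This is a minimal sketch, assuming both frames share the same index; the data is rows 1 and 2 copied from the question, and the variable names n_tokens, n_weights and mismatched are hypothetical.

```python
import pandas as pd

# Rows 1 and 2 from the question, keeping their original index labels.
f_sp = pd.DataFrame(
    {"tokenized_company_name": [
        ["icbc"],
        ["saudi", "arabian", "oil", "company", "saudi", "aramco"],
    ]},
    index=[1, 2],
)
tf_sp = pd.DataFrame(
    {"output_column": [
        [1.0],
        [0.3779598156018814, 0.39838548612653973, 0.39838548612653973,
         0.3285496573358837, 0.6570993146717674],
    ]},
    index=[1, 2],
)

# Length of the list stored in each row.
n_tokens = f_sp["tokenized_company_name"].map(len)
n_weights = tf_sp["output_column"].map(len)

# Index labels of the rows where the two lengths disagree.
mismatched = n_tokens[n_tokens != n_weights].index.tolist()
print(mismatched)  # [2]
```

Rows reported here need their tokens and weights reconciled (for example by deduplicating the tokens, if the weights were computed per unique token) before the sort can give a meaningful order.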

CodePudding user response:

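To order a list of lists of strings by another list of lists of floats in Pandas, the "sort_values" method will not help: it reorders the rows of a dataframe by the values in one or more columns, not the elements inside each list. Instead, pair each list of strings with its list of floats and sort the pairs. Here is an example:

import pandas as pd

# create dataframe with string lists as data
df = pd.DataFrame({'strings': [['apple', 'banana', 'cherry'],
                               ['dog', 'cat', 'bird'],
                               ['red', 'green', 'blue']]})

# create dataframe with float lists as data
df_floats = pd.DataFrame({'floats': [[1.0, 2.0, 3.0],
                                     [4.0, 5.0, 6.0],
                                     [7.0, 8.0, 9.0]]})

# sort each list of strings by its list of floats (decreasing value, ties alphabetical)
df['strings'] = [
    [s for s, _ in sorted(zip(strings, floats), key=lambda p: (-p[1], p[0]))]
    for strings, floats in zip(df['strings'], df_floats['floats'])
]

This replaces each list in df['strings'] with the same strings ordered by decreasing value of the corresponding floats, so the first row becomes ['cherry', 'banana', 'apple'].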
