Label any text with multiple topics in sequence of their occurrence

I have a DataFrame with an ID and Text like below:

df1

ID	Text
1	I have completed my order
2	I have made the payment. When can I expect the order to be delivered?
3	I am unable to make the payment.
4	I am done with registration and payment. I need the order number?
5	I am unable to complete registration. How will I even order?

I have certain topics to classify these texts: classes = ["order", "payment", "registration"]

I am doing the following, which gets me the results:

import pandas as pd

classes = ["order", "payment", "registration"]
field = "Text"
for c in classes:
    # Keep only the rows of df1 whose text mentions this class
    df2 = df1[df1[field].str.contains(c)]
    print(c)
    # Write one CSV per class, e.g. ./order.csv
    df2.to_csv("./" + c + ".csv", index=False)

This generates 3 CSV files, which I later join again:

import os
import pandas as pd

file_list = []
os.chdir('<file path>')

for file in os.listdir():
    if file.endswith('.csv'):
        df = pd.read_csv(file, sep=",", encoding='ISO-8859-1')
        df['filename'] = file
        file_list.append(df)

df_topic = pd.concat(file_list, ignore_index=True)
# The file name without its extension is the topic, e.g. "order.csv" -> "order"
df_topic['Topic'] = df_topic['filename'].str.split('.').str[0]
df_topic = df_topic.drop(columns='filename')

The resultant DataFrame looks like this:

ID	Text	Topic
1	I have completed my order	order
2	I have made the payment. When can I expect the order to be delivered?	order
4	I am done with registration and payment. I need the order number?	order
5	I am unable to complete registration. How will I even order?	order
2	I have made the payment. When can I expect the order to be delivered?	payment
3	I am unable to make the payment.	payment
4	I am done with registration and payment. I need the order number?	payment
4	I am done with registration and payment. I need the order number?	registration
5	I am unable to complete registration. How will I even order?	registration
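
The CSV round trip is not strictly needed to get this long-form table. A minimal sketch, assuming df1 holds the five rows shown at the top, tags each filtered slice in memory and concatenates the slices. The name `parts` is only illustrative, and `regex=False` is an assumption that the class names should be matched as literal substrings:

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Text": [
        "I have completed my order",
        "I have made the payment. When can I expect the order to be delivered?",
        "I am unable to make the payment.",
        "I am done with registration and payment. I need the order number?",
        "I am unable to complete registration. How will I even order?",
    ],
})
classes = ["order", "payment", "registration"]

# One filtered copy of df1 per class, each tagged with the class it matched.
parts = [
    df1[df1["Text"].str.contains(c, regex=False)].assign(Topic=c)
    for c in classes
]

# Stack the slices into one long table: one row per (text, topic) pair.
df_topic = pd.concat(parts, ignore_index=True)
print(df_topic[["ID", "Topic"]])
```

Because this is a plain substring test, ID 5 is also tagged as order (its text ends with "How will I even order?").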

But the problem here is that the same text may contain keywords for several classes and can be tagged as any of them (for example, the text for id=2 has both order and payment). I can only have one record for each id, so I would prefer to label the topics as Primary or Secondary based on the order in which they occur from the beginning of the text. If a text has more than 2 topics, the first 2 get preference, but in case the third (or nth) topic is needed in the future, I would like to store all of them as a list in the final field. (The example for id = 4 illustrates this.)

ID	Text	Primary Topic	Secondary Topic	Identified Topics	Topics List
1	I have completed my order	order	null	1	[order]
2	I have made the payment. When can I expect the order to be delivered?	payment	order	2	[payment,order]
3	I am unable to make the payment.	payment	null	1	[payment]
4	I am done with registration and payment. I need the order number?	registration	payment	3	[registration,payment,order]
5	I am unable to complete registration. How will I even order?	registration	order	2	[registration,order]

Is it possible to do it this way? If not, what is a good way to approach such labelling issues?

CodePudding user response：

IIUC, you could use str.extractall combined with GroupBy.agg:

lst = ["order", "payment", "registration"]
regex = f'({"|".join(lst)})'  # if lst contains special chars, wrap in re.escape
# df1 is the DataFrame from the question; extractall yields one row per match, in order of occurrence
df2 = df1.join(df1['Text']
               .str.extractall(regex)[0]
               .groupby(level=0).agg(**{'Primary Topic': 'first',
                                        'Secondary Topic': lambda x: x.iloc[1] if len(x)>1 else 'null',
                                        'Identified Topics': 'nunique',
                                        'Topics List': list})
                )

output:

   ID                                                                   Text Primary Topic Secondary Topic  Identified Topics                     Topics List
0   1                                              I have completed my order         order            null                  1                         [order]
1   2  I have made the payment. When can I expect the order to be delivered?       payment           order                  2                [payment, order]
2   3                                       I am unable to make the payment.       payment            null                  1                       [payment]
3   4      I am done with registration and payment. I need the order number?  registration         payment                  3  [registration, payment, order]
4   5           I am unable to complete registration. How will I even order?  registration           order                  2           [registration, order]
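
A few edge cases are not covered by the sample data: a keyword repeated in one text, different capitalisation, a keyword inside a longer word, and texts with no keyword at all. The following is a hedged sketch of one way to handle them, not a replacement for the approach above. It uses Series.str.findall instead of extractall, the sample rows are made up for illustration, and `unique_in_order` is a hypothetical helper name:

```python
import re
import pandas as pd

# Made-up sample rows (not from the question) that exercise the edge cases.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Text": [
        "Payment failed. Is my payment pending? I need the order number.",
        "I want to reorder the same item.",
        "Hello, is anyone there?",
    ],
})
lst = ["order", "payment", "registration"]

# \b stops "reorder" from counting as "order"; re.escape guards special characters.
regex = r"\b(?:" + "|".join(map(re.escape, lst)) + r")\b"

def unique_in_order(found):
    """Lower-case the matches and drop repeats, keeping first-seen order."""
    return list(dict.fromkeys(word.lower() for word in found))

# One list of topics per row, ordered by first occurrence in the text.
topics = df["Text"].str.findall(regex, flags=re.IGNORECASE).map(unique_in_order)

df2 = df.assign(**{
    "Primary Topic": topics.map(lambda t: t[0] if len(t) > 0 else None),
    "Secondary Topic": topics.map(lambda t: t[1] if len(t) > 1 else None),
    "Identified Topics": topics.map(len),
    "Topics List": topics,
})
print(df2.drop(columns="Text"))
```

The word boundaries are a trade-off: they also stop "orders" from matching "order", so plural or inflected forms would have to be added to lst if they should count.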