I have a DataFrame with an ID and Text like below:
df1
ID | Text |
---|---|
1 | I have completed my order |
2 | I have made the payment. When can I expect the order to be delivered? |
3 | I am unable to make the payment. |
4 | I am done with registration and payment. I need the order number? |
5 | I am unable to complete registration. How will I even order? |
I want to classify these texts into the following topics: classes = ["order", "payment", "registration"]
I am doing the following which gets me the results:
import pandas as pd

classes = ["order", "payment", "registration"]
field = "Text"
for c in classes:
    # keep the rows of df1 whose text contains the current keyword
    df2 = df1[df1[field].str.contains(c)]
    print(c)
    # one CSV per topic, e.g. ./order.csv (index=False avoids an extra index column)
    df2.to_csv("./" + c + ".csv", index=False)
This generates three CSV files, which I later join again:
import os
import pandas as pd

file_list = []
os.chdir('<file path>')
for file in os.listdir():
    if file.endswith('.csv'):
        df = pd.read_csv(file, sep=",", encoding='ISO-8859-1')
        # remember which file (and therefore which topic) each row came from
        df['filename'] = file
        file_list.append(df)
df_topic = pd.concat(file_list, ignore_index=True)
# "order.csv" -> "order"
df_topic['topic'] = df_topic['filename'].str.split('.').str[0]
df_topic = df_topic.drop(columns='filename')
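As an aside, the same long-format table can probably be built in memory, without writing and re-reading CSV files. The following is only a minimal sketch: it assumes df1 is rebuilt from the sample table above, names the new column Topic, and passes regex=False so that each keyword is matched as a literal substring.

```python
import pandas as pd

# Sample data copied from the table in the question
df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Text": [
        "I have completed my order",
        "I have made the payment. When can I expect the order to be delivered?",
        "I am unable to make the payment.",
        "I am done with registration and payment. I need the order number?",
        "I am unable to complete registration. How will I even order?",
    ],
})

classes = ["order", "payment", "registration"]

# One filtered copy of df1 per keyword, tagged with that keyword
parts = [
    df1[df1["Text"].str.contains(c, regex=False)].assign(Topic=c)
    for c in classes
]

# Stack the per-keyword pieces; a text appears once per keyword it contains
df_topic = pd.concat(parts, ignore_index=True)
print(df_topic)
```

This keeps one row per (text, keyword) match, so a text that mentions several keywords still appears several times.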
The resultant DataFrame looks like this:
ID | Text | Topic |
---|---|---|
1 | I have completed my order | order |
2 | I have made the payment. When can I expect the order to be delivered? | order |
4 | I am done with registration and payment. I need the order number? | order |
5 | I am unable to complete registration. How will I even order? | order |
2 | I have made the payment. When can I expect the order to be delivered? | payment |
3 | I am unable to make the payment. | payment |
4 | I am done with registration and payment. I need the order number? | payment |
4 | I am done with registration and payment. I need the order number? | registration |
5 | I am unable to complete registration. How will I even order? | registration |
The problem is that the same text may contain keywords for several classes and so can be tagged with more than one topic (for example, the text for id=2 contains both order and payment). I can only have one record per id, so I would prefer to label the topics as Primary or Secondary based on the order in which they first occur from the beginning of the text. If a text has more than two topics, the first two take precedence, but in case the third (or nth) topic is needed later, I would like to store all of them as a list in the final field (see id=4 below).
ID | Text | Primary Topic | Secondary Topic | Identified Topics | Topics List |
---|---|---|---|---|---|
1 | I have completed my order | order | null | 1 | [order] |
2 | I have made the payment. When can I expect the order to be delivered? | payment | order | 2 | [payment,order] |
3 | I am unable to make the payment. | payment | null | 1 | [payment] |
4 | I am done with registration and payment. I need the order number? | registration | payment | 3 | [registration,payment,order] |
5 | I am unable to complete registration. How will I even order? | registration | order | 2 | [registration,order] |
Is it possible to do it this way? If not, what is a good way to approach such labelling issues?
CodePudding user response:
IIUC, you could use str.extractall combined with GroupBy.agg:
# df is the question's df1
lst = ["order", "payment", "registration"]
regex = f'({"|".join(lst)})' # if lst contains special chars, wrap in re.escape
df2 = df.join(df['Text']
              .str.extractall(regex)[0]
              .groupby(level=0).agg(**{'Primary Topic': 'first',
                                       'Secondary Topic': lambda x: x.iloc[1] if len(x)>1 else 'null',
                                       'Identified Topics': 'nunique',
                                       'Topics List': list})
              )
output:
ID Text Primary Topic Secondary Topic Identified Topics Topics List
0 1 I have completed my order order null 1 [order]
1 2 I have made the payment. When can I expect the order to be delivered? payment order 2 [payment, order]
2 3 I am unable to make the payment. payment null 1 [payment]
3 4 I am done with registration and payment. I need the order number? registration payment 3 [registration, payment, order]
4 5 I am unable to complete registration. How will I even order? registration order 2 [registration, order]
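One caveat: if a keyword occurs more than once in a text, the extractall approach above keeps every occurrence, so the secondary topic can end up equal to the primary one. The following is a minimal sketch of a variant, not the approach above: it swaps in str.findall, lower-cases the text first, and drops repeated keywords while keeping first-occurrence order. The row with ID 6 and the helper ordered_unique are hypothetical, added only to illustrate the repeated-keyword case, and missing topics come out as NaN rather than the string 'null'.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Text": [
        "I have completed my order",
        "I have made the payment. When can I expect the order to be delivered?",
        "I am unable to make the payment.",
        "I am done with registration and payment. I need the order number?",
        "I am unable to complete registration. How will I even order?",
        # hypothetical row: "order" occurs twice before "payment"
        "Order placed. My order is late and payment failed.",
    ],
})

lst = ["order", "payment", "registration"]
# re.escape guards against regex metacharacters in the keywords
regex = "(" + "|".join(re.escape(k) for k in lst) + ")"


def ordered_unique(matches):
    """Drop repeated keywords while keeping first-occurrence order."""
    return list(dict.fromkeys(matches))


# All keyword hits per row, in order of appearance, then de-duplicated
found = df["Text"].str.lower().str.findall(regex).map(ordered_unique)

df["Primary Topic"] = found.str[0]      # NaN when no keyword matched
df["Secondary Topic"] = found.str[1]    # NaN when fewer than two topics
df["Identified Topics"] = found.str.len()
df["Topics List"] = found
print(df.drop(columns="Text"))
```

Because findall returns an empty list for texts without any keyword, such rows are kept with a count of 0 instead of being dropped.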