Home > Net >  Extract data and sort them by date
Extract data and sort them by date

Time:11-25

I am trying to figure out an exercise on string manipulation and sorting. The exercise asks to extract words that have time reference (e.g., hours, days) from the text, and sort rows based on the time extracted in an ascendent order. An example of data is:

Customer     Text
1            12 hours ago — the customer applied for a discount
2            6 hours ago — the customer contacted the customer service
3            1 day ago — the customer reported an issue
4            1 day ago — no answer
4            2 days ago — Open issue
5            

In this task I can identify several difficulties:

- time reference can be expressed as hours/days/weeks
- there are null values or no reference to time
- get a time format suitable and more general, e.g., based on the current datetime

On the first point, I noted that generally the dates are before —, whether present, so it could be easy to extract them. On the second point, an if statement could avoid error messages due to incomplete/missing fields. I do not know how to answer to the third point, though.

My expected result would be:

Customer     Text                                                        Sort by
1            12 hours ago — the customer applied for a discount             1
2            6 hours ago — the customer contacted the customer service      2
3            1 day ago — the customer reported an issue                     2
4            1 day ago — no answer                                          2
4            2 days ago — Open issue                                        3
5            

CodePudding user response:

Given the DataFrame sample, I will assume that for this exercise that the first two words of the text are what you are after. I am unclear on how the sorting works, but for the third point, a more suitable time would be the current time minus the timedelta from by the Text column

You can apply an if-else lambda function to the first two words of each row of Text and convert this to a pandas Timedelta object - for example pd.Timedelta("1 day") will return a Timedelta object.

Then you can subtract the Timedelta column from the current time which you can obtain with pd.Timestamp.now():

df["Timedelta"] = df.Text.apply(lambda x: pd.Timedelta(' '.join(x.split(" ")[:2])) if pd.notnull(x) else x)
df["Time"] = pd.Timestamp.now() - df["Timedelta"]

Output:

>>> df
   Customer                                               Text       Timedelta                       Time
0         1  12 hours ago — the customer applied for a disc... 0 days 12:00:00 2021-11-23 09:22:40.691768
1         2  6 hours ago — the customer contacted the custo... 0 days 06:00:00 2021-11-23 15:22:40.691768
2         3         1 day ago — the customer reported an issue 1 days 00:00:00 2021-11-22 21:22:40.691768
3         4                              1 day ago — no answer 1 days 00:00:00 2021-11-22 21:22:40.691768
4         4                            2 days ago — Open issue 2 days 00:00:00 2021-11-21 21:22:40.691768
5         5                                                NaN             NaT                        NaT
  • Related