I have a SQLite table representing a chatlog. The two important columns for this question are 'content' and 'timestamp'.
I need to group the messages in the chatlog by conversations. Each message is only an individual line, so a conversation can be selected as each message joined by a new line using group_concat
group_concat(content, CHAR(10)
I want to identify a conversation by any messages which are within a length of time (such as 15 minutes) from each other. A conversation can be any length (including just an individual message, if there are no other messages within 15 minutes of it).
Knowing this, I can identify whether a message is the start or part of a conversation as
WHEN timestamp - LAG(timestamp, 1, timestamp) OVER (ORDER BY timestamp) < 900
But this is as far as I've gotten. I can make a column 'is_new_convo' using
WITH ordered_messages AS (
SELECT content, timestamp
FROM messages
ORDER BY timestamp
), conversations_identified AS (
SELECT *,
CASE
WHEN timestamp - LAG(timestamp, 1, timestamp) OVER (ORDER BY timestamp) < 900
THEN 0
ELSE 1
END AS is_new_convo
FROM ordered_messages
) SELECT * FROM conversations_identified
How can I then form a group of messages from where is_new_convo = 1
to the last subsequent is_new_convo = 0
?
Here is some sample data and the expected result.
CodePudding user response:
If you take the sum of the is_new_convo
column from the start to a certain row, you get the number of times a new conversation has been formed, resulting in an ID that is unique for all messages in a conversation (since is_new_convo
is 0 for messages continuing a conversation, they result in the same conversation ID). Using this, we can find the conversation ID for all messages, then group them together for group_concat
. This doesn't require referencing the original table multiple times, so the 'WITH' clauses aren't needed.
SELECT group_concat(content, CHAR(10)) as conversation
FROM (
SELECT content, timestamp,
SUM(is_new_convo) OVER (ORDER BY timestamp) as conversation_id
FROM (
SELECT content, timestamp,
CASE
WHEN timestamp - LAG(timestamp, 1, timestamp) OVER (ORDER BY timestamp) < 900
THEN 0
ELSE 1
END AS is_new_convo
FROM messages
)
) GROUP BY conversation_id