Home > Blockchain >  Select conversations from a SQLite chatlog
Select conversations from a SQLite chatlog

Time:01-04

I have a SQLite table representing a chatlog. The two important columns for this question are 'content' and 'timestamp'.

I need to group the messages in the chatlog by conversations. Each message is only an individual line, so a conversation can be selected as each message joined by a new line using group_concat

group_concat(content, CHAR(10)

I want to identify a conversation by any messages which are within a length of time (such as 15 minutes) from each other. A conversation can be any length (including just an individual message, if there are no other messages within 15 minutes of it).

Knowing this, I can identify whether a message is the start or part of a conversation as

WHEN timestamp - LAG(timestamp, 1, timestamp) OVER (ORDER BY timestamp) < 900

But this is as far as I've gotten. I can make a column 'is_new_convo' using

WITH ordered_messages AS (
    SELECT content, timestamp
    FROM messages
    ORDER BY timestamp
), conversations_identified AS (
    SELECT *,
        CASE
            WHEN timestamp - LAG(timestamp, 1, timestamp) OVER (ORDER BY timestamp) < 900
            THEN 0
            ELSE 1
        END AS is_new_convo
    FROM ordered_messages 
) SELECT * FROM conversations_identified 

How can I then form a group of messages from where is_new_convo = 1 to the last subsequent is_new_convo = 0?

Here is some sample data and the expected result.

CodePudding user response:

If you take the sum of the is_new_convo column from the start to a certain row, you get the number of times a new conversation has been formed, resulting in an ID that is unique for all messages in a conversation (since is_new_convo is 0 for messages continuing a conversation, they result in the same conversation ID). Using this, we can find the conversation ID for all messages, then group them together for group_concat. This doesn't require referencing the original table multiple times, so the 'WITH' clauses aren't needed.

SELECT group_concat(content, CHAR(10)) as conversation
FROM (
    SELECT content, timestamp,
        SUM(is_new_convo) OVER (ORDER BY timestamp) as conversation_id
    FROM (
        SELECT content, timestamp,
            CASE
                WHEN timestamp - LAG(timestamp, 1, timestamp) OVER (ORDER BY timestamp) < 900
                THEN 0
                ELSE 1
            END AS is_new_convo
        FROM messages
    )
) GROUP BY conversation_id
  • Related