how to create a multi-index based on values of another column?-CodePudding

I need to create a column that, if the first and last value in the column 'date_int' of each group (based on column ID) is 'yes', returns 'yes', otherwise it returns 'no' (which would be the output in column 'multi_index').

ID          A  index           timestamp   date_int multi_index
1   activity1      1 2021-02-01 12:03:20   yes      no
1   activity2      2 2021-02-11 12:03:20   no
1   activity3      3 2021-11-23 11:46:40   no
1   activity4      4 2021-11-24 11:46:40   no
1   activity5      5 2021-11-25 11:46:40   no
1   activity6      6 2021-11-26 11:46:40   no
2   activity7      1 2021-03-01 12:03:20   yes      yes
2   activity8      2 2021-03-11 12:03:20   no
2   activity9      3 2021-12-23 11:46:40   no
2   activity10     4 2021-12-27 11:46:40   no
2   activity11     5 2021-12-28 11:46:40   no
2   activity12     6 2021-12-29 11:46:40   yes

Any idea?

CodePudding user response：

Ambiguity: What do you mean by "last"? The highest index? Or the highest timestamp? I'll assume index.

Another issue: Did you want "no" or "" for multi? I'll assume "no".

I think this will generate the desired yes/no:

SELECT  A,
        IF(( SELECT ... ) = 'yes')
        AND( SELECT ... ) = 'yes'),
        'yes', 'no') AS multi
    FROM ( SELECT A,
                  1             AS mini,
                  MAX(`index`)  AS maxi
               FROM tbl ) AS x

Where the first SELECT is

SELECT  date_int
    FROM tbl
    WHERE tbl.A = x.A
      AND tbl.index = mini

and the second is

SELECT  date_int
    FROM tbl
    WHERE tbl.A = x.A
      AND tbl.index = maxi

Before proceeding, run that combined query. See if it delivers the 'correct' stuff.

To actually add the column:

ALTER TABLE tbl ADD COLUMN multi VARCHAR(3) NULL;

Then top populate it:

UPDATE tbl AS a
  JOIN ( ... ) AS b  ON a.A = b.A
                    AND a.index = 1
  SET a.multi = b.multi;

where the "..." is that mess I created above.

But.... Consider not including multi in the current table, since it is empty when index != 1. Instead, consider having another table with only one row per A. And a second column for multi. Then the UPDATE becomes an INSERT (with several changes).

CodePudding user response：

I couldn't do it succinctly but...

First, I created another dataframe with the results of the comparisons:

result = df.sort_values('index').groupby('ID', as_index=False).apply(lambda x: x.iloc[[0, -1]].date_int.apply(lambda x: x == 'yes').all())
result.columns = ['ID', 'check']

I used the index column to sort the dataframe and avoid unexpected results. You can change it to sort by the column that makes the most sense for you. I also renamed the columns of this auxiliar dataframe

Then, I created the desired result column:

df['multi_index'] = ''

And a function to change the value of the first row of each group to 'yes' or 'no' based on the result dateframe

def change_first_record(x):
  x['multi_index'].iloc[0] = result[result.ID == x.ID.values[0]].check.apply(lambda x: 'yes' if x else 'no').values[0]
  return x

Now, we just have to apply the above function to our original dataframe:

df = df.groupby('ID').apply(change_first_record)

Resulting in:

Hope it helps you find a cleaner way to do so!

CodePudding user response：

Here is a way using groupby() and transform()

g = df.groupby('ID')['date_int']

first,last = g.transform('first'),g.transform('last')
(df.assign(multi_index = (first.eq('yes') & last.eq('yes'))
.map({True:'yes',False:'no'})
.groupby(df['ID']).head(1)))

Output:

    ID           A  index   timestamp timestamp2 date_int multi_index
0    1   activity1      1  2021-02-01   12:03:20      yes          no
1    1   activity2      2  2021-02-11   12:03:20       no         NaN
2    1   activity3      3  2021-11-23   11:46:40       no         NaN
3    1   activity4      4  2021-11-24   11:46:40       no         NaN
4    1   activity5      5  2021-11-25   11:46:40       no         NaN
5    1   activity6      6  2021-11-26   11:46:40       no         NaN
6    2   activity7      1  2021-03-01   12:03:20      yes         yes
7    2   activity8      2  2021-03-11   12:03:20       no         NaN
8    2   activity9      3  2021-12-23   11:46:40       no         NaN
9    2  activity10      4  2021-12-27   11:46:40       no         NaN
10   2  activity11      5  2021-12-28   11:46:40       no         NaN
11   2  activity12      6  2021-12-29   11:46:40      yes         NaN