This is the result of one SQL query loaded into pandas. I want to concatenate the Label column based on Issue and Client. I tried GROUP BY, but it only works with integer values for me. Any idea how I can do this? A pandas-based solution would also be fine.
I also tried .groupby in pandas (command and output below), but it only gives me a subset of the needed dataframe.
Is it possible to update the Label column against each Issue in the first dataframe, remove the duplicates, and get the output as expected below?
SQL Version is:
Microsoft SQL Server 2014
Output:
| | Issue | Subject | type | Team | Sub Team | Client | Priority | CreatedOn | Label | BuiltOn | CreatedBy | Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | ABCABC | Bug | Develop | Automation | Andy | 0 | 2021-01-11 00:00:00 | Enhancement | None | John | InProgress |
1 | 2 | DEFDEF | Bug | Develop | Automation | Judy | 0 | 2021-01-10 00:00:00 | Feature | None | Andre | New |
2 | 3 | HIGHIG | Bug | Develop | Testing123 | Cathy | 2 | 2021-02-11 00:00:00 | Feature | None | Keith | New |
3 | 3 | HIGHIG | Bug | Develop | Testing123 | Cathy | 2 | 2021-02-11 00:00:00 | Internal | None | Keith | New |
4 | 4 | XYZXYZ | Bug | Develop | Automation | Jack | 1 | 2021-05-11 00:00:00 | Enhancement | None | Maya | Analysis |
5 | 4 | XYZXYZ | Bug | Develop | Automation | Jack | 1 | 2021-05-11 00:00:00 | Internal | None | Maya | Analysis |
6 | 4 | XYZXYZ | Bug | Develop | Automation | Larry | 1 | 2021-05-11 00:00:00 | Enhancement | None | Maya | Analysis |
7 | 4 | XYZXYZ | Bug | Develop | Automation | Larry | 1 | 2021-05-11 00:00:00 | Internal | None | Maya | Analysis |
8 | 4 | XYZXYZ | Bug | Develop | Automation | Colin | 1 | 2021-05-11 00:00:00 | Enhancement | None | Maya | Analysis |
9 | 4 | XYZXYZ | Bug | Develop | Automation | Colin | 1 | 2021-05-11 00:00:00 | Internal | None | Maya | Analysis |
10 | 4 | XYZXYZ | Bug | Develop | Automation | Nitin | 1 | 2021-05-11 00:00:00 | Enhancement | None | Maya | Analysis |
11 | 4 | XYZXYZ | Bug | Develop | Automation | Nitin | 1 | 2021-05-11 00:00:00 | Internal | None | Maya | Analysis |
12 | 4 | XYZXYZ | Bug | Develop | Automation | Lisa | 1 | 2021-05-11 00:00:00 | Enhancement | None | Maya | Analysis |
13 | 4 | XYZXYZ | Bug | Develop | Automation | Lisa | 1 | 2021-05-11 00:00:00 | Internal | None | Maya | Analysis |
Expected (Note the label column):
| | Issue | Subject | Issue_type | Team | Sub Team | Client | Priority | CreatedOn | Label | BuiltOn | CreatedBy | Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | ABC | Bug | Develop | Automation | Andy | 0 | 2021-01-11 00:00:00 | Enhancement | None | John | InProgress |
1 | 2 | DEF | Bug | Develop | Automation | Judy | 0 | 2021-01-10 00:00:00 | Feature | None | Andre | New |
2 | 3 | HIG | Bug | Develop | Testing | Cathy | 2 | 2021-02-11 00:00:00 | Feature, Internal | None | Keith | New |
3 | 4 | XYZ | Bug | Develop | Automation | Jack | 1 | 2021-05-11 00:00:00 | Enhancement, Internal | None | Maya | Analysis |
4 | 4 | XYZ | Bug | Develop | Automation | Larry | 1 | 2021-05-11 00:00:00 | Enhancement, Internal | None | Maya | Analysis |
5 | 4 | XYZ | Bug | Develop | Automation | Colin | 1 | 2021-05-11 00:00:00 | Enhancement, Internal | None | Maya | Analysis |
6 | 4 | XYZ | Bug | Develop | Automation | Nitin | 1 | 2021-05-11 00:00:00 | Enhancement, Internal | None | Maya | Analysis |
7 | 4 | XYZ | Bug | Develop | Automation | Lisa | 1 | 2021-05-11 00:00:00 | Enhancement, Internal | None | Maya | Analysis |
Update: This is the query:
SELECT I.Issue,
I.Subject,
I.type,
P.Team,
P.Subteam,
CR.Client,
I.Priority,
I.CreatedOn,
L.Label,
I.BuiltOn,
I.CreatedBy,
I.Status
FROM master.IssueRequests AS I
JOIN master.Participants AS P
ON P.Issue = I.Issue
JOIN master.ClientRecords AS CR
ON CR.Issue = I.Issue
JOIN master.IssueLabels AS L
ON L.Issue = I.Issue
WHERE I.Issue IN ('2652523', '2703670', '2984120')
Update 2
Output of df.groupby:
df.groupby(['Issue', 'Client'])['Label'].apply(','.join).reset_index()
Output:
| | Issue | Client | Label |
---|---|---|---|
0 | 1 | Andy | Enhancement |
1 | 2 | Judy | Feature |
2 | 3 | Cathy | Feature,Internal |
3 | 4 | Colin | Enhancement,Internal |
4 | 4 | Jack | Enhancement,Internal |
5 | 4 | Larry | Enhancement,Internal |
6 | 4 | Lisa | Enhancement,Internal |
7 | 4 | Nitin | Enhancement,Internal |
Clarification: Merging on all columns except Label will not work, since in some cases some of the other data might be null or different, which can cause rows to be missed altogether. If the data differs in the other columns, I can keep the first instance of that data.
CodePudding user response:
Update:
After the OP's clarification, it seems that the problem is actually a bit different: the contents of the columns other than the grouping columns Issue and Client may differ between grouped rows, and the final result should contain the first row's values for columns that differ within such a group.
An approach for this is to perform the grouping in Python as before and then join that (using merge()) with a version of the original dataframe in which you drop the Label column and then drop all duplicates based on Issue and Client. That gives you exactly the first instance of each group in case the data differs.
Without additional arguments, merge() will automatically do an inner join on all columns available in both dataframes, which in this case are Issue and Client:
df.groupby(['Issue', 'Client'])['Label'].apply(','.join).reset_index().merge(df.drop('Label', axis=1).drop_duplicates(['Issue', 'Client']))
Output:
| | Issue | Subject | type | Team | Sub Team | Client | Priority | CreatedOn | BuiltOn | CreatedBy | Status | Label |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | ABCABC | Bug | Develop | Automation | Andy | 0 | 2021-01-11 00:00:00 | None | John | InProgress | Enhancement |
1 | 2 | DEFDEF | Bug | Develop | Automation | Judy | 0 | 2021-01-10 00:00:00 | None | Andre | New | Feature |
2 | 3 | HIGHIG | Bug | Develop | Testing123 | Cathy | 2 | 2021-02-11 00:00:00 | None | Keith | New | Feature,Internal |
3 | 4 | XYZXYZ | Bug | Develop | Automation | Colin | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
4 | 4 | XYZXYZ | Bug | Develop | Automation | Jack | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
5 | 4 | XYZXYZ | Bug | Develop | Automation | Larry | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
6 | 4 | XYZXYZ | Bug | Develop | Automation | Lisa | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
7 | 4 | XYZXYZ | Bug | Develop | Automation | Nitin | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
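For readability, the one-liner above can also be broken into steps. This is just a sketch equivalent to the statement above; the intermediate names labels, first_rows and result are only for illustration, and the last (optional) line puts the columns back into the original order of df:
# Concatenate the labels per (Issue, Client) pair
labels = df.groupby(['Issue', 'Client'])['Label'].apply(','.join).reset_index()
# Keep only the first row of each (Issue, Client) pair, without the Label column
first_rows = df.drop('Label', axis=1).drop_duplicates(['Issue', 'Client'])
# Inner join on the shared columns Issue and Client
result = labels.merge(first_rows)
# Optional: restore the original column order of df (Label back in its original position)
result = result[df.columns]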
Previous answer:
Just group by all of the other columns:
import pandas as pd
data = [[1, 'ABCABC', 'Bug', 'Develop', 'Automation', 'Andy', 0, '2021-01-11 00:00:00', 'Enhancement', 'None', 'John', 'InProgress'],
[2, 'DEFDEF', 'Bug', 'Develop', 'Automation', 'Judy', 0, '2021-01-10 00:00:00', 'Feature', 'None', 'Andre', 'New'],
[3, 'HIGHIG', 'Bug', 'Develop', 'Testing123', 'Cathy', 2, '2021-02-11 00:00:00', 'Feature', 'None', 'Keith', 'New'],
[3, 'HIGHIG', 'Bug', 'Develop', 'Testing123', 'Cathy', 2, '2021-02-11 00:00:00', 'Internal', 'None', 'Keith', 'New'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Jack', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Jack', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Larry', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Larry', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Colin', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Colin', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Nitin', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Nitin', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Lisa', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Lisa', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis']]
df = pd.DataFrame(data, columns = ['Issue', 'Subject', 'type', 'Team', 'Sub Team', 'Client', 'Priority', 'CreatedOn', 'Label', 'BuiltOn', 'CreatedBy', 'Status'])
df.groupby(['Issue', 'Subject', 'type', 'Team', 'Sub Team', 'Client', 'Priority', 'CreatedOn', 'BuiltOn', 'CreatedBy', 'Status'])['Label'].apply(','.join).reset_index()
If you don't want to write out all the column names, you can also use a list comprehension to build the list automatically and exclude just the Label column from it, similar to this SO answer:
df.groupby([col for col in list(df) if col not in ['Label']])['Label'].apply(','.join).reset_index()
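Alternatively, df.columns.drop('Label') builds the same list of grouping columns without a comprehension (just a sketch; it returns all column names except Label in their original order):
# Group by every column except Label and join the labels within each group
df.groupby(list(df.columns.drop('Label')))['Label'].apply(','.join).reset_index()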
CodePudding user response:
I'll offer a second approach using SQL.
Concatenating strings in a GROUP BY is a bit tricky in MS SQL Server, since (before SQL Server 2017's STRING_AGG) there is no direct aggregate function for it as in other RDBMS. However, there is a workaround using FOR XML PATH that can be adapted to your problem.
The following statement gives you the concatenated labels based on your original query:
WITH tmp AS (SELECT I.Issue,
I.Subject,
I.type,
P.Team,
P.Subteam,
CR.Client,
I.Priority,
I.CreatedOn,
L.Label,
I.BuiltOn,
I.CreatedBy,
I.Status
FROM master.IssueRequests AS I
JOIN master.Participants AS P
ON P.Issue = I.Issue
JOIN master.ClientRecords AS CR
ON CR.Issue = I.Issue
JOIN master.IssueLabels AS L
ON L.Issue = I.Issue
WHERE I.Issue IN ('2652523', '2703670', '2984120')
)
SELECT A.Issue,
A.Client,
STUFF((
SELECT ', ' + B.Label
FROM tmp B
WHERE ISNULL(B.Issue, '') = ISNULL(A.Issue, '')
AND ISNULL(B.Client, '') = ISNULL(A.Client, '')
ORDER BY B.Issue
FOR XML PATH('')), 1, 2, ''
) AS Label
FROM
tmp A
GROUP BY
A.Issue, A.Client
This gives you
Issue | Client | Label |
---|---|---|
1 | Andy | Enhancement |
2 | Judy | Feature |
3 | Cathy | Feature,Internal |
4 | Colin | Enhancement,Internal |
4 | Jack | Enhancement,Internal |
4 | Larry | Enhancement,Internal |
4 | Lisa | Enhancement,Internal |
4 | Nitin | Enhancement,Internal |
You can then use ROW_NUMBER() to JOIN this with the first row of each Issue-Client combination:
WITH tmp AS (SELECT I.Issue,
I.Subject,
I.type,
P.Team,
P.Subteam,
CR.Client,
I.Priority,
I.CreatedOn,
L.Label,
I.BuiltOn,
I.CreatedBy,
I.Status
FROM master.IssueRequests AS I
JOIN master.Participants AS P
ON P.Issue = I.Issue
JOIN master.ClientRecords AS CR
ON CR.Issue = I.Issue
JOIN master.IssueLabels AS L
ON L.Issue = I.Issue
WHERE I.Issue IN ('2652523', '2703670', '2984120')
)
SELECT C.Issue,
C.Subject,
C.type,
C.Team,
C.Subteam,
C.Client,
C.Priority,
C.CreatedOn,
D.Label,
C.BuiltOn,
C.CreatedBy,
C.Status
FROM (SELECT tmp.*,
row_number() OVER (PARTITION BY Issue, Client ORDER BY Issue) as rn
FROM tmp) C
JOIN (SELECT A.Issue,
A.Client,
STUFF((
SELECT ', ' + B.Label
FROM tmp B
WHERE ISNULL(B.Issue, '') = ISNULL(A.Issue, '')
AND ISNULL(B.Client, '') = ISNULL(A.Client, '')
ORDER BY B.Issue
FOR XML PATH('')), 1, 2, ''
) AS Label
FROM tmp A
GROUP BY A.Issue, A.Client) D
ON C.Issue = D.Issue AND C.Client = D.Client AND C.rn = 1
which gives you your desired result:
Issue | Subject | type | Team | Sub Team | Client | Priority | CreatedOn | BuiltOn | CreatedBy | Status | Label |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ABCABC | Bug | Develop | Automation | Andy | 0 | 2021-01-11 00:00:00 | None | John | InProgress | Enhancement |
2 | DEFDEF | Bug | Develop | Automation | Judy | 0 | 2021-01-10 00:00:00 | None | Andre | New | Feature |
3 | HIGHIG | Bug | Develop | Testing123 | Cathy | 2 | 2021-02-11 00:00:00 | None | Keith | New | Feature,Internal |
4 | XYZXYZ | Bug | Develop | Automation | Colin | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
4 | XYZXYZ | Bug | Develop | Automation | Jack | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
4 | XYZXYZ | Bug | Develop | Automation | Larry | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
4 | XYZXYZ | Bug | Develop | Automation | Lisa | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
4 | XYZXYZ | Bug | Develop | Automation | Nitin | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
You can test it in this db<>fiddle.
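Since the dataframe in the question already comes from running SQL through pandas, the aggregated statement can be executed the same way. A minimal sketch, assuming an existing database connection object named conn (e.g. from pyodbc or SQLAlchemy; not part of the original post):
import pandas as pd

# conn is assumed to be an open connection to the SQL Server instance
query = """
-- paste the full WITH tmp AS (...) SELECT ... statement from above here
"""
result = pd.read_sql(query, conn)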