Home > Enterprise >  How to get the number of IDs that have the same combination of distinct values in the 'location
How to get the number of IDs that have the same combination of distinct values in the 'location

Time:03-22

I have a table with ids and locations they have been to.

id Location
1 Maryland
1 Iowa
2 Maryland
2 Texas
3 Georgia
3 Iowa
4 Maryland
4 Iowa
5 Maryland
5 Iowa
5 Texas

I'd like to perform a query that would allow me to get the number of ids per combination.

In this example table, the output would be -

Maryland, Iowa - 2
Maryland, Texas - 1
Georgia, Iowa - 1
Maryland, Iowa, Texas - 1

My original thought was to add the ASCII values of the distinct locations of each id, and see how many have each value, and what the combinations are that correspond to the value. I was not able to do that as SQL server would not let me cast an nvarchar as a numeric data type. Is there any other way I could use SQL to get the number of devices per combination? Using python to get the number of ids per combination is also acceptable, however, SQL is preferred.

CodePudding user response:

If you want to solve this in SQL and you are running SQL Server 2017 or later, you can use a CTE to aggregate the locations for each id using STRING_AGG, and then count the occurrences of each aggregated string:

WITH all_locations AS (
  SELECT STRING_AGG(Location, ', ') WITHIN GROUP (ORDER BY Location) AS aloc
  FROM locations
  GROUP BY id
)
SELECT aloc, COUNT(*) AS cnt
FROM all_locations
GROUP BY aloc
ORDER BY cnt, aloc

Output:

aloc                    cnt
Georgia, Iowa           1
Iowa, Maryland, Texas   1
Maryland, Texas         1
Iowa, Maryland          2

Note that I have applied an ordering to the STRING_AGG to ensure that someone who visits Maryland and then Iowa is treated the same way as someone who visits Iowa and then Maryland. If this is not the desired behaviour, simply delete the WITHIN GROUP clause.

Demo on dbfiddle

CodePudding user response:

Use groupby agg value_counts:

new_df = df.groupby('id')['Location'].agg(list).str.join(', ').value_counts().reset_index()

Output:

>>> new_df
                   index  Location
0         Maryland, Iowa         2
1        Maryland, Texas         1
2          Georgia, Iowa         1
3  Maryland, Iowa, Texas         1

CodePudding user response:

new_df index Location 0 Maryland, Iowa 2 1 Maryland, Texas 1 2 Georgia, Iowa 1 3 Maryland, Iowa, Texas 1

  • Related