With SQL, how do I deal with JOINing databases with a lot of columns that all need to be unique?


Okay, so this is the first time I'm working with a big database and it's quite scary. This is an example of what I want to happen:

Tables

table 1
ID   art1  art2
1    90    20
2    20    80
3    20    20

table 2
ID   art1  art2
1    20    20
2    40    30
4    20    50

Desired Result (order doesn't matter)

table 1
ID   art1  art2
1    ...
2
3
4

I kind of get that on a small scale I'd use a LEFT JOIN for this, and from what I read, GROUP BY for at least the attribute columns (if not the ID?).

My problem is that these tables are huge. There are 30 or more columns and about 25k rows.

So am I expected to write 30 GROUP BYs? Isn't there something more efficient? Like GROUP ALL?

There is also a weird thing about these tables. They have a lot of null rows (with a 1 in some of the attribute columns), and they all have an ID of 0. But they have to stay, because for functional reasons the table has to have exactly 26001 rows. So after I'm done I have to shave off as many rows as I've added, but I can't do that outside of SQL, and doing it in SQL is faster for me anyway.

Also, is my thinking even correct? So far I've only tried one query, from before I found out about GROUP BY. I waited 5 minutes for about half a million rows, so that wasn't good. My query was:

SELECT *
FROM `table1` 
LEFT JOIN `table2`
USING (ID)

And now I'm thinking it should be

SELECT *
FROM `table1` 
LEFT JOIN `table2`
USING (ID)
GROUP BY *insert all columns?*

But I'm not sure: do I also have to "line up" all the columns to avoid repeated results? Or do I have to use DISTINCT? On all 30 columns again?

CodePudding user response:

GROUP BY is used to group together rows that have the same values in certain columns and then aggregate the values of the other columns. In your case, it sounds like you want to keep all rows from both tables, so there's no need for GROUP BY.
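For instance, a typical GROUP BY query (using the column names from your sample table purely as an illustration) collapses rows into one aggregate row per group, which is not what you want here:

-- one output row per ID, with the other columns aggregated
SELECT ID, COUNT(*) AS row_count, AVG(art1) AS avg_art1
FROM `table1`
GROUP BY ID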

Instead of selecting all columns with SELECT *, you should specify only the columns that you actually need in the query. This will also help to improve performance.

DISTINCT is used to remove duplicate rows from the result set. If you are getting repeated results, it could be because of duplicate values in the ID column; DISTINCT will collapse rows that are completely identical.

SELECT *
FROM `table1` 
LEFT JOIN `table2`
ON table1.ID = table2.ID
WHERE table1.ID != 0

The WHERE clause will exclude the rows with ID = 0 from the result.
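If you still get repeated rows, a DISTINCT variant of the same query is a reasonable sketch; the column list below just reuses the names from your sample tables, so swap in the ones you actually need:

-- DISTINCT drops rows that are identical across every selected column
SELECT DISTINCT
       table1.ID,
       table1.art1 AS t1_art1,
       table1.art2 AS t1_art2,
       table2.art1 AS t2_art1,
       table2.art2 AS t2_art2
FROM `table1`
LEFT JOIN `table2`
ON table1.ID = table2.ID
WHERE table1.ID != 0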

It's also worth noting that the performance of your query can be improved by creating indexes on the ID column of both tables.
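Assuming a MySQL-style database (the backticks suggest that) and that ID isn't already a primary key or otherwise indexed, that could be as simple as the following; the index names are just placeholders:

CREATE INDEX idx_table1_id ON `table1` (ID);
CREATE INDEX idx_table2_id ON `table2` (ID);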

Hope it helps

CodePudding user response:

Considering the other comments and answers, my suggestions are:

  1. Use SELECT * with caution; it usually queries everything, including things you don't need. Since BigQuery is columnar, the query's cost grows as you add columns, so take your case into account here.
  2. If you want to give the join another try, consider using an INNER JOIN for your case: it filters out what you don't want, and it performs better than a WHERE clause at the end, which in a big-data scenario is a win.
  3. UNION can be a great solution, even if you have already tested it. To address your concern about duplicate data: BigQuery will consider all columns, so if an ID is duplicated but has different data, you will have to specify which row is the correct one. To do that (see the sketch at the end of this answer), consider:
    1. using a window function, ROW_NUMBER(), to number the rows within each group. Window functions perform better than GROUP BY in a big-data scenario.
    2. making sure you put all the conditions that identify uniqueness into the PARTITION BY of the window. A good trick is to ORDER BY ... DESC, so the first record is the one you want.
    3. once you have done that, filtering on the row number: every record whose row number equals 1 is the one you want to keep.

Window functions are very useful in big-data scenarios, and there is a magnificent article out there that helps to understand them better.
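As a rough sketch of that approach, assuming the two sample tables are combined with UNION ALL, that one row per ID should be kept, and that the filler ID = 0 rows should be dropped (the ORDER BY column used to pick the "correct" row is just a guess, so swap in whatever identifies the right record):

SELECT ID, art1, art2
FROM (
  SELECT ID, art1, art2,
         -- numbers the duplicates per ID; ORDER BY decides which one wins
         ROW_NUMBER() OVER (PARTITION BY ID ORDER BY art1 DESC) AS rn
  FROM (
    SELECT ID, art1, art2 FROM `table1`
    UNION ALL
    SELECT ID, art1, art2 FROM `table2`
  ) AS combined
) AS ranked
WHERE rn = 1      -- keep only the first row for each ID
  AND ID != 0     -- drop the filler rows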
