How can I extend an SQL table with new primary keys as well as add up values for exisiting keys?-CodePudding

I want to join or update the following two tables and also add up df for existing words. So if the word endeavor does not exist in the first table, it should be added with its df value or if the word hello exists in both tables df should be summed up.

FYI I'm using MariaDB and PySpark to do word counts on documents and calculate tf, df, and tfidf values.

Table name: df

 -------- ---- 
|    word|  df|
 -------- ---- 
|vicinity|   5|
|   hallo|   2|
|  admire|   3|
| settled|   1|
 -------- ----

Table name: word_list

|      word| df|
 ---------- --- 
|     hallo|  1|
|   settled|  1|
|  endeavor|  1|
 ---------- ---

So in the end the updated/combined table should look like this:

|      word| df|
 ---------- --- 
|  vicinity|  5|
|     hallo|  3|
|    admire|  3|
|   settled|  2|
|  endeavor|  1|
 ---------- ---

What I've tried to do so far is the following:

SELECT df.word, df.df   word_list.df FROM df FULL OUTER JOIN word_list ON df.word=word_list.word
SELECT df.word FROM df JOIN word_list ON df.word=word_list.word
SELECT df.word FROM df FULL OUTER JOIN word_list ON df.word=word_list.word

None of them worked, I either get a table with just null values, some null values, or some exception. I'm sure there must be an easy SQL statement to achieve this but I've been stuck with this for hours and also haven't found anything relatable on stack overflow.

CodePudding user response：

You just need to UNION the two tables first, then aggregate on the word. Since the tables are identically structured it's very easy. Look at this fiddle. I have used maria 10.3 since you didn't specify, but these queries should be completely compliant with (just about) any DBMS.

https://dbfiddle.uk/?rdbms=mariadb_10.3&fiddle=c6d86af77f19fc1f337ad1140ef07cd2

select word, sum(df) as df
from (
  select * from df 
  UNION ALL
  select * from word_list
) z
group by word
order by sum(df) desc;

UNION is the vertical cousin of JOIN, that is, UNION joins to datasets vertically or row-wise, and JOIN adds them horizontally, that is by adding columns to the output. Both datasets need to have the same number of columns for the UNION to work, and you need to use UNION ALL here so that the union returns all rows, because the default behavior is to return unique rows. In this dataset, since settled has a value of 1 in both tables, it would only have one entry in the UNION if you don't use the ALL keyword, and so when you do the sum the value of df would be 1 instead of 2, as you are expecting.

The ORDER BY isn't necessary if you are just transferring to a new table. I just added it to get my results in the same order as your sample output.

Let me know if this worked for you.