I want to join or update the following two tables and also add up df
for existing words. So if the word endeavor
does not exist in the first table, it should be added with its df
value or if the word hello
exists in both tables df
should be summed up.
FYI I'm using MariaDB and PySpark to do word counts on documents and calculate tf, df, and tfidf values.
Table name: df
-------- ----
| word| df|
-------- ----
|vicinity| 5|
| hallo| 2|
| admire| 3|
| settled| 1|
-------- ----
Table name: word_list
| word| df|
---------- ---
| hallo| 1|
| settled| 1|
| endeavor| 1|
---------- ---
So in the end the updated/combined table should look like this:
| word| df|
---------- ---
| vicinity| 5|
| hallo| 3|
| admire| 3|
| settled| 2|
| endeavor| 1|
---------- ---
What I've tried to do so far is the following:
SELECT df.word, df.df word_list.df FROM df FULL OUTER JOIN word_list ON df.word=word_list.word
SELECT df.word FROM df JOIN word_list ON df.word=word_list.word
SELECT df.word FROM df FULL OUTER JOIN word_list ON df.word=word_list.word
None of them worked, I either get a table with just null
values, some null
values, or some exception. I'm sure there must be an easy SQL statement to achieve this but I've been stuck with this for hours and also haven't found anything relatable on stack overflow.
CodePudding user response:
You just need to UNION the two tables first, then aggregate on the word. Since the tables are identically structured it's very easy. Look at this fiddle. I have used maria 10.3 since you didn't specify, but these queries should be completely compliant with (just about) any DBMS.
https://dbfiddle.uk/?rdbms=mariadb_10.3&fiddle=c6d86af77f19fc1f337ad1140ef07cd2
select word, sum(df) as df
from (
select * from df
UNION ALL
select * from word_list
) z
group by word
order by sum(df) desc;
UNION is the vertical cousin of JOIN, that is, UNION joins to datasets vertically or row-wise, and JOIN adds them horizontally, that is by adding columns to the output. Both datasets need to have the same number of columns for the UNION to work, and you need to use UNION ALL here so that the union returns all rows, because the default behavior is to return unique rows. In this dataset, since settled has a value of 1 in both tables, it would only have one entry in the UNION if you don't use the ALL keyword, and so when you do the sum the value of df would be 1 instead of 2, as you are expecting.
The ORDER BY isn't necessary if you are just transferring to a new table. I just added it to get my results in the same order as your sample output.
Let me know if this worked for you.