Home > Software engineering >  SQL - A simple data analysis related question
SQL - A simple data analysis related question

Time:08-04

I’m new to Data Analytics and I just finished learning SQL. So far I have learnt SQL functions or clauses like SELECT, FROM, GROUP BY, HAVING, ORDER BY etc… The way these clauses work are taught individually or in bits and pieces. But how does it work in a real life project?

Let’s say I’m given a project and I’m using “table.a” and "table.b". The first thing I have to do is cleaning the data in the table for any inconsistencies like deleting the duplicates, trimming extra spaces etc. before I go to the next step.

My question is, when I clean the data, does it get saved in the original table.a and table.b for me to use in the analysis process? If the cleaned data is not saved to the original table, where does it get saved? How do I access the data which I just cleaned?

CodePudding user response:

It will depend on the project you are working. Usually, if you work as a Data Analyst there should be a Data Warehouse where you could create temporary tables to run every day (in the case that those db are updated everyday).

Another option it's to create a view and to run it every day you need to work with that query (It's more or less the same resolution as the first one. This is usually a Data Engineer Job, where you send the query and then it's automatised by that role)

A 3rd option it's to keep and save the results of your query into a .csv file and from that file go to PowerBI or Tableau, or your Data Visualization tool in order to do the analysis (Manual option, don't recommend it)

Another option it's to connect to SQL directly from your Data Visualization tool and then work with the data(if it's connect to a db it should update automatically, but if you are connected to your localhost it will depend if you change the data inside the tables or not).

I hope this helps you!

CodePudding user response:

Nothing happens in a database unless you explicitly tell it to happen. There is no "Clean my data" functionality that leads to your question "Where does my data end up if I clean it". It's the wrong question to ask. The real question is "Where should I put my cleaned data or the logic by which I clean it"

Imagine:

CREATE TABLE myTable (id int, myCol varchar(100));
INSERT INTO myTable VALUES (1, ' something '), (2, ' something else.   '), (3, 'another thing');

SELECT * FROM myTable;

 ---- ------------------ 
| id |      myCol       |
 ---- ------------------ 
|  1 |  something       |
|  2 |  something else. |
|  3 | another thing    |
 ---- ------------------ 

Look at those ugly extra spaces on the beginning and end of those myCol strings. It's awful. Let's "Clean" it.


Using an UPDATE Statement

We could run an UPDATE statement to trim myCol of those ugly extra space characters:

UPDATE myTable SET myCol = TRIM(myCol);

SELECT * FROM myTable;

 ---- ----------------- 
| id |      myCol      |
 ---- ----------------- 
|  1 | something       |
|  2 | something else. |
|  3 | another thing   |
 ---- ----------------- 

We explicitly told the database to update the values in column myCol to the trimmed value of myCol. It's nice and clean and in the same table where the original value resided.


Using a SELECT statement

Or maybe you don't want something so permanent. Those untrimmed values are gone once we run the UPDATE statement and there is no getting them back. Perhaps someone needs those space characters, but for your analysis they are messing things up. Instead you can just submit your SELECT statement and perform the trim on the fly:

SELECT id, TRIM(myCol) FROM myTable

 ---- ----------------- 
| id |      myCol      |
 ---- ----------------- 
|  1 | something       |
|  2 | something else. |
|  3 | another thing   |
 ---- ----------------- 

Using a View

You could also write that SELECT into a view so you can just select from the view without having to remember the logic for cleaning up:

CREATE VIEW myView AS SELECT TRIM(myColumn) FROM myTable;

SELECT * FROM myView

 ---- ----------------- 
| id |      myCol      |
 ---- ----------------- 
|  1 | something       |
|  2 | something else. |
|  3 | another thing   |
 ---- ----------------- 

Using a new table

If this view or the select statement are slow because the data is HUGE and you want the clean data in a table, but you don't want to change the data in the original table (since there isn't an undo button), you might make a new table:

CREATE TABLE myCleanTable AS (SELECT id, TRIM(myCol) FROM myTable)

SELECT * FROM myCleanTable;

 ---- ----------------- 
| id |      myCol      |
 ---- ----------------- 
|  1 | something       |
|  2 | something else. |
|  3 | another thing   |
 ---- ----------------- 

In the end, there is no proper/real-life way that this happens. You chose what fits your needs and tell the database to do it.

Generally we don't want 20 versions of the same table laying around so creating copies of the table is generally not a great solution. You'll find you end up with myTable_V1_Final_Use_This_One_DocPatel_Q12022_DONT_DELETE and 20 varieties of the same name.

Likewise having a TON of logic in a single VIEW where that view is selected on often can lead to database performance problems. In our example case it would have to run that TRIM() logic every time someone selects from the view. That could be a drain on your server's CPU.

On the other hand, just submitting SQL with the logic in it can get hard to manage since you have to save the SQL somewhere (notepad.exe) which makes it difficult to share clean data with other coworkers.

Updating a table carries a lot of risks since there is no undo button. Once data is overwritten the original is gone. However, it's often the right decision when everyone using that data agrees on the step to clean it. Who wants to keep garbage around, right?

It's a balance of making the right decision at the right time for your requirements. Likely whatever organization you join will already have security and standards in place to guide you and make sure you aren't making bad decisions with this stuff, so it's all good.

  • Related