So, this is the situation: I have a CSV file who looks like this:
show_id title cast
1 Batman Robert Pattinson, Collin Farrel, Zoë Kravitz
2 Twilight Robert Pattinson, Kristen Stewart
3 Ava Jessica Chastain, Collin Farrel
What I need to do is open this CSV in a Python function, do some stuff to fix white spaces and that.
Then I need to upload that in a SQL database, (whatever I want, but I choose MySQL), no problem with that.
PROBLEM
My main issue is that then I need (because mi challenge said that) to create a Query to count how many times an actor appears in all the movies in the list. So, in this case the query should display something like this:
Actor Ammount_of_movies
Robert Pattinson 2
Collin Farrel 2
Zoë Kravitz 1
Kristen Stewart 1
Jessica Chastain 1
As you can see I don't have a name to search with LIKE or CONTAINS. So, how can I do that? Because, in the CSV, the list of actors per movie has, well, more than 1 actor, and I save them in a varchar or text type in the database, so each row has more than one actor.
Or should I create another table with actors and foreing keys to movies? Or is something than I can't do in MySQL but in other SQL is possible?
CodePudding user response:
If you look for something performance efficient you should rather split the data (create 3 tables in total - movies, actors, casts) and connect actors with movies using the casts, then write simple sql with some joins like:
Select actors.name as Actor, count(movies.title) as Amount_of_movies from actors,
inner join cast on cast.actor_id = actors.actor_id
inner join movies on movies.movie_id = cast.movie_id;
You are able to do it another way using https://sebhastian.com/mysql-split-string/ or using psql / plsql and stored procedure. I would just split the data if it's possible.
Cheers.
CodePudding user response:
You need to normalize your data and you might use:
select t1.cast as Actor ,count(t1.cast) as Ammount_of_movies
from test_tbl t1
inner join (select title,cast
from test_tbl
group by title,cast
) t2 on t1.title=t2.title and t1.cast=t2.cast
group by t1.cast
order by Ammount_of_movies desc;