I have the following df
:
Score num_comments titles
0 134 518 Uhaul implement nicotine-free hiring policy
1 28 43 Orangutan saves child from a giant volcano
2 30 114 Swimmer dies in a horrific shark attack in harbour
3 745 298 More teenagers than ever are addicted to glue
4 40 67 Lebanese lawyers union accuse Al Capone of fraud
...
9366 345 32 City of Louisville closed off this summer
9367 1200 234 New york rats "stronger than ever", reports say
9368 432 123 Congolese militia shipwrecked in Norway
9369 594 203 Scientists now agree on how to use ice in drinks
9370 611 153 Historic drought hits Atlantis
Now I would like to create a new dataframe where I can see what score and how many comments each word gets. Like this: df2
Word score num_comments
Uhaul 134 518
implement 134 518
nicotine-free 134 518
hiring 134 518
policy 134 518
Orangutan 28 43
saves 28 43
child 28 43
from 28 43
a 28 43
giant 28 43
volcano 28 43
...
etc..
I have tried Splitting the title into separate words and then exploding:
In [9]: df3
Out[9]:
df3['titles_split'] = df['titles'].str.split()
This gave me a column that looked like this:
Score num_comments titles_split
0 134 518 [Uhaul, implement, nicotine-free, hiring, policy]
1 28 43 [Orangutan, saves, child, from, a, giant, volcano]
2 30 114 [Swimmer, dies, in, a, horrific, shark, attack, in, harbour]
3 745 298 [More, teenagers, than, ever, are, addicted, to, glue]
4 40 67 [Lebanese, lawyers, union, accuse, Al, Capone, of, fraud]
...
9366 345 32 [City, of, Louisville, closed, off, this, summer]
9367 1200 234 [New, york, rats, stronger, than, ever, reports, say]
9368 432 123 [Congolese, militia, shipwrecked, in, Norway]
9369 594 203 [Scientists, now, agree, on, how, to, use, ice, in, drinks]
9370 611 153 [Historic, drought, hits, Atlantis]
Then I tried this code:
df3.explode(df3.assign(titles_split=df3.titles_split.str.split(',')), 'titles_split')
But I got the following error message:
ValueError: column must be a scalar, tuple, or list thereof
The same thing happened when I tried it for titles in df2.
I also tried creating new columns that repeated scores and num_comments as many times as there are words in titles (or titles_split). The idea was to create a dataframe like this:
In [9]: df4
Out[9]:
Score num_comments titles_split score_repeated
0 134 518 [Uhaul, implement, nicotine-free, hiring, policy] 134,134,134,134,134,134
1 28 43 [Orangutan, saves, child, from, a, giant, volcano] 28,28,28,28,28,28,28
2 30 114 [Swimmer, dies, in, a, horrific, shark, attack, in, harbour] 30,30,30 etc..
3 745 298 [More, teenagers, than, ever, are, addicted, to, glue] etc.
4 40 67 [Lebanese, lawyers, union, accuse, Al, Capone, of, fraud] etc
...
9366 345 32 [City, of, Louisville, closed, off, this, summer] etc
9367 1200 234 [New, york, rats, stronger, than, ever, reports, say] etc
9368 432 123 [Congolese, militia, shipwrecked, in, Norway] etc
9369 594 203 [Scientists, now, agree, on, how, to, use, ice, in, drinks] etc
9370 611 153 [Historic, drought, hits, Atlantis] etc
And then explode on titles_split, score_repeated and comments_repeated like this:
df4.explode(['titles_split', 'score_repeated', 'comments_repeated'])
But I never got to that point because I couldn't get repeated columns. I tried the following code:
df3['score_repeat'] = df3.apply(lambda x: [x.score] * len(x.titles_split) , axis =1)
Which gave me this error message:
TypeError: object of type 'float' has no len()
Then I tried:
df3['score_repeat'] = [[y] * x for x, y in zip(df3['titles_split'].str.len(),df['score'])]
Which gave me:
TypeError: can't multiply sequence by non-int of type 'float'
But I am not even sure I am going about this the right way. Do I even need to create score_repeated and comments_repeated?
CodePudding user response:
Assume this is the df
Score num_comments titles
0 134 518 Uhaul implement nicotine-free hiring policy
1 28 43 Orangutan saves child from a giant volcano
2 30 114 Swimmer dies in a horrific shark attack in harbour
3 745 298 More teenagers than ever are addicted to glue
4 40 67 Lebanese lawyers union accuse Al Capone of fraud
5 345 32 City of Louisville closed off this summer
6 1200 234 New york rats "stronger than ever", reports say
7 432 123 Congolese militia shipwrecked in Norway
8 594 203 Scientists now agree on how to use ice in drinks
9 611 153 Historic drought hits Atlantis
You can try the following:
df['titles'] = df['titles'].str.replace('"', '').str.replace(',', '') #data cleaning
df['titles'] = df['titles'].str.split() #split sentence into a list (of single words)
df2 = df.explode('titles', ignore_index=True)
df2.columns = ['score', 'num_comments', 'word']
print(df2)
score num_comments word
0 134 518 Uhaul
1 134 518 implement
2 134 518 nicotine-free
3 134 518 hiring
4 134 518 policy
5 28 43 Orangutan
6 28 43 saves
7 28 43 child
8 28 43 from
9 28 43 a
10 28 43 giant
11 28 43 volcano
12 30 114 Swimmer
13 30 114 dies
14 30 114 in
15 30 114 a
16 30 114 horrific
17 30 114 shark
18 30 114 attack
19 30 114 in
20 30 114 harbour
21 745 298 More
22 745 298 teenagers
23 745 298 than
24 745 298 ever
25 745 298 are
26 745 298 addicted
27 745 298 to
28 745 298 glue
29 40 67 Lebanese
30 40 67 lawyers
31 40 67 union
32 40 67 accuse
33 40 67 Al
34 40 67 Capone
35 40 67 of
36 40 67 fraud
37 345 32 City
38 345 32 of
39 345 32 Louisville
40 345 32 closed
41 345 32 off
42 345 32 this
43 345 32 summer
44 1200 234 New
45 1200 234 york
46 1200 234 rats
47 1200 234 stronger
48 1200 234 than
49 1200 234 ever
50 1200 234 reports
51 1200 234 say
52 432 123 Congolese
53 432 123 militia
54 432 123 shipwrecked
55 432 123 in
56 432 123 Norway
57 594 203 Scientists
58 594 203 now
59 594 203 agree
60 594 203 on
61 594 203 how
62 594 203 to
63 594 203 use
64 594 203 ice
65 594 203 in
66 594 203 drinks
67 611 153 Historic
68 611 153 drought
69 611 153 hits
70 611 153 Atlantis
Data cleaning needed
I noticed that there are strings with "
(double-quotations) and ,
(comma), meaning the data is not clean. You could do some data cleaning with the following:
df['titles'] = df['titles'].str.replace('"', '').str.replace(',', '')