How do I explode my dataframe based on each word in a column?


I have the following df:

        Score       num_comments     titles
0       134         518              Uhaul implement nicotine-free hiring policy
1       28          43               Orangutan saves child from a giant volcano
2       30          114              Swimmer dies in a horrific shark attack in harbour
3       745         298              More teenagers than ever are addicted to glue 
4       40          67               Lebanese lawyers union accuse Al Capone of fraud
...
9366    345         32               City of Louisville closed off this summer
9367    1200        234              New york rats "stronger than ever", reports say
9368    432         123              Congolese militia shipwrecked in Norway
9369    594         203              Scientists now agree on how to use ice in drinks
9370    611         153              Historic drought hits Atlantis
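
(For reference, a copy-pastable version of the first few rows, with values taken from the table above, can be built like this:)

import pandas as pd

# Minimal reproducible sample of the dataframe above (first three rows only)
df = pd.DataFrame({
    'Score': [134, 28, 30],
    'num_comments': [518, 43, 114],
    'titles': [
        'Uhaul implement nicotine-free hiring policy',
        'Orangutan saves child from a giant volcano',
        'Swimmer dies in a horrific shark attack in harbour',
    ],
})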

Now I would like to create a new dataframe (df2) where I can see what score and how many comments each word gets, like this:

Word           score    num_comments
Uhaul          134      518
implement      134      518
nicotine-free  134      518
hiring         134      518
policy         134      518
Orangutan      28       43
saves          28       43
child          28       43
from           28       43
a              28       43
giant          28       43
volcano        28       43
...

etc..

I have tried splitting the titles into separate words and then exploding. First the split:

df3['titles_split'] = df['titles'].str.split()
This gave me a column that looked like this:
         Score       num_comments     titles_split
    0       134         518               [Uhaul, implement, nicotine-free, hiring, policy]
    1       28          43               [Orangutan, saves, child, from, a, giant, volcano]
    2       30          114    [Swimmer, dies, in, a, horrific, shark, attack, in, harbour]
    3       745         298          [More, teenagers, than, ever, are, addicted, to, glue]
    4       40          67        [Lebanese, lawyers, union, accuse, Al, Capone, of, fraud]
    ...
    9366    345         32                [City, of, Louisville, closed, off, this, summer]
    9367    1200        234           [New, york, rats, stronger, than, ever, reports, say]
    9368    432         123                   [Congolese, militia, shipwrecked, in, Norway]
    9369    594         203     [Scientists, now, agree, on, how, to, use, ice, in, drinks]
    9370    611         153                             [Historic, drought, hits, Atlantis]

Then I tried this code:

df3.explode(df3.assign(titles_split=df3.titles_split.str.split(',')), 'titles_split')

But I got the following error message:

ValueError: column must be a scalar, tuple, or list thereof

The same thing happened when I tried it for titles in df2.
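
For context: DataFrame.explode expects the label of the column to unnest (or a list of labels), not a DataFrame, which is what the ValueError is pointing at. A minimal sketch, assuming titles_split already holds the word lists shown above:

# explode takes a column name; the other columns are repeated for each list element
df3 = df3.explode('titles_split', ignore_index=True)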

I also tried creating new columns that repeated scores and num_comments as many times as there are words in titles (or titles_split). The idea was to create a dataframe like this:

In [9]: df4
Out[9]:
      Score  num_comments  titles_split                                                   score_repeated
0     134    518           [Uhaul, implement, nicotine-free, hiring, policy]              134,134,134,134,134
1     28     43            [Orangutan, saves, child, from, a, giant, volcano]             28,28,28,28,28,28,28
2     30     114           [Swimmer, dies, in, a, horrific, shark, attack, in, harbour]   30,30,30 etc.
3     745    298           [More, teenagers, than, ever, are, addicted, to, glue]         etc.
4     40     67            [Lebanese, lawyers, union, accuse, Al, Capone, of, fraud]      etc.
...
9366  345    32            [City, of, Louisville, closed, off, this, summer]              etc.
9367  1200   234           [New, york, rats, stronger, than, ever, reports, say]          etc.
9368  432    123           [Congolese, militia, shipwrecked, in, Norway]                  etc.
9369  594    203           [Scientists, now, agree, on, how, to, use, ice, in, drinks]    etc.
9370  611    153           [Historic, drought, hits, Atlantis]                            etc.

And then explode on titles_split, score_repeated and comments_repeated like this:

df4.explode(['titles_split', 'score_repeated', 'comments_repeated'])
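
(Note that exploding several columns at once like this requires pandas 1.3 or newer, and each row's lists must all be the same length. A sketch of building the repeated columns first, using the column names from the tables above:)

# Assumes titles_split has no NaN rows; drop or fill them beforehand if it does
lengths = df3['titles_split'].str.len()
df4 = df3.assign(
    score_repeated=[[s] * n for s, n in zip(df3['Score'], lengths)],
    comments_repeated=[[c] * n for c, n in zip(df3['num_comments'], lengths)],
)
df4 = df4.explode(['titles_split', 'score_repeated', 'comments_repeated'], ignore_index=True)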

But I never got to that point because I couldn't get repeated columns. I tried the following code:

df3['score_repeat'] = df3.apply(lambda x: [x.score] * len(x.titles_split) , axis =1)

Which gave me this error message:

TypeError: object of type 'float' has no len()

Then I tried:

df3['score_repeat'] = [[y] * x for x, y in zip(df3['titles_split'].str.len(),df['score'])]

Which gave me:

TypeError: can't multiply sequence by non-int of type 'float'
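
Both TypeErrors are consistent with some rows having NaN in titles (and therefore in titles_split): len() fails on a float, and str.len() returns a float NaN that can't be used as a repeat count. A sketch, assuming it is acceptable to drop those rows first (and using the Score column name from the tables above):

# Drop rows whose titles could not be split, then repeat the score once per word
df3 = df3.dropna(subset=['titles_split'])
df3['score_repeat'] = df3.apply(lambda x: [x['Score']] * len(x['titles_split']), axis=1)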

But I am not even sure I am going about this the right way. Do I even need to create score_repeated and comments_repeated?

CodePudding user response:

Assume this is the df:

  Score num_comments                                              titles
0   134          518         Uhaul implement nicotine-free hiring policy
1    28           43          Orangutan saves child from a giant volcano
2    30          114  Swimmer dies in a horrific shark attack in harbour
3   745          298       More teenagers than ever are addicted to glue
4    40           67    Lebanese lawyers union accuse Al Capone of fraud
5   345           32           City of Louisville closed off this summer
6  1200          234     New york rats "stronger than ever", reports say
7   432          123             Congolese militia shipwrecked in Norway
8   594          203    Scientists now agree on how to use ice in drinks
9   611          153                      Historic drought hits Atlantis

You can try the following:

df['titles'] = df['titles'].str.replace('"', '').str.replace(',', '')    #data cleaning
df['titles'] = df['titles'].str.split()    #split sentence into a list (of single words)
df2 = df.explode('titles', ignore_index=True)
df2.columns = ['score', 'num_comments', 'word']
print(df2)

   score num_comments           word
0    134          518          Uhaul
1    134          518      implement
2    134          518  nicotine-free
3    134          518         hiring
4    134          518         policy
5     28           43      Orangutan
6     28           43          saves
7     28           43          child
8     28           43           from
9     28           43              a
10    28           43          giant
11    28           43        volcano
12    30          114        Swimmer
13    30          114           dies
14    30          114             in
15    30          114              a
16    30          114       horrific
17    30          114          shark
18    30          114         attack
19    30          114             in
20    30          114        harbour
21   745          298           More
22   745          298      teenagers
23   745          298           than
24   745          298           ever
25   745          298            are
26   745          298       addicted
27   745          298             to
28   745          298           glue
29    40           67       Lebanese
30    40           67        lawyers
31    40           67          union
32    40           67         accuse
33    40           67             Al
34    40           67         Capone
35    40           67             of
36    40           67          fraud
37   345           32           City
38   345           32             of
39   345           32     Louisville
40   345           32         closed
41   345           32            off
42   345           32           this
43   345           32         summer
44  1200          234            New
45  1200          234           york
46  1200          234           rats
47  1200          234       stronger
48  1200          234           than
49  1200          234           ever
50  1200          234        reports
51  1200          234            say
52   432          123      Congolese
53   432          123        militia
54   432          123    shipwrecked
55   432          123             in
56   432          123         Norway
57   594          203     Scientists
58   594          203            now
59   594          203          agree
60   594          203             on
61   594          203            how
62   594          203             to
63   594          203            use
64   594          203            ice
65   594          203             in
66   594          203         drinks
67   611          153       Historic
68   611          153        drought
69   611          153           hits
70   611          153       Atlantis

Data cleaning needed

I noticed that some titles contain " (double quotation marks) and , (commas), meaning the data is not clean. You could do some data cleaning with the following:

df['titles'] = df['titles'].str.replace('"', '').str.replace(',', '')
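
Putting it all together as a self-contained sketch (the two sample rows are only illustrative, and a single regex replace covers both the quotes and the commas):

import pandas as pd

df = pd.DataFrame({
    'Score': [134, 1200],
    'num_comments': [518, 234],
    'titles': [
        'Uhaul implement nicotine-free hiring policy',
        'New york rats "stronger than ever", reports say',
    ],
})

# Remove quotes and commas in one pass, split into word lists, then explode
df['titles'] = df['titles'].str.replace(r'[",]', '', regex=True)
df['titles'] = df['titles'].str.split()
df2 = df.explode('titles', ignore_index=True)
df2.columns = ['score', 'num_comments', 'word']
print(df2)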