How to replace "incomplete" tuples in a Pandas DataFrame with np.nan

Time:06-23

I have a DataFrame with several columns like these:

0     (knowledgeable, 0.006922706202287597)
1           (people, 0.0053079601873486145)
2              (just, 0.007235642730624786)
3             (stuff, 0.009834567438203438)
4            (stores, 0.009245883504449756)
5            (worst, 0.0065683703014863875)
6            (helped, 0.013502315692366386)
7        (recommend, 0.0065729286562998725)
8             (things, 0.00650562176223524)
9        (selections, 0.006653774082233169)
10       (experience, 0.006726380158071277)
11              (ok, 0.0014246604031440885)
12           (passed, 0.015648637939820922)
13              (try, 0.008028511624942813)
14         (disabled, 0.009670770697095545)
15               (day, 0.00703846626767429)
16            (biligt, 0.02647689466720133)
17         (checkout, 0.012055180332875096)
18            (stood, 0.009159122828925005)
19           (screen, 0.007820125838899874)
20      (recommended, 0.006226309971548994)
21               (far, 0.01239155021058053)
22              (day, 0.008949285608126105)
23             (neat, 0.009105447278449122)
24         (handling, 0.010347821731508472)
25      (hairdresser, 0.008140116039722884)
26           (helped, 0.008970523437221692)
27            (quite, 0.007926756509831526)
28            (order, 0.011485957263052248)
29           (bought, 0.013794723406541613)
30                                (, 1e-05)
31             (poor, 0.013266724386702719)
32             (model, 0.00956998789440704)
33       (department, 0.002812963969232889)
34            (staff, 0.008911761468064975)
35         (operation, 0.01409836946318837)
36      (information, 0.009767055759466813)
37            (rails, 0.008838332416985936)
38             (day, 0.0052219417045371135)
39          (waiting, 0.009414550819917716)
40         (airfryer, 0.007329889734030355)
41                                (, 1e-05)
42             (peder, 0.02089218043359051)
43         (employee, 0.008650861558206924)
44           (little, 0.007812544167289761)
45              (mat, 0.018498005518383084)
46             (told, 0.008641486840503518)
47         (activity, 0.009032860019701293)
48          (records, 0.011937516774033033)
49             (venue, 0.01232830407941562)
50             (thim, 0.010341279947061523)
51          (service, 0.009800453069738849)
52            (good, 0.0011770899750682172)
53               (ok, 0.004188977605426055)
54            (right, 0.008547457983148257)
55            (felix, 0.009809746655996321)
56              (need, 0.01028684763722511)
57            (worth, 0.009302224107224836)
58           (repair, 0.008557610145376192)
59        (department, 0.00894549205381644)

Notice that the values at indices 30 and 41 are "incomplete" tuples: (, 1e-05).

The number of columns is dynamic, so I do not know the column names in advance. Every column label, however, consists of the prefix word_score_ followed by the column number.

I want to replace all (, 1e-05) values with np.nan.

This is my best attempt so far:

t = "", 1e-05

print(t)  # => ('', 1e-05)
print(type(t))  # => <class 'tuple'>
print(type(t[0]))  # => <class 'str'>
print(type(t[1]))  # => <class 'float'>

df.replace(to_replace=t, value=np.nan, inplace=True)

But this does not replace the values for some reason.

I have verified that the data types in the DataFrame are correct:

c = get_tidy_dataframe()["word_score_9"]
t = c[30]

print(c.dtype)  # => object
print(t)  # => ('', 1e-05)
print(type(t))  # => <class 'tuple'>
print(type(t[0]))  # => <class 'str'>
print(type(t[1]))  # => <class 'float'>

How can I replace those incomplete tuples in my DataFrame?

DataFrame constructor:

import numpy as np
import pandas as pd

d = {'word_score_1': [('knowledgeable', 0.006922706202287597), ('people', 0.0053079601873486145), ('just', 0.007235642730624786), ('stuff', 0.009834567438203438), ('stores', 0.009245883504449756), ('worst', 0.0065683703014863875), ('helped', 0.013502315692366386), ('recommend', 0.0065729286562998725), ('things', 0.00650562176223524), ('selections', 0.006653774082233169), ('experience', 0.006726380158071277), ('ok', 0.0014246604031440885), ('passed', 0.015648637939820922), ('try', 0.008028511624942813), ('disabled', 0.009670770697095545), ('day', 0.00703846626767429), ('biligt', 0.02647689466720133), ('checkout', 0.012055180332875096), ('stood', 0.009159122828925005), ('screen', 0.007820125838899874), ('recommended', 0.006226309971548994), ('far', 0.01239155021058053), ('day', 0.008949285608126105), ('neat', 0.009105447278449122), ('handling', 0.010347821731508472), ('hairdresser', 0.008140116039722884), ('helped', 0.008970523437221692), ('quite', 0.007926756509831526), ('order', 0.011485957263052248), ('bought', 0.013794723406541613), ('', 1e-05), ('poor', 0.013266724386702719), ('model', 0.00956998789440704), ('department', 0.002812963969232889), ('staff', 0.008911761468064975), ('operation', 0.01409836946318837), ('information', 0.009767055759466813), ('rails', 0.008838332416985936), ('day', 0.0052219417045371135), ('waiting', 0.009414550819917716), ('airfryer', 0.007329889734030355), ('', 1e-05), ('peder', 0.02089218043359051), ('employee', 0.008650861558206924), ('little', 0.007812544167289761), ('mat', 0.018498005518383084), ('told', 0.008641486840503518), ('activity', 0.009032860019701293), ('records', 0.011937516774033033), ('venue', 0.01232830407941562), ('thim', 0.010341279947061523), ('service', 0.009800453069738849), ('good', 0.0011770899750682172), ('ok', 0.004188977605426055), ('right', 0.008547457983148257), ('felix', 0.009809746655996321), ('need', 0.01028684763722511), ('worth', 0.009302224107224836), ('repair', 0.008557610145376192), ('department', 
0.00894549205381644)]}
df = pd.DataFrame(d)

CodePudding user response:

You could just use a list comprehension:

df = pd.DataFrame({'col': [(1, 2), (3, 4),  ("", 3), (0, ), (0, 3), ("", ""), (3, ), (9, "")]})
df['col'] = [np.nan if ("" in i or len(i) == 1) else i for i in df.col]
df

--------------------------
    col
0   (1, 2)
1   (3, 4)
2   NaN
3   NaN
4   (0, 3)
5   NaN
6   NaN
7   NaN
--------------------------
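Since the question mentions several columns that share the word_score_ prefix, the same test can be applied to each of them in a loop. The following is only a minimal sketch: the helper name is_incomplete, the sample data, and the extra column "other" are made up for illustration, and Series.map stands in for the list comprehension.

```python
import numpy as np
import pandas as pd


def is_incomplete(value):
    """Hypothetical helper: True for a tuple that is too short
    or that contains an empty string."""
    return isinstance(value, tuple) and (len(value) < 2 or "" in value)


# Small made-up frame; "other" should be left untouched.
df = pd.DataFrame({
    "word_score_1": [("good", 0.5), ("", 1e-05), ("ok", 0.25)],
    "word_score_2": [("", 1e-05), ("staff", 0.75), ("day", 0.125)],
    "other": ["a", "b", "c"],
})

# Select the columns by prefix, because their names are not known in advance.
score_cols = [c for c in df.columns if str(c).startswith("word_score_")]

for col in score_cols:
    # Same test as the comprehension above, applied element by element.
    df[col] = df[col].map(lambda v: np.nan if is_incomplete(v) else v)
```

The isinstance check means that cells which are already NaN (or any other non-tuple value) pass through unchanged, so the loop can be run more than once.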

CodePudding user response:

IIUC, you can use:

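def cleanup(sr):
    # Replace every tuple whose first element is an empty string with NaN.
    return sr.mask(sr.str[0] == '')

cols = df.filter(like='word_score_').columns
out = df.copy()  # work on a copy so that df itself stays unchanged
out[cols] = df[cols].apply(cleanup)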

Output:

>>> out
                             word_score_1
0   (knowledgeable, 0.006922706202287597)
1         (people, 0.0053079601873486145)
2            (just, 0.007235642730624786)
3           (stuff, 0.009834567438203438)
4          (stores, 0.009245883504449756)
5          (worst, 0.0065683703014863875)
6          (helped, 0.013502315692366386)
7      (recommend, 0.0065729286562998725)
8           (things, 0.00650562176223524)
9      (selections, 0.006653774082233169)
10     (experience, 0.006726380158071277)
11            (ok, 0.0014246604031440885)
12         (passed, 0.015648637939820922)
13            (try, 0.008028511624942813)
14       (disabled, 0.009670770697095545)
15             (day, 0.00703846626767429)
16          (biligt, 0.02647689466720133)
17       (checkout, 0.012055180332875096)
18          (stood, 0.009159122828925005)
19         (screen, 0.007820125838899874)
20    (recommended, 0.006226309971548994)
21             (far, 0.01239155021058053)
22            (day, 0.008949285608126105)
23           (neat, 0.009105447278449122)
24       (handling, 0.010347821731508472)
25    (hairdresser, 0.008140116039722884)
26         (helped, 0.008970523437221692)
27          (quite, 0.007926756509831526)
28          (order, 0.011485957263052248)
29         (bought, 0.013794723406541613)
30                                    NaN
31           (poor, 0.013266724386702719)
32           (model, 0.00956998789440704)
33     (department, 0.002812963969232889)
34          (staff, 0.008911761468064975)
35       (operation, 0.01409836946318837)
36    (information, 0.009767055759466813)
37          (rails, 0.008838332416985936)
38           (day, 0.0052219417045371135)
39        (waiting, 0.009414550819917716)
40       (airfryer, 0.007329889734030355)
41                                    NaN
42           (peder, 0.02089218043359051)
43       (employee, 0.008650861558206924)
44         (little, 0.007812544167289761)
45            (mat, 0.018498005518383084)
46           (told, 0.008641486840503518)
47       (activity, 0.009032860019701293)
48        (records, 0.011937516774033033)
49           (venue, 0.01232830407941562)
50           (thim, 0.010341279947061523)
51        (service, 0.009800453069738849)
52          (good, 0.0011770899750682172)
53             (ok, 0.004188977605426055)
54          (right, 0.008547457983148257)
55          (felix, 0.009809746655996321)
56            (need, 0.01028684763722511)
57          (worth, 0.009302224107224836)
58         (repair, 0.008557610145376192)
59      (department, 0.00894549205381644)
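If only the exact tuple ('', 1e-05) from the question should be removed, rather than every tuple with an empty first element, each cell can be compared against the whole tuple. This avoids df.replace, which appears to treat a tuple passed as to_replace as a sequence of separate values to look for rather than as a single value. That is an assumption based on the behaviour described in the question, not something confirmed in this thread. The sketch below uses made-up sample data, and the names target, is_target and cleaned are illustrative.

```python
import numpy as np
import pandas as pd

# The exact "incomplete" tuple from the question.
target = ("", 1e-05)

# Small made-up frame with two word_score_ columns.
df = pd.DataFrame({
    "word_score_1": [("good", 0.5), ("", 1e-05)],
    "word_score_2": [("", 1e-05), ("ok", 0.25)],
})

# Build a boolean frame: True where the cell equals the whole tuple.
is_target = df.apply(lambda col: col.map(lambda v: v == target))

# DataFrame.mask replaces the True positions with NaN and returns a new frame.
cleaned = df.mask(is_target)
```

Because mask returns a new object, df keeps its original values and the result lives in cleaned.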

CodePudding user response:

Another way, using .loc with .explode:

df.loc[df['word_score_1'].explode()\
  .replace('',np.nan).groupby(level=0).nunique() < 2 ,'word_score_1'] = np.nan


                             word_score_1
0   (knowledgeable, 0.006922706202287597)
1         (people, 0.0053079601873486145)
2            (just, 0.007235642730624786)
3           (stuff, 0.009834567438203438)
4          (stores, 0.009245883504449756)
5          (worst, 0.0065683703014863875)
6          (helped, 0.013502315692366386)
7      (recommend, 0.0065729286562998725)
8           (things, 0.00650562176223524)
9      (selections, 0.006653774082233169)
10     (experience, 0.006726380158071277)
11            (ok, 0.0014246604031440885)
12         (passed, 0.015648637939820922)
13            (try, 0.008028511624942813)
14       (disabled, 0.009670770697095545)
15             (day, 0.00703846626767429)
16          (biligt, 0.02647689466720133)
17       (checkout, 0.012055180332875096)
18          (stood, 0.009159122828925005)
19         (screen, 0.007820125838899874)
20    (recommended, 0.006226309971548994)
21             (far, 0.01239155021058053)
22            (day, 0.008949285608126105)
23           (neat, 0.009105447278449122)
24       (handling, 0.010347821731508472)
25    (hairdresser, 0.008140116039722884)
26         (helped, 0.008970523437221692)
27          (quite, 0.007926756509831526)
28          (order, 0.011485957263052248)
29         (bought, 0.013794723406541613)
30                                    NaN
31           (poor, 0.013266724386702719)
32           (model, 0.00956998789440704)
33     (department, 0.002812963969232889)
34          (staff, 0.008911761468064975)
35       (operation, 0.01409836946318837)
36    (information, 0.009767055759466813)
37          (rails, 0.008838332416985936)
38           (day, 0.0052219417045371135)
39        (waiting, 0.009414550819917716)
40       (airfryer, 0.007329889734030355)
41                                    NaN
42           (peder, 0.02089218043359051)
43       (employee, 0.008650861558206924)
44         (little, 0.007812544167289761)
45            (mat, 0.018498005518383084)
46           (told, 0.008641486840503518)
47       (activity, 0.009032860019701293)
48        (records, 0.011937516774033033)
49           (venue, 0.01232830407941562)
50           (thim, 0.010341279947061523)
51        (service, 0.009800453069738849)
52          (good, 0.0011770899750682172)
53             (ok, 0.004188977605426055)
54          (right, 0.008547457983148257)
55          (felix, 0.009809746655996321)
56            (need, 0.01028684763722511)
57          (worth, 0.009302224107224836)
58         (repair, 0.008557610145376192)
59      (department, 0.00894549205381644)
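To make the one-liner easier to follow, here is the same chain split into named steps on a tiny made-up frame. The intermediate names (exploded, no_blanks, counts, incomplete) are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"word_score_1": [("good", 0.5), ("", 1e-05), ("ok", 0.25)]})

# One row per tuple element; the original index label is repeated.
exploded = df["word_score_1"].explode()

# Turn empty strings into NaN so they are not counted below.
no_blanks = exploded.replace("", np.nan)

# Count the distinct non-NaN elements that belong to each original row.
counts = no_blanks.groupby(level=0).nunique()

# A complete (word, score) tuple has two distinct elements.
incomplete = counts < 2

df.loc[incomplete, "word_score_1"] = np.nan
```

Note that this test would also flag a tuple whose two elements are equal, since it counts distinct values. That cannot happen with (word, score) pairs, but it is worth keeping in mind for other data.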