I have a DataFrame with several columns like these:
0 (knowledgeable, 0.006922706202287597)
1 (people, 0.0053079601873486145)
2 (just, 0.007235642730624786)
3 (stuff, 0.009834567438203438)
4 (stores, 0.009245883504449756)
5 (worst, 0.0065683703014863875)
6 (helped, 0.013502315692366386)
7 (recommend, 0.0065729286562998725)
8 (things, 0.00650562176223524)
9 (selections, 0.006653774082233169)
10 (experience, 0.006726380158071277)
11 (ok, 0.0014246604031440885)
12 (passed, 0.015648637939820922)
13 (try, 0.008028511624942813)
14 (disabled, 0.009670770697095545)
15 (day, 0.00703846626767429)
16 (biligt, 0.02647689466720133)
17 (checkout, 0.012055180332875096)
18 (stood, 0.009159122828925005)
19 (screen, 0.007820125838899874)
20 (recommended, 0.006226309971548994)
21 (far, 0.01239155021058053)
22 (day, 0.008949285608126105)
23 (neat, 0.009105447278449122)
24 (handling, 0.010347821731508472)
25 (hairdresser, 0.008140116039722884)
26 (helped, 0.008970523437221692)
27 (quite, 0.007926756509831526)
28 (order, 0.011485957263052248)
29 (bought, 0.013794723406541613)
30 (, 1e-05)
31 (poor, 0.013266724386702719)
32 (model, 0.00956998789440704)
33 (department, 0.002812963969232889)
34 (staff, 0.008911761468064975)
35 (operation, 0.01409836946318837)
36 (information, 0.009767055759466813)
37 (rails, 0.008838332416985936)
38 (day, 0.0052219417045371135)
39 (waiting, 0.009414550819917716)
40 (airfryer, 0.007329889734030355)
41 (, 1e-05)
42 (peder, 0.02089218043359051)
43 (employee, 0.008650861558206924)
44 (little, 0.007812544167289761)
45 (mat, 0.018498005518383084)
46 (told, 0.008641486840503518)
47 (activity, 0.009032860019701293)
48 (records, 0.011937516774033033)
49 (venue, 0.01232830407941562)
50 (thim, 0.010341279947061523)
51 (service, 0.009800453069738849)
52 (good, 0.0011770899750682172)
53 (ok, 0.004188977605426055)
54 (right, 0.008547457983148257)
55 (felix, 0.009809746655996321)
56 (need, 0.01028684763722511)
57 (worth, 0.009302224107224836)
58 (repair, 0.008557610145376192)
59 (department, 0.00894549205381644)
Notice that the values at indices 30 and 41 are "incomplete" tuples: (, 1e-05).
The number of columns is dynamic. I do not know the column names in advance, but the column labels are dynamically prefixed with word_score_ followed by the column number.
I want to replace all (, 1e-05) values with np.nan.
This is my best attempt so far:
t = "", 1e-05
print(t) # => ('', 1e-05)
print(type(t)) # => <class 'tuple'>
print(type(t[0])) # => <class 'str'>
print(type(t[1])) # => <class 'float'>
df.replace(to_replace=t, value=np.nan, inplace=True)
But this does not replace the values for some reason.
I have verified that the data types in the DataFrame are correct:
c = get_tidy_dataframe()["word_score_9"]
t = c[30]
print(c.dtype) # => object
print(t) # => ('', 1e-05)
print(type(t)) # => <class 'tuple'>
print(type(t[0])) # => <class 'str'>
print(type(t[1])) # => <class 'float'>
How can I replace those incomplete tuples in my DataFrame?
DataFrame constructor:
d = {'word_score_1': [('knowledgeable', 0.006922706202287597), ('people', 0.0053079601873486145), ('just', 0.007235642730624786), ('stuff', 0.009834567438203438), ('stores', 0.009245883504449756), ('worst', 0.0065683703014863875), ('helped', 0.013502315692366386), ('recommend', 0.0065729286562998725), ('things', 0.00650562176223524), ('selections', 0.006653774082233169), ('experience', 0.006726380158071277), ('ok', 0.0014246604031440885), ('passed', 0.015648637939820922), ('try', 0.008028511624942813), ('disabled', 0.009670770697095545), ('day', 0.00703846626767429), ('biligt', 0.02647689466720133), ('checkout', 0.012055180332875096), ('stood', 0.009159122828925005), ('screen', 0.007820125838899874), ('recommended', 0.006226309971548994), ('far', 0.01239155021058053), ('day', 0.008949285608126105), ('neat', 0.009105447278449122), ('handling', 0.010347821731508472), ('hairdresser', 0.008140116039722884), ('helped', 0.008970523437221692), ('quite', 0.007926756509831526), ('order', 0.011485957263052248), ('bought', 0.013794723406541613), ('', 1e-05), ('poor', 0.013266724386702719), ('model', 0.00956998789440704), ('department', 0.002812963969232889), ('staff', 0.008911761468064975), ('operation', 0.01409836946318837), ('information', 0.009767055759466813), ('rails', 0.008838332416985936), ('day', 0.0052219417045371135), ('waiting', 0.009414550819917716), ('airfryer', 0.007329889734030355), ('', 1e-05), ('peder', 0.02089218043359051), ('employee', 0.008650861558206924), ('little', 0.007812544167289761), ('mat', 0.018498005518383084), ('told', 0.008641486840503518), ('activity', 0.009032860019701293), ('records', 0.011937516774033033), ('venue', 0.01232830407941562), ('thim', 0.010341279947061523), ('service', 0.009800453069738849), ('good', 0.0011770899750682172), ('ok', 0.004188977605426055), ('right', 0.008547457983148257), ('felix', 0.009809746655996321), ('need', 0.01028684763722511), ('worth', 0.009302224107224836), ('repair', 0.008557610145376192), ('department', 
0.00894549205381644)]}
df = pd.DataFrame(d)
CodePudding user response:
You could just use a list comprehension:
df = pd.DataFrame({'col': [(1, 2), (3, 4), ("", 3), (0, ), (0, 3), ("", ""), (3, ), (9, "")]})
df['col'] = [np.nan if ("" in i or len(i) == 1) else i for i in df.col]
df
--------------------------
col
0 (1, 2)
1 (3, 4)
2 NaN
3 NaN
4 (0, 3)
5 NaN
6 NaN
7 NaN
--------------------------
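Since the question has a dynamic number of word_score_ columns, the same idea can be applied to each of them in a loop. This is only a sketch: the is_incomplete helper and the small sample frame are made up for illustration, and it assumes an "incomplete" tuple is one that is too short or contains an empty string.

```python
import numpy as np
import pandas as pd


def is_incomplete(value):
    """Hypothetical helper: True for tuples that are too short or contain ''."""
    return isinstance(value, tuple) and (len(value) < 2 or "" in value)


# Small made-up frame that mimics the layout described in the question
df = pd.DataFrame({
    "word_score_1": [("ok", 0.0014), ("", 1e-05), ("good", 0.0011)],
    "word_score_2": [("", 1e-05), ("day", 0.0070), ("staff", 0.0089)],
    "other": ["a", "b", "c"],
})

# Pick up every column whose label contains the known prefix
cols = df.filter(like="word_score_").columns
for col in cols:
    # Element-wise check, same spirit as the list comprehension above
    df[col] = df[col].map(lambda v: np.nan if is_incomplete(v) else v)
```

Columns whose label does not contain word_score_ are left untouched.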
CodePudding user response:
IIUC, you can use:
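def cleanup(sr):
    # .str[0] picks the first element of each tuple
    sr[sr.str[0] == ''] = np.nan
    return sr
cols = df.filter(like='word_score_').columns
# Start from a copy of df so that `out` exists before assigning into it
out = df.copy()
out[cols] = df[cols].apply(cleanup)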
Output:
>>> out
word_score_1
0 (knowledgeable, 0.006922706202287597)
1 (people, 0.0053079601873486145)
2 (just, 0.007235642730624786)
3 (stuff, 0.009834567438203438)
4 (stores, 0.009245883504449756)
5 (worst, 0.0065683703014863875)
6 (helped, 0.013502315692366386)
7 (recommend, 0.0065729286562998725)
8 (things, 0.00650562176223524)
9 (selections, 0.006653774082233169)
10 (experience, 0.006726380158071277)
11 (ok, 0.0014246604031440885)
12 (passed, 0.015648637939820922)
13 (try, 0.008028511624942813)
14 (disabled, 0.009670770697095545)
15 (day, 0.00703846626767429)
16 (biligt, 0.02647689466720133)
17 (checkout, 0.012055180332875096)
18 (stood, 0.009159122828925005)
19 (screen, 0.007820125838899874)
20 (recommended, 0.006226309971548994)
21 (far, 0.01239155021058053)
22 (day, 0.008949285608126105)
23 (neat, 0.009105447278449122)
24 (handling, 0.010347821731508472)
25 (hairdresser, 0.008140116039722884)
26 (helped, 0.008970523437221692)
27 (quite, 0.007926756509831526)
28 (order, 0.011485957263052248)
29 (bought, 0.013794723406541613)
30 NaN
31 (poor, 0.013266724386702719)
32 (model, 0.00956998789440704)
33 (department, 0.002812963969232889)
34 (staff, 0.008911761468064975)
35 (operation, 0.01409836946318837)
36 (information, 0.009767055759466813)
37 (rails, 0.008838332416985936)
38 (day, 0.0052219417045371135)
39 (waiting, 0.009414550819917716)
40 (airfryer, 0.007329889734030355)
41 NaN
42 (peder, 0.02089218043359051)
43 (employee, 0.008650861558206924)
44 (little, 0.007812544167289761)
45 (mat, 0.018498005518383084)
46 (told, 0.008641486840503518)
47 (activity, 0.009032860019701293)
48 (records, 0.011937516774033033)
49 (venue, 0.01232830407941562)
50 (thim, 0.010341279947061523)
51 (service, 0.009800453069738849)
52 (good, 0.0011770899750682172)
53 (ok, 0.004188977605426055)
54 (right, 0.008547457983148257)
55 (felix, 0.009809746655996321)
56 (need, 0.01028684763722511)
57 (worth, 0.009302224107224836)
58 (repair, 0.008557610145376192)
59 (department, 0.00894549205381644)
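Note that cleanup assigns into the Series it is given. If you would rather not modify anything in place, a variant with Series.mask might look like the sketch below. The sample frame is invented for illustration, and it assumes every non-missing cell is a tuple whose first element is the word.

```python
import numpy as np
import pandas as pd

# Small made-up frame that mimics the layout described in the question
df = pd.DataFrame({
    "word_score_1": [("ok", 0.0014), ("", 1e-05), ("good", 0.0011)],
    "word_score_2": [("", 1e-05), ("day", 0.0070), ("staff", 0.0089)],
})

cols = df.filter(like="word_score_").columns
out = df.copy()
for col in cols:
    # .str[0] picks the first element of each tuple;
    # mask() returns a new Series with NaN wherever that element is ''
    out[col] = df[col].mask(df[col].str[0] == "")
```

Here df keeps its original tuples and only out contains the NaN values.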
CodePudding user response:
Another way, using .loc with .explode:
df.loc[df['word_score_1'].explode()\
.replace('',np.nan).groupby(level=0).nunique() < 2 ,'word_score_1'] = np.nan
word_score_1
0 (knowledgeable, 0.006922706202287597)
1 (people, 0.0053079601873486145)
2 (just, 0.007235642730624786)
3 (stuff, 0.009834567438203438)
4 (stores, 0.009245883504449756)
5 (worst, 0.0065683703014863875)
6 (helped, 0.013502315692366386)
7 (recommend, 0.0065729286562998725)
8 (things, 0.00650562176223524)
9 (selections, 0.006653774082233169)
10 (experience, 0.006726380158071277)
11 (ok, 0.0014246604031440885)
12 (passed, 0.015648637939820922)
13 (try, 0.008028511624942813)
14 (disabled, 0.009670770697095545)
15 (day, 0.00703846626767429)
16 (biligt, 0.02647689466720133)
17 (checkout, 0.012055180332875096)
18 (stood, 0.009159122828925005)
19 (screen, 0.007820125838899874)
20 (recommended, 0.006226309971548994)
21 (far, 0.01239155021058053)
22 (day, 0.008949285608126105)
23 (neat, 0.009105447278449122)
24 (handling, 0.010347821731508472)
25 (hairdresser, 0.008140116039722884)
26 (helped, 0.008970523437221692)
27 (quite, 0.007926756509831526)
28 (order, 0.011485957263052248)
29 (bought, 0.013794723406541613)
30 NaN
31 (poor, 0.013266724386702719)
32 (model, 0.00956998789440704)
33 (department, 0.002812963969232889)
34 (staff, 0.008911761468064975)
35 (operation, 0.01409836946318837)
36 (information, 0.009767055759466813)
37 (rails, 0.008838332416985936)
38 (day, 0.0052219417045371135)
39 (waiting, 0.009414550819917716)
40 (airfryer, 0.007329889734030355)
41 NaN
42 (peder, 0.02089218043359051)
43 (employee, 0.008650861558206924)
44 (little, 0.007812544167289761)
45 (mat, 0.018498005518383084)
46 (told, 0.008641486840503518)
47 (activity, 0.009032860019701293)
48 (records, 0.011937516774033033)
49 (venue, 0.01232830407941562)
50 (thim, 0.010341279947061523)
51 (service, 0.009800453069738849)
52 (good, 0.0011770899750682172)
53 (ok, 0.004188977605426055)
54 (right, 0.008547457983148257)
55 (felix, 0.009809746655996321)
56 (need, 0.01028684763722511)
57 (worth, 0.009302224107224836)
58 (repair, 0.008557610145376192)
59 (department, 0.00894549205381644)