I have a huge dataset and I'm removing rows where the length of a string is less than or equal to 1 word:
a=a[~a['text'].str.split().str.len().le(1)]
The code above works, but it takes 18 minutes to complete.
Is there a more efficient way to do the same task?
CodePudding user response:
You could use a list comprehension:
a[[len(x.split())>1 for x in a['text']]]
or map:
one = 1
a[[*map(one.__lt__, map(len, map(str.split, a['text'])))]]
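A minimal, self-contained sketch of both approaches on made-up data (the DataFrame contents here are just for illustration):

```python
import pandas as pd

# Hypothetical example data, not from the original question
a = pd.DataFrame({'text': ['one', 'two words', 'three word rows']})

# List comprehension: keep rows whose text splits into more than one word
filtered = a[[len(x.split()) > 1 for x in a['text']]]

# Equivalent map-based version: keep rows where 1 < word count
one = 1
filtered_map = a[[*map(one.__lt__, map(len, map(str.split, a['text'])))]]

print(filtered['text'].tolist())  # rows with more than one word
```

Both build a plain Python list of booleans and hand it to the DataFrame indexer, which avoids the per-element overhead of the `.str` accessor chain.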
Some benchmarks:
a = pd.DataFrame({'text': ['one word', 'two','three words ok']*100000})
>>> %timeit -n 10 a[~a['text'].str.split().str.len().le(1)]
328 ms ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit -n 10 a[a['text'].str.split().str.len().gt(1)]
325 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit -n 10 a[a['text'].str.count(' ').gt(0)]
288 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit -n 10 a[[len(x.split())>1 for x in a['text']]]
168 ms ± 3.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit -n 10 a[[*map(one.__lt__, map(len, map(str.split, a['text'])))]]
134 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
CodePudding user response:
In your case, just count the blanks (spaces):
a[a['text'].str.count(' ') >= 1]
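A quick sketch on made-up data. One caveat worth knowing: counting spaces is only equivalent to counting words when words are separated by single spaces, so a string with a leading, trailing, or doubled space can pass the filter even though `split()` would see a single word:

```python
import pandas as pd

# Hypothetical example data; ' padded' has a leading space but only one word
a = pd.DataFrame({'text': ['one', 'two words', ' padded']})

# Keep rows containing at least one space
result = a[a['text'].str.count(' ') >= 1]
print(result['text'].tolist())  # [' padded' is kept despite being one word]
```

If the text is known to be cleanly single-spaced, the count-based filter is a fast and simple choice; otherwise the split-based filters above are the safer option.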