I have a huge dataset and I'm removing rows where the length of a string is less than or equal to 1 word:
a=a[~a['text'].str.split().str.len().le(1)]
The code above works, but it takes 18 minutes to complete.
Is there a more efficient way to do the same task?
CodePudding user response:
You could use a list comprehension:
a[[len(x.split())>1 for x in a['text']]]
or map:
one = 1
a[[*map(one.__lt__, map(len, map(str.split, a['text'])))]]
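A minimal, self-contained sketch of both approaches on made-up data (the DataFrame contents here are just for illustration):

```python
import pandas as pd

# Hypothetical example data, not from the original question
a = pd.DataFrame({'text': ['one', 'two words', 'three word rows']})

# List comprehension: keep rows whose text splits into more than one word
filtered = a[[len(x.split()) > 1 for x in a['text']]]

# Equivalent map-based version: keep rows where 1 < word count
one = 1
filtered_map = a[[*map(one.__lt__, map(len, map(str.split, a['text'])))]]

print(filtered['text'].tolist())  # rows with more than one word
```

Both build a plain Python list of booleans and hand it to the DataFrame indexer, which avoids the per-element overhead of the `.str` accessor chain.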
Some benchmarks:
a = pd.DataFrame({'text': ['one word', 'two','three words ok']*100000})
>>> %timeit -n 10 a[~a['text'].str.split().str.len().le(1)]
328 ms ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit -n 10 a[a['text'].str.split().str.len().gt(1)]
325 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit -n 10 a[a['text'].str.count(' ').gt(0)]
288 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit -n 10 a[[len(x.split())>1 for x in a['text']]]
168 ms ± 3.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit -n 10 a[[*map(one.__lt__, map(len, map(str.split, a['text'])))]]
134 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
CodePudding user response:
In your case, just count the blanks (spaces):
a[a['text'].str.count(' ') >= 1]
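A quick sketch on made-up data. One caveat worth knowing: counting spaces is only equivalent to counting words when words are separated by single spaces, so a string with a leading, trailing, or doubled space can pass the filter even though `split()` would see a single word:

```python
import pandas as pd

# Hypothetical example data; ' padded' has a leading space but only one word
a = pd.DataFrame({'text': ['one', 'two words', ' padded']})

# Keep rows containing at least one space
result = a[a['text'].str.count(' ') >= 1]
print(result['text'].tolist())  # [' padded' is kept despite being one word]
```

If the text is known to be cleanly single-spaced, the count-based filter is a fast and simple choice; otherwise the split-based filters above are the safer option.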