I have a dataset containing anonymous search data from a web service. I want to see which search terms that don't result in a hit are used most often. However, the data is structured in a way I have trouble working with, so I need to trim it down first - but the only way I can think of is slow and inefficient, and I don't know enough about the problem to find a usable answer by searching.
I don't know exactly how the data is collected, but most often each keystroke seems to generate a new row - although sometimes a word appears in the list mostly or fully formed (maybe copy/pasted?).
An example:
date , searchTerm
26-04-2022 13:30:30, d
26-04-2022 13:30:30, de
26-04-2022 13:30:31, dep
26-04-2022 13:30:31, depa
26-04-2022 13:30:31, depar
26-04-2022 13:30:32, depart
26-04-2022 13:30:32, departm
26-04-2022 13:30:33, departme
26-04-2022 13:30:34, departmen
26-04-2022 13:30:34, department
26-04-2022 13:30:34, departmen
26-04-2022 13:30:35, departme
26-04-2022 13:30:36, departm
...
26-04-2022 13:32:11, somethi
26-04-2022 13:32:31, somethin
26-04-2022 13:32:31, something
26-04-2022 13:32:56, another word
...
In this list I'd like to keep only the rows that say 26-04-2022 13:30:34, department, 26-04-2022 13:32:31, something, and 26-04-2022 13:32:56, another word - but the only way I can think of is to iterate through the list, removing the rows whose search terms are contained in adjacent rows.
I'm fairly confident that's not the "correct" way of doing it, however. It's not something I have to do often, but if there's a better (or "correct") way to do it, I'd rather learn something new.
CodePudding user response:
We can use pandas.Series.shift
to compare each row with its adjacent rows, and the Python built-in operator in
to test for substring inclusion.
import pandas as pd
df = pd.DataFrame({'date': range(7), 'searchTerm': 'departme departmen department departmen somethin something anotherword'.split()})
print(df)
# date searchTerm
# 0 0 departme
# 1 1 departmen
# 2 2 department
# 3 3 departmen
# 4 4 somethin
# 5 5 something
# 6 6 anotherword
next_term = df['searchTerm'].shift(-1, fill_value='')
prev_term = df['searchTerm'].shift(1, fill_value='')
# keep a row only if its term is not a substring of either neighbour
goodrows = [x not in y and x not in z
            for x, y, z in zip(df['searchTerm'], next_term, prev_term)]
print( df[goodrows] )
# date searchTerm
# 2 2 department
# 5 5 something
# 6 6 anotherword
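Applied to data shaped like the sample in the question (the CSV text below is reconstructed from that sample, not the real dataset), the same mask looks like this:

```python
import io
import pandas as pd

# Hypothetical CSV shaped like the question's sample data.
raw = """date,searchTerm
26-04-2022 13:30:33,departme
26-04-2022 13:30:34,departmen
26-04-2022 13:30:34,department
26-04-2022 13:32:31,somethin
26-04-2022 13:32:31,something
26-04-2022 13:32:56,another word
"""
df = pd.read_csv(io.StringIO(raw))

next_term = df['searchTerm'].shift(-1, fill_value='')
prev_term = df['searchTerm'].shift(1, fill_value='')
# keep a row only if its term is not a substring of either neighbour
keep = [t not in n and t not in p
        for t, n, p in zip(df['searchTerm'], next_term, prev_term)]
print(df[keep])
# keeps the 'department', 'something' and 'another word' rows
```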
Having to fall back on a list comprehension is a bit frustrating. Pandas offers lots of convenient vectorised notation for numbers, but not so much for strings.
For instance:
goodrows = ~((df['searchTerm'] <= df['searchTerm'].shift(-1)) | (df['searchTerm'] <= df['searchTerm'].shift(1)))
print( df[goodrows] )
would work if searchTerm held numbers and we wanted to remove rows that are less than an adjacent row. (Note that pandas requires the element-wise operators ~ and | here; Python's not and or raise an error on a Series.)
But we can't simply swap operator <=
for operator in
or for the method str.contains
(which matches against a single pattern, not another Series) to make this work with strings.
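One way to keep the expression inside pandas, at least syntactically, is Series.combine, which applies a Python callable element-wise to two aligned Series - a sketch using the same toy DataFrame as above (it still runs Python code per row, so it is no faster than the list comprehension):

```python
import pandas as pd

df = pd.DataFrame({'date': range(7),
                   'searchTerm': 'departme departmen department departmen somethin something anotherword'.split()})

s = df['searchTerm']
# True where the term is a substring of the next or previous row's term
contained = (s.combine(s.shift(-1, fill_value=''), lambda a, b: a in b)
             | s.combine(s.shift(1, fill_value=''), lambda a, b: a in b))
print(df[~contained])
# keeps the 'department', 'something' and 'anotherword' rows
```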