How to remove duplicated lines within a list of strings using regex in Python?

Time:10-01

I have a DataFrame as below

df

Index   Lines

0  /// User states this is causing a problem and but the problem can only be fixed by the user. /// User states this is causing a problem and but the problem can only be fixed by the user.
1  //- How to fix the problem is stated below. Below are the list of solutions to the problem. //- How to fix the problem is stated below. Below are the list of solutions to the problem.
2 \\ User describes the problem in the problem report.

I want to remove repeated sentences but not the duplicated words.

I tried the following solution but it also removes duplicated words in the process.

from collections import OrderedDict

df['cleaned'] = (df['Lines'].str.split()
                            .apply(lambda x: OrderedDict.fromkeys(x).keys())
                            .str.join(' '))

This results in

Index   cleaned

0  /// User states this is causing a problem and but the can only be fixed by user.
1  //- How to fix the problem is stated below. Below are list of solutions problem.
2 \\ User describes the problem in report.

But the expected solution is :

Index   cleaned

0  /// User states this is causing a problem and but the problem can only be fixed by the user.
1  //- How to fix the problem is stated below. Below are the list of solutions to the problem.
2 \\ User describes the problem in the problem report.

How do I remove the repeated sentences without also removing the duplicated words? Is there a way to get this done?

Is there a way with regex to grab the first sentence (ending with a "."), check whether that sentence appears again later in the string, and remove everything from the point where it repeats to the end?

Please advise or suggest. Thanks!!
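To illustrate the idea: a backreference can collapse a sentence that is immediately repeated. This is only a sketch of the approach being asked about, assuming sentences end with "." and the text contains no newlines (`.` does not match newlines by default):

```python
import re

def drop_repeated_sentences(text):
    # (.+?\.)  -> lazily capture text up to the first period
    # (\s*\1)+ -> the same captured sentence, repeated one or more times
    return re.sub(r"(.+?\.)(\s*\1)+", r"\1", text)

s = "/// User states this is a problem. /// User states this is a problem."
print(drop_repeated_sentences(s))  # /// User states this is a problem.
```

Note this only collapses *consecutive* repeats; "A. B. A." is left unchanged.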

CodePudding user response:

IIUC:

out = df['Lines'].str.findall(r'[^.]+').explode() \
                 .reset_index().drop_duplicates() \
                 .groupby('Index')['Lines'] \
                 .apply(lambda x: '.'.join(x))
>>> out[0]
 /// User states this is causing a problem and but the problem can only be fixed by the user

>>> out[1]
 //- How to fix the problem is stated below. Below are the list of solutions to the problem

>>> print(out[2])
\\ User describes the problem in the problem report
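For reference, the pipeline above can be reproduced end to end. A hedged sketch with sample data (column name "Lines" and index name "Index" assumed from the question; a `.str.strip()` is added so a stray leading space on a fragment can't defeat `drop_duplicates`):

```python
import pandas as pd

# Sample data reconstructing the question's shape
df = pd.DataFrame({"Lines": [
    "/// User states this is a problem. /// User states this is a problem.",
    "\\\\ User describes the problem in the problem report.",
]})
df.index.name = "Index"

# split into sentence fragments, drop per-row duplicates, rejoin
out = (df["Lines"].str.findall(r"[^.]+").explode().str.strip()
       .reset_index().drop_duplicates()
       .groupby("Index")["Lines"]
       .apply(". ".join) + ".")

print(out[0])  # /// User states this is a problem.
```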

CodePudding user response:

Since your dataframe is just storing strings, let's just do it manually:

seen = set()
for i, row in enumerate(df["Lines"]):
    lines = row.split(". ")
    keep = []
    for line in lines:
        line = line.strip()
        # if you want to clean up the leading slashes/dashes:
        # line = line.strip("\\/-").strip()
        if line and line[-1] != ".":
            line += "."
        if line not in seen:
            keep.append(line)
            seen.add(line)
    df.loc[i, "Lines"] = " ".join(keep)

We iterate over the column row by row, split each row on ". " (which splits it into sentences), and keep each sentence in a list only if it hasn't been seen before. Then we set the row back to that list, joined up again.

Since the token we split by is removed, we append a "." to every sentence which doesn't end with one.
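The same logic can be packaged as a standalone function, which is easier to test and to `apply` to the column. One deliberate difference in this sketch: the `seen` set is local, so duplicates are removed per row rather than across the whole column (which is what the expected output in the question needs):

```python
def dedupe_sentences(text):
    """Keep the first occurrence of each sentence in `text`."""
    seen = set()
    keep = []
    for sentence in text.split(". "):
        sentence = sentence.strip()
        if not sentence:
            continue
        # restore the period consumed by the split token
        if not sentence.endswith("."):
            sentence += "."
        if sentence not in seen:
            keep.append(sentence)
            seen.add(sentence)
    return " ".join(keep)

print(dedupe_sentences("A b. A b. C d."))  # A b. C d.
```

With a DataFrame this would be applied as e.g. `df["Lines"].apply(dedupe_sentences)`.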
