I have words in a cell mixed with word I want to extract. How do I extract the words or delete the w-CodePudding

Hello I'm looking for a solution in Pandas or excel. I have a spread sheet with a column that contain words separated by a semicolon

apple - slice123; banana; apple - slice321; orange; citron; apple - slice345;

I want to extract "banana" and "orange" and "citron" into a new column.

I looked for tokenization and pandas extract with word list but I didn't not find a solution.

My original csv contains 1058 rows and the column in question has 1 correct word (orange etc) and 1 error (apple - sliceXYZ) but also 5 correct words and up to 100 errors.

I hope someone has an idea how to solve this.

Edit for clarification.

I have 1027 rows in in the table but only the column with the data of the "fruits" is relevant. I know that I have 27 different fruits somewhere in the columns

Edit: I added a html table for clarification. The word list is used to identify the relevant "fruits" out of the column data and tells me in the results which of the fruits was used in the column data.

<style>
  table, th, td {
  border: 1px solid black;
  border-collapse: collapse; padding: 15px;
}
</style>



<table>
  <tr>
    <td><p><strong>Colunmn 1</strong></p></td>
    <td><p><strong>Colunmn 2</strong></p></td>
    <td><p><strong>Data</strong></p></td>
    <td><p><strong>Result</strong></p></td>
  </tr>

  <tr>
    <td><p>not relevant</p></td>
    <td><p>not relevant</p></td>
    <td><p>apple - slice123; banana; apple - slice321; orange; citron; apple - slice345</p></td>
    <td><p>banana; orange; citron</p></td>
  </tr>
  <tr>
    <td><p>not relevant</p></td>
    <td><p>not relevant</p></td>
    <td><p>apple - slice435; banana; apple - slice687; orange; citron; apple - slice334; mango; papaya</p></td>
    <td><p>banana; orange; citron; mango; papaya</p></td>
  </tr>
</table>

<p></p>

<table>

  <tr>
    <td><p> <strong>word list</strong><p></td>
  </tr>
  <tr>
    <td><p>banana</p></td>
  </tr>
  <tr>
    <td><p>orange<p></td>
  </tr>
  <tr>
    <td><p>citron<p></td>
  </tr>
  <tr>
    <td><p>mango<p></td>
  </tr>
  <tr>
    <td><p>papaya<p></td>
  </tr>
  
</table>

CodePudding user response：

IIUC you can do it like this:

df = pd.DataFrame(
    {
        "Col2": ["not relevant", "not relevant"],
        "Data": [
            "apple - slice123; banana; apple - slice321; orange; citron; apple - slice345;",
            "apple - slice435; banana; apple - slice687; orange; citron; apple - slice334; mango; papaya",
        ],
    }
)
word_list = ['banana', 'orange', 'citron', 'mango', 'papaya']

two options:
a) 
df["Result"] = df["Data"].map(
    lambda s: ";".join(
        filter(None, [x.strip() for x in s.split(";") if "slice" not in x])
    )
)

b)
df["Result"] = df["Data"].map(
    lambda s: ";".join(
        filter(None, [x.strip() for x in s.split(";") if x.strip() in word_list])
    )
)

print(df['Result']

0                 banana;orange;citron
1    banana;orange;citron;mango;papaya
Name: Result, dtype: object

CodePudding user response：

If you have Excel 2019 (or greater) and the first cell with data is A1 you could use

=TEXTJOIN("; ",TRUE,(FILTERXML("<c><e>"&SUBSTITUTE(A1,";","</e><e>")&"</e></c>","//e[node() and not(contains(., '-'))]")))

If you have Excel 2013 to 2016 then you could just use the FILTERXML() portion of the above, but it would have to be entered as an array formula, e.g. select cells B1:D1, enter the formula in the formula bar, and press CTRL Shift Enter to confirm it (you're selecting 3 cells because you expect to have 3 results)