How to create DataFrame in Python if values from list are in row of a different DataFrame?-CodePudding

I have a sample dataframe:

| ID | SampleColumn1| SampleColumn2 | SampleColumn3 |
|:-- |:------------:| ------------ :| ------------  |
| 1  |sample Apple  | sample Cherry |sample Lime    |
| 2  |sample Cherry | sample lemon  | sample Grape  |

I would like to create a new DataFrame based off of this initial dataframe. Should one of several values in a list [Apple, Lime, Cherry] be in any of the columns for a row, it would appear as a 1 in the new dataframe for its column. In this case, the output should be:

| ID | Apple | Lime | Cherry |
| 1  |  1    |  1   |    1   |
| 2  |  0    |  0   |    1   |

Currently I have tried in going about in using the find function for a string, transforming a series into a string for each row then using an if condition if the value has returned and equals the column name of the new dataframe. I am getting a logic error in this regard.

CodePudding user response：

You can create a function to replace strings that contain your desired substrings, then use pd.get_dummies()

fruits = ['Apple', 'Lime', 'Cherry']
def replace_fruit(string):
    for fruit in fruits:
        if fruit in string:
            return fruit
    return None

pd.get_dummies(df.set_index('ID').applymap(replace_fruit), prefix='', prefix_sep='').groupby(level=0, axis=1).sum().reset_index()

CodePudding user response：

try this:

keywords = ['Apple', 'Lime', 'Cherry']
tmp = (df.melt(ignore_index=False)
       .value.str.extract(
           f'({"|".join(keywords)})',
           expand=False)
       .dropna())

res = (pd.crosstab(index=tmp.index, columns=tmp)
       .rename_axis(index=None, columns=None))
print(res)
>>>
    Apple   Cherry  Lime
1   1       1       1
2   0       1       0