I have two dataframes, and I want to mark the second one if the first one contains a pattern. Very large of rows (>10000's)
date | items
20100605 | apple is red
20110606 | orange is orange
20120607 | apple is green
B: shorter with a few hundred rows.
id | color
123 | is Red
234 | not orange
235 | is green
Result would be to flag all columns in B if pattern found in A, possibly adding a column to B like
B:
id | color | found
123 | is Red | true
234 | not orange | false
235 | is green | true
thinking of something like, dfB['found'] = dfB['color'].isin(dfA['items'])
but don't see any way to ignore case. Also, with this approach it will change true to false. Don't want to change those which are already set true. Also, I believe it's inefficient to loop large dataframes more than once. Running through A once and marking B would be better way but not sure how to achieve that using isin()
. Any other ways? Especially ignoring case sensitivity of pattern.
CodePudding user response:
You can use something like this:
df2['check'] = df2['color'].apply(lambda x: True if any(x.casefold() in i.casefold() for i in df['items']) else False)
or you can use str.contains:
df2['check'] = df2['color'].str.contains('|'.join(df['items'].str.split(" ").str[1] ' ' df['items'].str.split(" ").str[2]),case=False)
#get second and third words