I have a dataframe where sig column has string where it comes as single or group of string. If i want to extract only dataframe where my string matching how can I do that?
df = pd.DataFrame()
df['sig'] = ['301','302','303','405','405 409','302 301','303 301','405','560','309','301 302 303','303 304','305','305 304 303 302 301', '555','565','301 302 303 304']
df['val'] = [45,25,1,54,12,51,52,1,4,152,786,145,1,555,7595,77,89]
if I add string '301 302 303'
to find the same match and where it contains combinations but
it also gives the additional rows where it contains 301 302 303 304
or like that
I tried this way
dtc = '301 302 303'
dtcs = dtc.split()
x = df['sig'].str.contains(dtcs[0])
for i in range (1, len(dtcs)):
x = x | df['sig'].str.contains(dtcs[i])
df_dtc = df[x]
print(df_dtc)
print('****************')
which gives the ouput like this
# output
sig val
0 301 45
1 302 25
5 302 301 51
6 303 301 52
10 301 302 303 786
13 305 304 303 302 301 555
16 301 302 303 304 89
****************
sig val
0 301 45
1 302 25
2 303 1
5 302 301 51
6 303 301 52
10 301 302 303 786
11 303 304 145
13 305 304 303 302 301 555
16 301 302 303 304 89
expected output
output
sig val
301 45
302 25
303 1
302 301 51
303 301 52
301 302 303 786
it means all its combination and without repetations and getting only exact or combinations of that. as you can see it is not adding rows
303 304
305 304 303 302 301
301 302 303 304 etc.....
How can I do this?
CodePudding user response:
The following is a bit ugly but does what I believe is what you are looking for...
# Convert 'sig' to a set
df['sig_as_set'] = df['sig'].str.split().map(set)
# same for 'dtc'
dtc_as_set = {'301', '302', '303'}
# Locate rows of interest (where sig_of_set must be subset of dtc_as_set)
df['row_of_interest'] = df['sig_as_set'].apply(lambda sos: sos.issubset(dtcs_as_set))
df.query('row_of_interest')
returns...
sig val sig_as_set row_of_interest
0 301 45 {301} True
1 302 25 {302} True
2 303 1 {303} True
5 302 301 51 {301, 302} True
6 303 301 52 {301, 303} True
10 301 302 303 786 {301, 303, 302} True
This assumes sig is non-empty.
CodePudding user response:
Just first split
the string then uses itertools.permutations
and concatenate the data frame where the string is matching.
dt = '301 302 303'
import itertools
list1 = dt.split()
parent_df = pd.DataFrame()
for i in range (1, len(list1) 1):
for subset in itertools.permutations(list1, i):
a = ' '.join(subset)
df_dtc = df[df['sig'] == a]
parent_df = pd.concat([parent_df,df_dtc], axis=0)
del(df_dtc)
print(parent_df)
The expected output of this is following
#output
sig val
0 301 45
1 302 25
2 303 1
6 301 303 52
5 302 301 51
10 301 302 303 786