Home > Software engineering >  pandas: extracting dataframe using combination logic on dataframe's column contains string with
pandas: extracting dataframe using combination logic on dataframe's column contains string with

Time:12-09

I have a dataframe where sig column has string where it comes as single or group of string. If i want to extract only dataframe where my string matching how can I do that?

df = pd.DataFrame()
df['sig'] = ['301','302','303','405','405 409','302 301','303 301','405','560','309','301 302 303','303 304','305','305 304 303 302 301', '555','565','301 302 303 304']
df['val'] = [45,25,1,54,12,51,52,1,4,152,786,145,1,555,7595,77,89]

if I add string '301 302 303' to find the same match and where it contains combinations but it also gives the additional rows where it contains 301 302 303 304 or like that I tried this way

dtc = '301 302 303'
dtcs = dtc.split()
x = df['sig'].str.contains(dtcs[0])
for i in range (1, len(dtcs)):
    x = x | df['sig'].str.contains(dtcs[i])
    df_dtc = df[x]
    print(df_dtc)
    print('****************')

which gives the ouput like this

# output
                    sig  val
0                   301   45
1                   302   25
5               302 301   51
6               303 301   52
10          301 302 303  786
13  305 304 303 302 301  555
16      301 302 303 304   89
****************
                    sig  val
0                   301   45
1                   302   25
2                   303    1
5               302 301   51
6               303 301   52
10          301 302 303  786
11              303 304  145
13  305 304 303 302 301  555
16      301 302 303 304   89

expected output

output 

sig           val

301            45
302            25
303             1
302 301        51 
303 301        52
301 302 303   786

it means all its combination and without repetations and getting only exact or combinations of that. as you can see it is not adding rows

303 304
305 304 303 302 301
301 302 303 304 etc.....

How can I do this?

CodePudding user response:

The following is a bit ugly but does what I believe is what you are looking for...

# Convert 'sig' to a set
df['sig_as_set'] = df['sig'].str.split().map(set)

# same for 'dtc'
dtc_as_set = {'301', '302', '303'}

# Locate rows of interest (where sig_of_set must be subset of dtc_as_set)
df['row_of_interest'] = df['sig_as_set'].apply(lambda sos: sos.issubset(dtcs_as_set))

df.query('row_of_interest') returns...

    sig         val sig_as_set      row_of_interest
0   301          45 {301}           True
1   302          25 {302}           True
2   303           1 {303}           True
5   302 301      51 {301, 302}      True
6   303 301      52 {301, 303}      True
10  301 302 303 786 {301, 303, 302} True

This assumes sig is non-empty.

CodePudding user response:

Just first split the string then uses itertools.permutations and concatenate the data frame where the string is matching.

dt = '301 302 303' 
import itertools
list1 = dt.split()
parent_df = pd.DataFrame()
for i in range (1, len(list1)   1):
    for subset in itertools.permutations(list1, i):
        a = ' '.join(subset)
        df_dtc = df[df['sig'] == a]
        parent_df = pd.concat([parent_df,df_dtc], axis=0)
        del(df_dtc)
        
print(parent_df)

The expected output of this is following

#output

            sig  val
0           301   45
1           302   25
2           303    1
6       301 303   52
5       302 301   51
10  301 302 303  786
  • Related