Home > Enterprise >  Extracting specific substrings (recurring elements) from a longer string in Python
Extracting specific substrings (recurring elements) from a longer string in Python

Time:10-10

I have a large table of some survey answers which is saved in a peculiar format, with several sections each of which contains parts. If imported as pandas dataframe, sections become separate columns (from a csv file). As an example, two such columns would be:

from io import StringIO import pandas as pd

from io import StringIO
import pandas as pd

INDATA = StringIO('''Response 1,Response 2
    part 1: A; part 2: B; part 3: A; part 4: B; part 5: B,part 1: A; part 2: B; part 3: B; part 4: A
    ''')

df = pd.read_csv(INDATA, sep=',')

How could I end with a table like this, with individual elements:

| R1 | R2 | R3 | ... | R9 |
| –- | –- | –- | –-- | –- |
|  A |  B |  A | ... |  A |

or even like this, with list of elements:

| Response1       | Response2    |
| [A, B, A, B, B] | [A, B, B, A] |

Would re work, or is there something better? Thanks!

CodePudding user response:

This will work

df = pd.read_csv(INDATA, sep=",").transpose()
df[0] = df[0].str.split("; ").apply(lambda x: [e.strip().split(":")[1].strip() for e in x])
df = df.transpose()

print(df)

        Response 1    Response 2
0  [A, B, A, B, B]  [A, B, B, A]

CodePudding user response:

You can stack, extract, unstack:

out = df.stack().str.extractall(':\s*(\w );?')[0].unstack([-2, -1])

output:

      Response 1             Response 2         
match          0  1  2  3  4          0  1  2  3
0              A  B  A  B  B          A  B  B  A

If you also need the 'part' number:

out = (df
 .stack().str.extractall('part\s*(?P<part>\w ):\s*(?P<answer>\w );?')
 .set_index('part', append=True).droplevel('match')['answer']
 .unstack([-2, -1])
)

output:

     Response 1             Response 2         
part          1  2  3  4  5          1  2  3  4
0             A  B  A  B  B          A  B  B  A

For a list:

out = (df
 .stack().str.extractall(':\s*(\w );?')[0]
 .groupby(level=[0, 1]).agg(list).unstack()
)

output:

        Response 1    Response 2
0  [A, B, A, B, B]  [A, B, B, A]

CodePudding user response:

Another possible solution:

(df.apply(lambda x: x.str.split('; '))
 .explode(['Response 1']).explode(['Response 2'])
 .melt().drop_duplicates()
 .pivot_table(columns='variable', values='value', 
              aggfunc=lambda z: z.str.replace(r'part \d : ', '', regex=True))
)

Output:

variable       Response 1    Response 2
value     [A, B, A, B, B]  [A, B, B, A]
  • Related