I have a large table of some survey answers which is saved in a peculiar format, with several sections each of which contains parts. If imported as pandas
dataframe, sections become separate columns (from a csv file). As an example, two such columns would be:
from io import StringIO import pandas as pd
from io import StringIO
import pandas as pd
INDATA = StringIO('''Response 1,Response 2
part 1: A; part 2: B; part 3: A; part 4: B; part 5: B,part 1: A; part 2: B; part 3: B; part 4: A
''')
df = pd.read_csv(INDATA, sep=',')
How could I end with a table like this, with individual elements:
| R1 | R2 | R3 | ... | R9 |
| –- | –- | –- | –-- | –- |
| A | B | A | ... | A |
or even like this, with list of elements:
| Response1 | Response2 |
| [A, B, A, B, B] | [A, B, B, A] |
Would re
work, or is there something better?
Thanks!
CodePudding user response:
This will work
df = pd.read_csv(INDATA, sep=",").transpose()
df[0] = df[0].str.split("; ").apply(lambda x: [e.strip().split(":")[1].strip() for e in x])
df = df.transpose()
print(df)
Response 1 Response 2
0 [A, B, A, B, B] [A, B, B, A]
CodePudding user response:
You can stack
, extract
, unstack
:
out = df.stack().str.extractall(':\s*(\w );?')[0].unstack([-2, -1])
output:
Response 1 Response 2
match 0 1 2 3 4 0 1 2 3
0 A B A B B A B B A
If you also need the 'part' number:
out = (df
.stack().str.extractall('part\s*(?P<part>\w ):\s*(?P<answer>\w );?')
.set_index('part', append=True).droplevel('match')['answer']
.unstack([-2, -1])
)
output:
Response 1 Response 2
part 1 2 3 4 5 1 2 3 4
0 A B A B B A B B A
For a list:
out = (df
.stack().str.extractall(':\s*(\w );?')[0]
.groupby(level=[0, 1]).agg(list).unstack()
)
output:
Response 1 Response 2
0 [A, B, A, B, B] [A, B, B, A]
CodePudding user response:
Another possible solution:
(df.apply(lambda x: x.str.split('; '))
.explode(['Response 1']).explode(['Response 2'])
.melt().drop_duplicates()
.pivot_table(columns='variable', values='value',
aggfunc=lambda z: z.str.replace(r'part \d : ', '', regex=True))
)
Output:
variable Response 1 Response 2
value [A, B, A, B, B] [A, B, B, A]