import pandas as pd
# list of paragraphs from judicial opinions
# rows are opinions
# columns are paragraphs from the opinion
opinion1 = ['sentenced to life','sentenced to death. The sentence ...','', 'sentencing Appellant for a term of life imprisonment']
opinion2 = ['Justice Smith','This concerns a sentencing hearing.', 'The third sentence read ...', 'Defendant rested.']
opinion3 = ['sentence sentencing sentenced','New matters ...', 'The clear weight of the evidence', 'A death sentence']
data = [opinion1, opinion2, opinion3]
df = pd.DataFrame(data, columns = ['p1','p2','p3','p4'])
# This works for one column. I have 300 in the real data set.
df['p2'].str.contains('sentenc')
How do I determine whether 'sentenc' is in columns 'p1' through 'p4'?
Desired output would be something like:
True True False True
False True True False
True False False True
How do I retrieve a count of the number of times that 'sentenc' appears in each cell?
Desired output would be a count for each cell of the number of times 'sentenc' appears:
1 2 0 1
0 1 1 0
3 0 0 1
Thank you!
CodePudding user response:
Use pd.Series.str.count on each column:
counts = df.apply(lambda col: col.str.count('sentenc'))
Output:
>>> counts
   p1  p2  p3  p4
0   1   2   0   1
1   0   1   1   0
2   3   0   0   1
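The question mentions columns 'p1' through 'p4' out of roughly 300. If the real frame also holds non-text columns, one option is to take a label slice first and apply the count only to it. This is a minimal sketch: the `case_id` column and the two-row sample are hypothetical, added only to show the slice.

```python
import pandas as pd

# Small sample frame; 'case_id' is a hypothetical non-text column,
# not part of the data shown in the question.
df = pd.DataFrame(
    {
        'case_id': [101, 102],
        'p1': ['sentenced to life', 'Justice Smith'],
        'p2': ['sentenced to death. The sentence ...',
               'This concerns a sentencing hearing.'],
    }
)

# .loc label slices include both endpoints, so this selects p1 and p2.
text_cols = df.loc[:, 'p1':'p2']

# Count occurrences of the pattern in every selected cell.
counts = text_cols.apply(lambda col: col.str.count('sentenc'))
print(counts)
```

With 300 paragraph columns, the same slice would run from the first paragraph label to the last, assuming those columns are contiguous in the frame.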
To get it in boolean form, use .str.contains, or call .astype(bool) on the result of the code above:
bools = df.apply(lambda col: col.str.contains('sentenc'))
or
bools = df.apply(lambda col: col.str.count('sentenc')).astype(bool)
Both produce the same result for this data.
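One caveat if some opinions have fewer paragraphs and the extra cells are missing (NaN) rather than empty strings: .str.contains returns a missing value for those cells unless na=False is passed, and .astype(bool) turns NaN into True. The sketch below assumes missing cells should count as "no match"; the two-row sample with a NaN is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the second opinion has no second paragraph.
df = pd.DataFrame(
    {
        'p1': ['sentenced to life', 'Justice Smith'],
        'p2': ['A death sentence', np.nan],
    }
)

# na=False makes missing cells come back as False instead of a missing value.
via_contains = df.apply(lambda col: col.str.contains('sentenc', na=False))

# For the count-based route, replace missing counts with 0 before casting,
# because casting NaN to bool would yield True.
via_count = (
    df.apply(lambda col: col.str.count('sentenc'))
      .fillna(0)
      .astype(bool)
)

print(via_contains)
print(via_count)
```

Both routes then agree on frames with gaps as well.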