Home > Software design >  Asking for help in Python for structuring text blocks in a pandas data frame using regexes
Asking for help in Python for structuring text blocks in a pandas data frame using regexes

Time:03-21

I would like to ask for help with structuring some data by using regexes. Thank you in advance for your insight. It's a big help.

I have a panda's data frame with two columns: one that contains unique IDs and one that contains congressional records. Each row of the data frame contains a unique congressional record. My data set is pretty large.

I need help restructuring the data from the congressional record. Right now my data looks like this:

SENDING AMERICAN TROOPS TO BOSNIA Mr. IHOFE. Hello. This is another 
sentence. Dr. SMITH. Here is another sentence. I would really appreciate help. 

The first all-caps sequence of words is the title of the record. Any following all-caps words usually preceded by a Mr., Ms., Dr., and so forth are the names of the speakers. The text immediately following the speaker belongs to said speaker until reaching another speaker name.

So in this example, SENDING AMERICAN TROOPS TO BOSNIA is the title. Mr. IHOFE is a speaker who said "Hello. This is another sentence." Dr. SMITH is a speaker who said: "Here is another sentence. I would really appreciate help."

I would like to transform my existing data into a data frame with columns for:

  • Record name
  • Speaker Name
  • Speaker text

I imagine the best way to do this is with regular expressions? But I am having a difficult time. one reason is because I'm having trouble matching with, for example, the record title and not the following "M" in "Mr." For example, ^([A-Z] (?=\s[A-Z])(?:\s[A-Z] ) ) matches with the M in Mr.: SENDING AMERICAN TROOPS TO BOSNIA M.

To make this problem more complicated, sometimes a record title has punctuation, like so: PRESIDENT CLINTON'S VISIT TO ENGLAND, NORTHERN IRELAND, AND IRELAND. In any case, the title will still be the first sequence of all-caps words.

Thank you in advance for your advice on this problem.

CodePudding user response:

This probably will fail on various edge cases, but hopefully gives you a place to start.

Example data:

df = pd.DataFrame(
    {'id': [0, 1, 2],
     'raw': [("SENDING AMERICAN TROOPS TO BOSNIA Mr. IHOFE. Hello. "
              "This is another sentence. Dr. SMITH. Here is another" 
             "sentence. I would really appreciate help."),
             ("PRESIDENT CLINTON'S VISIT TO ENGLAND, NORTHERN " 
              "IRELAND, AND IRELAND Prof. MULLEN. Here's some more " 
              "text, with $pec!al ch&r@cters. Mrs. SMITH. Nice, nice."),
             ("ALL-CAPS TITLES, WITH COMMAS & CHARS, SHOULD "
              "COME THROUGH! Professor X. Let's hope this works. " 
              "Ms. REALIST. This isn't perfect, but may be good " 
              "enough. Mr. OTHER. Oh, really? Ms. REALIST. Yeah, "
              "let's go with it.")]
    }
)

Solution:

# Match from start of string to word boundary before first lowercase
df['title'] = df['raw'].str.extract(r'^(?=. [a-z])([A-Z\W] \b)')

# Trim whitespace
df['title'] = df['title'].str.strip()

# Remove matched titles from raw data and store as new column
df['body'] = df['raw'].str.replace(r'^(?=. [a-z])([A-Z\W] \b)', '', regex=True)

# Split on speaker names
df['split'] = df['body'].str.split(r'([A-Z][\w] .\s?[A-Z] \b.)')
# Discard empty strings; this should give an even number of 
# list items per row
df['split'] = df['split'].apply(lambda row: [x for x in row if x])

# List items 0, 2, 4, ... are speaker names
df['speaker_name'] = df['split'].str[::2]
# List items 1, 3, 5, ... are spoken text
df['speaker_text'] = df['split'].str[1::2]

pd.options.display.max_colwidth = 200
res = df[['id', 'title', 'speaker_name', 'speaker_text']].copy()
res = res.explode(['speaker_name', 'speaker_text'])

# Delete trailing period from speaker names
res['speaker_name'] = res['speaker_name'].str.rstrip('.')

print(res.to_markdown(index=False))
id title speaker_name speaker_text
0 SENDING AMERICAN TROOPS TO BOSNIA Mr. IHOFE Hello. This is another sentence.
0 SENDING AMERICAN TROOPS TO BOSNIA Dr. SMITH Here is another sentence. I would really appreciate help.
1 PRESIDENT CLINTON'S VISIT TO ENGLAND, NORTHERN IRELAND, AND IRELAND Prof. MULLEN Here's some more text, with $pec!al ch&r@cters.
1 PRESIDENT CLINTON'S VISIT TO ENGLAND, NORTHERN IRELAND, AND IRELAND Mrs. SMITH Nice, nice.
2 ALL-CAPS TITLES, WITH COMMAS & CHARS, SHOULD COME THROUGH! Professor X Let's hope this works.
2 ALL-CAPS TITLES, WITH COMMAS & CHARS, SHOULD COME THROUGH! Ms. REALIST This isn't perfect, but may be good enough.
2 ALL-CAPS TITLES, WITH COMMAS & CHARS, SHOULD COME THROUGH! Mr. OTHER Oh, really?
2 ALL-CAPS TITLES, WITH COMMAS & CHARS, SHOULD COME THROUGH! Ms. REALIST Yeah, let's go with it.
  • Related