I need to set a really specific regex pattern

Time:06-11

I have a pandas dataframe with values in each cell like this:

GRI 101: Foundation: 
•  Clause 1.1 (Stakeholder Inclusiveness principle)
•  Clause 1.3 (Materiality principle)
•  Clause 2.1 (Applying the Reporting Principles)
GRI 102: General Disclosures: Disclosures 102-40, 102-42, 102-43, and 102-44

By using:

df['column'].str.findall(r'\d+-\d+').str.join('\n')

I am able to get the values:

102-40
102-42
102-43
102-44

That is good, but besides those values I also need to extract the 101 and combine it with the clause numbers in the text, so that I get something like this:

102-40
102-42
102-43
102-44
101-1.1
101-1.3
101-2.1

I'm not sure whether this is possible. I have a large number of cells like this, and I need these values as reference keys so I can map one ESG standard to another.

CodePudding user response:

See if this works:

# Disclosures already carry their prefix, e.g. 102-40 (first row only)
Disclosures = df['0'].str.findall(r'\d+-\d+').str.join('\n')[0]
# Number of the first "GRI NNN" heading in the first row, e.g. 101
top_number = df['0'].str.findall(r'GRI (\d+)')[0][0]
# Clause numbers in the first row, e.g. 1.1, 1.3, 2.1
clauses = df['0'].str.findall(r'\d+\.\d+')[0]
for c in clauses:
    print(top_number, '-', c, sep='')
print(Disclosures)

Here is my code example:

import pandas as pd

d = {'0': ["""GRI 101: Foundation: 
•  Clause 1.1 (Stakeholder Inclusiveness principle)
•  Clause 1.3 (Materiality principle)
•  Clause 2.1 (Applying the Reporting Principles)
GRI 102: General Disclosures: Disclosures 102-40, 102-42, 102-43, and 102-44""", "b"], '1': [3, 4], '2': [4, 5], '3': [7,8], '4': [9,10], '5': [12,13], '6': [15,17]}
df = pd.DataFrame(data=d)
# Disclosures already carry their prefix, e.g. 102-40 (first row only)
Disclosures = df['0'].str.findall(r'\d+-\d+').str.join('\n')[0]
# Number of the first "GRI NNN" heading in the first row, e.g. 101
top_number = df['0'].str.findall(r'GRI (\d+)')[0][0]
# Clause numbers in the first row, e.g. 1.1, 1.3, 2.1
clauses = df['0'].str.findall(r'\d+\.\d+')[0]
for c in clauses:
    print(top_number, '-', c, sep='')
print(Disclosures)

output:

101-1.1
101-1.3
101-2.1
102-40
102-42
102-43
102-44
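
The snippet above reads only the first row and prefixes every clause with the first GRI number it finds. If a cell can hold clauses under more than one GRI heading, one option is to walk the text line by line and remember the most recent heading. This is a minimal sketch under that assumption, not part of the original answer; `extract_refs` and `sample` are hypothetical names, and it uses only the standard `re` module:

```python
import re


def extract_refs(text):
    """Return references such as '101-1.1' and '102-40' found in one cell."""
    refs = []
    current = None  # number of the most recent "GRI NNN" heading seen so far
    for line in text.splitlines():
        heading = re.match(r"\s*GRI (\d+)", line)
        if heading:
            current = heading.group(1)
        # Clauses inherit the number of the heading they appear under.
        if current is not None:
            for clause in re.findall(r"Clause (\d+\.\d+)", line):
                refs.append(f"{current}-{clause}")
        # Disclosures already carry their own prefix (e.g. 102-40).
        refs.extend(re.findall(r"\b\d+-\d+\b", line))
    return refs


sample = """GRI 101: Foundation:
• Clause 1.1 (Stakeholder Inclusiveness principle)
• Clause 1.3 (Materiality principle)
• Clause 2.1 (Applying the Reporting Principles)
GRI 102: General Disclosures: Disclosures 102-40, 102-42, 102-43, and 102-44"""

print("\n".join(extract_refs(sample)))
```

With a dataframe, this could be applied per row with `df['0'].apply(extract_refs)`; a cell without any match, such as `"b"`, yields an empty list.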

CodePudding user response:

Try:

import re


def process(x):
    first = re.search(r"\d+", x).group(0)
    disclosures = re.findall(r"\d+-\d+", x)
    clauses = re.findall(r"Clause (\d+\.\d+)", x)

    return "\n".join(
        ["\n".join(disclosures), "\n".join(f"{first}-{c}" for c in clauses)]
    )


df["result"] = df["column"].apply(process)
print(df.to_markdown())

Prints:

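|    | column                                                                       | result  |
|---:|:-----------------------------------------------------------------------------|:--------|
|  0 | GRI 101: Foundation:                                                         | 102-40  |
|    | • Clause 1.1 (Stakeholder Inclusiveness principle)                           | 102-42  |
|    | • Clause 1.3 (Materiality principle)                                         | 102-43  |
|    | • Clause 2.1 (Applying the Reporting Principles)                             | 102-44  |
|    | GRI 102: General Disclosures: Disclosures 102-40, 102-42, 102-43, and 102-44 | 101-1.1 |
|    |                                                                              | 101-1.3 |
|    |                                                                              | 101-2.1 |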
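
Since the goal in the question is to use these values as reference keys between two standards, one reference per row may be easier to join on than a newline-separated string. The following is a sketch of that idea, not the answerer's code: `refs_as_list`, `long`, `ref` and `source_row` are made-up names, the sample data is shortened, and the function returns an empty list when a cell contains no digits (where `re.search` would return `None`).

```python
import re

import pandas as pd


def refs_as_list(text):
    """Same extraction as process(), but returning a list instead of a string."""
    first = re.search(r"\d+", text)
    if first is None:  # cell without any number, e.g. "b"
        return []
    disclosures = re.findall(r"\d+-\d+", text)
    clauses = re.findall(r"Clause (\d+\.\d+)", text)
    return disclosures + [f"{first.group(0)}-{c}" for c in clauses]


df = pd.DataFrame(
    {
        "column": [
            "GRI 101: Foundation:\n• Clause 1.1 (x)\n"
            "GRI 102: Disclosures 102-40 and 102-42",
            "b",
        ]
    }
)

long = (
    df.assign(ref=df["column"].apply(refs_as_list))
    .explode("ref")              # one row per extracted reference
    .dropna(subset=["ref"])      # drop cells that produced no reference
    .rename_axis("source_row")   # keep the original row label
    .reset_index()
)
print(long[["source_row", "ref"]])
```

The resulting `ref` column can then be used as a key in `DataFrame.merge` against a table built the same way for the other standard.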