I need to set a really specific regex pattern

I have a pandas DataFrame with values in each cell like this:

GRI 101: Foundation: 
•  Clause 1.1 (Stakeholder Inclusiveness principle)
•  Clause 1.3 (Materiality principle)
•  Clause 2.1 (Applying the Reporting Principles)
GRI 102: General Disclosures: Disclosures 102-40, 102-42, 102-43, and 102-44

By using:

df['column'].str.findall(r'\d+-\d+').str.join('\n')

I am able to get the disclosure values:

102-40
102-42
102-43
102-44

That is good, but apart from those values, I also need a way to extract the 101 and append the clause numbers that appear in the text. I need to gather those values as well and get something like 101-1.1, 101-1.3, and 101-2.1 alongside the disclosure values.

I'm not sure whether this is possible. I have a large number of values like this, and I need to extract them as references so that I can map the relationship between two ESG standards based on those values.

CodePudding user response：

See if this works:

Disclosures = df['0'].str.findall(r'\d+-\d+').str.join('\n')[0]
top_number = str(df['0'].str.findall(r'GRI \d+').str.join('')).split('GRI')[1].strip()
clauses = str(df['0'].str.findall(r'[\d]+[.][\d]+').str.join(' ')[0]).split(' ')
for c in clauses:
    print(top_number, '-', c, sep='')
print(Disclosures)

Here is my code example:

import pandas as pd

d = {'0': ["""GRI 101: Foundation: 
•  Clause 1.1 (Stakeholder Inclusiveness principle)
•  Clause 1.3 (Materiality principle)
•  Clause 2.1 (Applying the Reporting Principles)
GRI 102: General Disclosures: Disclosures 102-40, 102-42, 102-43, and 102-44""", "b"], '1': [3, 4], '2': [4, 5], '3': [7,8], '4': [9,10], '5': [12,13], '6': [15,17]}
df = pd.DataFrame(data=d)
Disclosures = df['0'].str.findall(r'\d+-\d+').str.join('\n')[0]
top_number = str(df['0'].str.findall(r'GRI \d+').str.join('')).split('GRI')[1].strip()
clauses = str(df['0'].str.findall(r'[\d]+[.][\d]+').str.join(' ')[0]).split(' ')
for c in clauses:
    print(top_number, '-', c, sep='')
print(Disclosures)

Output:

101-1.1
101-1.3
101-2.1
102-40
102-42
102-43
102-44

CodePudding user response：

Try:

import re


def process(x):
    first = re.search(r"\d+", x).group(0)
    disclosures = re.findall(r"\d+-\d+", x)
    clauses = re.findall(r"Clause (\d+\.\d+)", x)

    return "\n".join(
        ["\n".join(disclosures), "\n".join(f"{first}-{c}" for c in clauses)]
    )


df["result"] = df["column"].apply(process)
print(df.to_markdown())

Prints:

	column	result
0	GRI 101: Foundation:	102-40
	• Clause 1.1 (Stakeholder Inclusiveness principle)	102-42
	• Clause 1.3 (Materiality principle)	102-43
	• Clause 2.1 (Applying the Reporting Principles)	102-44
	GRI 102: General Disclosures: Disclosures 102-40, 102-42, 102-43, and 102-44	101-1.1
		101-1.3
		101-2.1
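
One possible refinement, in case a single cell contains several GRI sections that each list their own clauses: the function above prefixes every clause with the first number found in the text, so clauses under a later section would get the wrong prefix. The sketch below instead pairs each clause with the closest preceding "GRI <number>" header. The names `TOKEN` and `extract_refs` are hypothetical, and it assumes the cells always follow the "GRI nnn" / "Clause x.y" / "nnn-nn" wording shown in the question.

```python
import re

# One pass over the text: a "GRI <number>" header, a "Clause x.y" item,
# or a "nnn-nn" disclosure, whichever comes next.
TOKEN = re.compile(
    r"GRI (?P<gri>\d+)|Clause (?P<clause>\d+\.\d+)|(?P<disclosure>\d+-\d+)"
)


def extract_refs(text):
    """Return the references found in text, in order of appearance."""
    refs = []
    current_gri = None  # most recent "GRI <number>" seen so far
    for m in TOKEN.finditer(text):
        if m.group("gri"):
            current_gri = m.group("gri")
        elif m.group("clause") and current_gri is not None:
            # Prefix the clause with the header it appears under.
            refs.append(f"{current_gri}-{m.group('clause')}")
        elif m.group("disclosure"):
            refs.append(m.group("disclosure"))
    return refs


sample = """GRI 101: Foundation:
•  Clause 1.1 (Stakeholder Inclusiveness principle)
•  Clause 1.3 (Materiality principle)
•  Clause 2.1 (Applying the Reporting Principles)
GRI 102: General Disclosures: Disclosures 102-40, 102-42, 102-43, and 102-44"""

print("\n".join(extract_refs(sample)))
```

It returns a list rather than a joined string, which may be easier to work with when matching against another standard. It can be applied to the DataFrame in the same way as above, for example df["result"] = df["column"].apply(extract_refs).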