Regex to extract text after certain characters, regardless of whether it contains one or two \n


I'm using Python. I need to use a regex to extract certain text from the cells of a column and move it into new columns. Consider the following example. I have a dataset with many dataframes that look like this:

| index | N ° Review  | Origen                    | Fecha de emisión del \nreview | Autor       |
| 1     | 23-0001-523 | Región: \nATACAMA         | 02/01/2022                    | JOSE GARCIA |
| 2     |             | Provincia: \nCOPIAPO      |                               |             |   
| 3     |             | Comuna: \nSAN PEDRO DE LA |                               |             |
|       |             | \nTIERRA AMARILLA         |                               |             |

And I need it to look like this:

|index|numero_review |fecha_emision_review  |autor      |comuna                           |provincia     |region  |
|1    |23-0001-523   |02/01/2022            |JOSE GARCIA|                                 |              |ATACAMA |
|2    |              |                      |           |                                 |COPIAPO       |        |
|3    |              |                      |           | SAN PEDRO DE LA TIERRA AMARILLA |              |        |

I have implemented the following code to achieve this objective:

import camelot

# i: path to the PDF being processed (defined elsewhere in my code)
table1 = camelot.read_pdf(i, pages="all", flavor="lattice", line_scale=40)
df = table1[0].df
df2 = df.transpose(copy=True)
tabla_1 = df2.rename(columns={0: 'numero_review', 1: 'origen', 2: 'fecha_emision_review', 3: 'autor'})
tabla_1 = tabla_1.drop(index=0)
# Pull each labelled value out of 'origen' into its own column
tabla_1['comuna'] = tabla_1['origen'].str.extract(r'Comuna:\s\W+(.+)$').fillna('')
tabla_1['provincia'] = tabla_1['origen'].str.extract(r'Provincia:\s\W+(.+)$').fillna('')
tabla_1['region'] = tabla_1['origen'].str.extract(r'Región:\s\W+(.+)$').fillna('')
tabla_1.drop(['origen'], axis=1, inplace=True)

However, this regex fails to capture comuna, because its value contains two \n. This is the result:

|index|numero_review |fecha_emision_review  |autor      |comuna                           |provincia     |region  |
|1    |23-0001-523   |02/01/2022            |JOSE GARCIA|                                 |              |ATACAMA |
|2    |              |                      |           |                                 |COPIAPO       |        |
|3    |              |                      |           |                                 |              |        |

So I need a regex that, in the case of comuna, correctly identifies the text regardless of whether it contains one or two \n.

Any suggestions? The pdf used in the implemented code can be found here: https://drive.google.com/file/d/1xwTWnYUpYz_8cKqWFe9c6UoAliz2SAgH/view?usp=sharing

CodePudding user response:

Use the re.MULTILINE and re.DOTALL flags as in the following line. re.DOTALL makes '.' match newline characters as well, and re.MULTILINE makes '^' and '$' match at every line boundary. See the re module reference for details.

import re
tabla_1['comuna'] = tabla_1['origen'].str.extract(r'Comuna:\s*(.+)', re.MULTILINE | re.DOTALL).fillna('')
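
To see why the flag matters, here is a minimal sketch using only the standard re module (pandas' str.extract is built on it). The cell string is an assumption modelled on the table in the question, not data read from the PDF. It also shows one possible way to collapse the embedded line break so the value reads like the desired output.

```python
import re

# Assumed cell contents, modelled on the 'Comuna' rows shown in the question.
cell = "Comuna: \nSAN PEDRO DE LA \nTIERRA AMARILLA"

# Pattern from the question: without DOTALL, '.' stops at the second newline,
# so '$' cannot match there and the whole search fails.
original = re.search(r"Comuna:\s\W+(.+)$", cell)

# With DOTALL, '.' also matches '\n', so the capture runs to the end of the string.
fixed = re.search(r"Comuna:\s*(.+)", cell, flags=re.DOTALL)
raw_value = fixed.group(1)

# Collapse any run of whitespace (including the line break) into one space.
comuna = " ".join(raw_value.split())

print(original)
print(repr(raw_value))
print(comuna)
```

In pandas, the same whitespace clean-up could be applied to the extracted column afterwards, for example with .str.replace(r'\s+', ' ', regex=True).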

A few further suggestions:

  1. Don't mix '\s' (whitespace) and '\W' (non-word characters), because the latter already includes the former.

  2. You don't need '$' or '\Z' to capture characters up to the end of the string, because the '+' and '*' quantifiers are greedy by default.
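
Both suggestions can be checked with a short sketch using the standard re module. The sample string is an assumption based on the table in the question.

```python
import re

# Suggestion 1: every common whitespace character that '\s' matches
# is also matched by '\W', so writing '\s\W+' is redundant.
whitespace = " \t\n\r\f\v"
all_whitespace_is_nonword = all(re.fullmatch(r"\W", ch) for ch in whitespace)

# Suggestion 2: '+' is greedy, so '(.+)' already runs as far as it can;
# adding '$' does not change what is captured here.
text = "Provincia: \nCOPIAPO"  # assumed cell contents
without_anchor = re.search(r"Provincia:\s*(.+)", text).group(1)
with_anchor = re.search(r"Provincia:\s*(.+)$", text).group(1)

print(all_whitespace_is_nonword)
print(without_anchor, with_anchor)
```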
