I'm using a text file that is updated every day and I want to extract the values from the string and append them to a DataFrame. The text file doesn't change structurally (at least mostly), just the values are updated, so I've written some code to extract the values preceding the keywords in my list.
To make my life easier I've tried to build a for-loop to automate as much as possible, but frustratingly I'm stuck at appending the values I've sourced to my DataFrame. All the tutorials I've looked at are dealing with ranges in for loops.
empty_df = pd.DataFrame(columns = ["date","builders","miners","roofers"])
text = "On 10 May 2022, there were 400 builders living in Rome, there were also no miners and approximately 70 roofers"
text = text.split()
profession = ["builders","miners","roofers"]
for i in text:
if i in profession:
print(text[text.index(i) - 1] " " i)
400 builders
no miners
70 roofers
I've tried to append using:
for i in text:
if i in profession:
empty_df.append(text[text.index(i) - 1] " " i)
But it doesn't work, and I'm really unsure how to append multiple calculated variables.
So what I want to know is:
- How can I append the resulting values to my empty dataframe in the correct columns.
- How could I convert the 'no' or 'none' into zero.
- How can I also incorporate the date each time I update this?
Thanks
CodePudding user response:
1)How can I append the resulting values to my empty dataframe in the correct columns.
I think you need to do a preprocess before, you should iterate in the sentence when you detect a keyword (builders) you take the words before and after (with spliting by ' '). Now the word before and after you try to transform it into a float if it works you stock the result in list of list : ['builders',400] and you have searched for everything you able to add the rows with all the informations
2) How could I convert the 'no' or 'none' into zero.
With my method you don't need if you were enable to transform the words before or after in a float, then it should be 0
3) How can I also incorporate the date each time I update this?
https://theautomatic.net/2018/12/18/2-packages-for-extracting-dates-from-a-string-of-text-in-python/
CodePudding user response:
If you just want a plug and play solution, this will get you where you need to go:
from dateutil import parser
import numpy as np
empty_df = pd.DataFrame(columns = ["builders","miners","roofers","date"])
text = "On 10 May 2022, there were 400 builders living in Rome, there were also no miners and approximately 70 roofers"
date = parser.parse(text.split(',')[0]).strftime('%d %B %Y')
foo = text.split()
profession = ["builders","miners","roofers"]
total_no = []
for i in foo:
if i in profession:
total_no.append(foo[foo.index(i) - 1])
empty_df.loc[len(empty_df)] = total_no [date]
empty_df.replace('no', np.nan)
Outputting:
builders miners roofers date
0 400 NaN 70 10 May 2022