I'm new to pandas and regular expressions.
I want to extract the dates in a given format from a file, sort them chronologically, and show each date's original index position in a new column.
The requirement is that the main code reads the file and calls a function that does this.
So I wrote a function that produces the right results, but a warning appears when the code runs, and I don't fully understand why.
Maybe the warning appears when the DataFrame data is passed to the function.
Any recommendations? Can anybody help me find out what I'm doing wrong?
I wrote the following code:
def date_sorter():
    import re
    import pandas as pd
    import sys
    # date_reader is the global list of date strings built in the main code
    df = pd.DataFrame(date_reader, columns=['year'])
    # search the digits
    df1 = df['year'].str.extract(r'(\d?\d\d\d)')
    df2 = pd.DataFrame(df1, columns=['year'])
    # count 4 digits
    df2['year'].str.len()
    # order by year
    df3 = df2.sort_values(["year"])
    # add new column with the original index positions
    tabla = df3.reset_index()
    return tabla  # return the results to show in main code
# main code that calls the function
import pandas as pd
import re
with open('text_with_lot_of_text_and_dates.txt') as file:
    for lines in file:
        tot6 = file.read()
date_reader = re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Nov|Dec)[a-z]* (?:\d{1,2}, )?\d{4}',tot6)
result = date_sorter()
print(result.T.head(1).T)
The main code shows:
['24 Jan 2001', '10 Sep 2004', '26 May 1982', '28 June 2002', '06 May 1972', '30 Nov 2007', '28 June 1994', '14 Jan 1981', '11 February 1985', '10 Feb 1983', '05 Feb 1992', '14 Feb 1995', '30 May 2016', '22 Jan 1996', '11 Nov 2004', '30 May 2001', '02 Feb 1978', '09 Sep 1989', '12 March 1980', '22 June 1990', '28 Sep 2015', '13 Jan 1972', '06 Mar 1974', '26 May 1974' ... ]
0 2001
1 2004
2 1982
3 2002
4 1972
5 2007
The function returns:
a) The function does all the work to create a new column holding the original index position of each year value, and returns it to the main code:
   index  year
0      4  1972
1     21  1972
2     47  1972
3     72  1972
4    162  1973
5    154  1973
The main code shows only the first column (the new index column) along with the row index:
   index
0      4
1     21
2     47
3     72
4    162
5    154
The warning that appears:
FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)
Thanks in advance!
CodePudding user response:
df1=df['year'].str.extract(r'(\d?\d\d\d)')
The warning comes from this line. str.extract accepts a parameter named expand. In your current pandas version, leaving it unset behaves like expand=False, but in a future version of pandas the default will be expand=True. To remove the warning, pass the parameter explicitly, like this:
df1=df['year'].str.extract(r'(\d?\d\d\d)', expand=False)
For more information, see the entry for str.extract in the pandas documentation.
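To make the difference concrete, here is a minimal sketch of what the two settings return for a pattern with a single capture group. The sample dates come from the question; the variable names and the simplified pattern `(\d{4})` are assumptions for illustration only.

```python
import pandas as pd

# A few of the date strings from the question
dates = pd.Series(['24 Jan 2001', '10 Sep 2004', '26 May 1982'], name='year')

# expand=False with a single capture group returns a Series
as_series = dates.str.extract(r'(\d{4})', expand=False)

# expand=True always returns a DataFrame, one column per capture group
as_frame = dates.str.extract(r'(\d{4})', expand=True)

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_series.tolist())        # ['2001', '2004', '1982']
```

With expand=True the extracted years end up in a column labelled 0, so code that later looks for a column called 'year' would need to rename it first.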
CodePudding user response:
Your current code should keep working for now despite the FutureWarning. The warning tells you that the behavior of .extract() will change in a future release. The "expand" keyword currently defaults to None, which is treated as expand=False. After the change (whenever that may be), "expand" will default to True.
This change could affect your code if it relies on the current expand=None behavior.
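One way to avoid depending on the default is to pass expand explicitly. The sketch below is only an illustration, not the asker's original function, and it makes several assumptions: the function takes the list of date strings as an argument instead of reading a global, the sample text is made up, the year pattern is simplified to `(\d{4})`, and "Oct" is added to the month alternation, which the pattern in the question omits.

```python
import re
import pandas as pd

# Month alternation as in the question, with "Oct" added (an assumption)
DATE_PATTERN = (r'(?:\d{1,2} )?'
                r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* '
                r'(?:\d{1,2}, )?\d{4}')

def date_sorter(date_strings):
    """Return the years sorted ascending with their original positions."""
    df = pd.DataFrame(date_strings, columns=['date'])
    # Explicit expand=False: a single capture group yields a Series
    df['year'] = df['date'].str.extract(r'(\d{4})', expand=False)
    # A stable sort keeps equal years in their original order
    ordered = df[['year']].sort_values('year', kind='mergesort')
    # reset_index() moves the original positions into an 'index' column
    return ordered.reset_index()

# Hypothetical sample text standing in for the file contents
text = "Seen 24 Jan 2001, again 10 Sep 2004, first on 26 May 1982."
result = date_sorter(re.findall(DATE_PATTERN, text))
print(result)  # prints both the 'index' and 'year' columns
```

Printing result directly shows both columns. The expression result.T.head(1).T in the question keeps only the first column, which would explain why the main code displays just the index column.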