I have a list of strings, given below, from which I want to extract only the numbers, and then create a column based on the output.
['CGST- INPUT 9% MAHARASHTRA',
'SGST-INPUT 9% MAHARASHTRA',
'CGST INPUT @6% MAHARASHTRA',
'SGST INPUT @6% MAHARASHTRA',
'CGST- INPUT 2.50% MAHARASHTRA',
'SGST-INPUT 2.50% MAHARASHTRA',
'TDS ON OFFICE RENT',
'TDS ON CONTRACTOR',
'TDS ON CONSULTANTS',
'TDS ON OFFICE RENT (COMPANY)',
'TDS ON CONSULTANY FEE']
The output should look like this:
Rate CGST SGST TDS
9 XX XX XX
6 XX XX XX
2.50 XX XX XX
I have a few columns in a DataFrame, which I have converted to the list above. Each column contains values that I want to sum and show separately according to the rate mentioned in each list item.
CodePudding user response:
To extract only the numbers from a list of strings and create a column in a Pandas DataFrame from the result, you can combine the apply() method with a regular expression pattern and the findall() function from the re module. Here is an example of how to do this:
import pandas as pd
import re
# Create a list of strings
strings = ['abc123', 'def456', 'ghi789', 'jkl012']
# Convert the list to a DataFrame
df = pd.DataFrame({'strings': strings})
# Define a regular expression pattern that matches one or more digits
pattern = r'\d+'
# Use the apply() method with re.findall() to extract the numbers
df['numbers'] = df['strings'].apply(lambda x: re.findall(pattern, x))
# Print the DataFrame
print(df)
This code first creates a list of strings and converts it to a DataFrame. It then defines a regular expression pattern that matches runs of digits and uses the apply() method with re.findall() to apply the pattern to each string in the strings column. The resulting list of matches is stored in a new column called numbers.
The output of this code will be a DataFrame with two columns: strings, which contains the original strings, and numbers, which contains a list of the digit strings found in each string (for example ['123'] for 'abc123').
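Keep in mind that re.findall() returns every match, so a string that contains several numbers produces a list with several entries. A digits-only pattern also splits a decimal such as 2.50 into two separate matches, and a string with no digits gives an empty list. If your strings contain decimals or more than one number, you may need to modify the regular expression pattern or use a different approach to extract the numbers.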
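As a rough sketch of such a modified pattern, the snippet below looks only for a number that is followed by a percent sign, which is how the rates appear in the strings from the question. The names rate_pattern and extract_rate are made up for this illustration, and the rule "the rate is the number before %" is an assumption about the data, not something the question states.

```python
import re

# A few of the strings from the question.
names = [
    'CGST- INPUT 9% MAHARASHTRA',
    'CGST INPUT @6% MAHARASHTRA',
    'SGST-INPUT 2.50% MAHARASHTRA',
    'TDS ON OFFICE RENT',
]

# Digits with an optional decimal part, followed by '%'
# (optional whitespace before the '%' is tolerated).
rate_pattern = re.compile(r'(\d+(?:\.\d+)?)\s*%')


def extract_rate(text):
    """Return the percentage found in text as a float, or None if there is none."""
    match = rate_pattern.search(text)
    return float(match.group(1)) if match else None


rates = [extract_rate(name) for name in names]
print(rates)  # [9.0, 6.0, 2.5, None]
```

Returning None for strings without a rate (the TDS entries) keeps those rows visible, so you can decide later how to treat them.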
CodePudding user response:
A regular expression that will identify numbers in a string (including those with decimal fractions) is:
r'[-+]?[0-9]*\.?[0-9]+'
For example:
import re
mystring = 'abc50def6.75ghi'
# Optional sign, optional integer part, optional decimal point, then at least one digit
pattern = r'[-+]?[0-9]*\.?[0-9]+'
print(list(map(float, re.findall(pattern, mystring))))
Output:
[50.0, 6.75]
Having extracted your numbers, you can then use these values to build your DataFrame.
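One possible way to do that last step is sketched below. It assumes the strings from the question are the column names of a DataFrame and that the tax type is the first word of each name. The amounts are invented purely for illustration, and the 'no rate' label for names without a percentage (the TDS columns) is a hypothetical choice, not part of the question.

```python
import re

import pandas as pd

# Hypothetical data: column names taken from the question, amounts made up.
df = pd.DataFrame({
    'CGST- INPUT 9% MAHARASHTRA': [10.0, 20.0],
    'SGST-INPUT 9% MAHARASHTRA': [10.0, 20.0],
    'CGST INPUT @6% MAHARASHTRA': [3.0, 6.0],
    'SGST INPUT @6% MAHARASHTRA': [3.0, 6.0],
    'TDS ON OFFICE RENT': [5.0, 5.0],
})

# Number (with optional decimal part) that precedes a percent sign.
rate_pattern = re.compile(r'(\d+(?:\.\d+)?)\s*%')

records = []
for name, total in df.sum().items():
    match = rate_pattern.search(name)
    records.append({
        # Leading letters of the column name: CGST, SGST or TDS.
        'Type': re.match(r'[A-Za-z]+', name).group(0),
        # Keep the rate as text; names without a percentage get a placeholder.
        'Rate': match.group(1) if match else 'no rate',
        'Total': total,
    })

# One row per rate, one column per tax type, summed totals in the cells.
summary = pd.DataFrame(records).pivot_table(
    index='Rate', columns='Type', values='Total', aggfunc='sum', fill_value=0
)
print(summary)
```

The rate is kept as a string here so that entries without a percentage can share the same index. If every column is guaranteed to carry a rate, converting it to float would be the more natural choice.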