I have a .txt that goes like this:
USA
Arizona - New Mexico
Interstate 40
Interstate 10
South Dakota - Minneapolis
Interstate 90
South Carolina - Washington
Arizona - California
Interstate 40
Interstate 10
Interstate 8
ANOTHER COUNTRY
State A - State B
Highway 1
Highway 2
Highway 3
...
...
I want to create a DataFrame and a CSV in pandas, where the first column contains the States, and the second column the Highway.
States HW_Number
Arizona - New Mexico Interstate 40
Arizona - New Mexico Interstate 10
South Dakota - Minneapolis Interstate 90
Arizona - California Interstate 40
Arizona - California Interstate 10
Arizona - California Interstate 9
State A - State B Highway 1
State A - State B Highway 2
State A - State B Highway 3
How can I manage to do that? Not all the states have the same amount of Highways, and can have 0 Highways, and those that have 0, I do not want to be integrated in the DataFrame.
A column with the Country could be integrated as well.
Thank you
CodePudding user response:
As I said, a pretty easy file to parse:
import pandas as pd
rows = []
state = None
for line in open('x.txt'):
if line[0] == ' ':
continue
line = line.strip()
if not line:
continue
if '-' in line:
state = line
else:
rows.append( (state,line) )
df = pd.DataFrame(rows, columns=['state','road'])
print(df)
Output:
----------
state road
0 Arizona - New Mexico Interstate 40
1 Arizona - New Mexico Interstate 10
2 South Dakota - Minneapolis Interstate 90
3 Arizona - California Interstate 40
4 Arizona - California Interstate 10
5 Arizona - California Interstate 8
6 State A - State B Highway 1
7 State A - State B Highway 2
8 State A - State B Highway 3
CodePudding user response:
You can iterate through the rows and use characteristics of your structured data to create lists. These lists can be used to make a dataframe or series.
- read the lines from the file into a list (f.readlines())
- remove empty rows
- keep track of current state (doesn't end with a number)
- append the states and highways to lists
- use lists to make a dataframe or series
import pandas as pd
import io
f = io.StringIO(
"""
USA
Arizona - New Mexico
Interstate 40
Interstate 10
South Dakota - Minneapolis
Interstate 90
South Carolina - Washington
Arizona - California
Interstate 40
Interstate 10
Interstate 8
ANOTHER COUNTRY
State A - State B
Highway 1
Highway 2
Highway 3
"""
)
lines = f.readlines()
states = []
hw_numbers = []
current_state = None
for line in lines:
line = line.strip() #removes \n
if line == '': #remove empty rows
continue
elif line[-1].isdigit() == False: #if not a digit, then it's a state
current_state = line
else: #if it is a digit, then it's a highway
states.append(current_state)
hw_numbers.append(line)
pd.DataFrame({
'States':states,
'HW_number':hw_numbers
})