Python dataframe from 2 text files (different number of columns)-CodePudding

I need to make a dataframe from two txt files.

The first txt file looks like this Street_name space id. The second txt file loks like this City_name space id.

Example:

text file 1:

Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567

text file 2:

Winnipeg 4321
Winnipeg 1234
Ste Anne 1234567

I need to make one dataframe out of this. Sometimes there is just one word for Street_name, and sometimes more. The same goes for City_name.

I get an error: ParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 3 because I'm trying to put both words for street name into the same column, but don't know how to do it. I want one column for street name (no matter if it consists of one or more words, one for city name and one for id.

I want a df with 3 rows and 3 cols.

Thanks!

CodePudding user response：

It is NOT correct CSV and it may need to read it on your own.

You can normal open(), read() and later split on new line to create list of lines. And later you can use for-loop and use line.rsplit(" ", 1) to split line on last space.

Minimal working example:

I use io to simulate file in memory - so everyone can simply copy and test it - but you should use open()

text = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''

import io

#with open('filename') as fh:
with io.StringIO(text) as fh:
    lines = fh.read().splitlines()

print(lines)

lines = [line.rsplit(" ", 1) for line in lines]

print(lines)

import pandas as pd

df = pd.DataFrame(lines, columns=['name', 'name'])
print(df)

Result:

['Roseberry st 1234', 'Brooklyn st 4321', 'Wolseley 1234567']

[['Roseberry st', '1234'], ['Brooklyn st', '4321'], ['Wolseley', '1234567']]

           name   number
0  Roseberry st     1234
1   Brooklyn st     4321
2      Wolseley  1234567

EDIT:

read_csv can use regex to define separator (i.e. sep="\s " for many spaces) and it can even use lookahead/loopbehind ((?=...)/(?<=...)) to check if there is digit after space without catching it as part of separator.

text = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''

import io
import pandas as pd

#df = pd.read_csv('filename', names=['name', 'number'], sep='\s(?=\d)', engine='python')

df = pd.read_csv(io.StringIO(text), names=['name', 'number'], sep='\s(?=\d)', engine='python')

print(df)

Result:

           name   number
0  Roseberry st     1234
1   Brooklyn st     4321
2      Wolseley  1234567

And later you can try to connect both dataframe using .join(), .merge() with parameter on= (or something similar) like in SQL query.

text1 = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''

text2 = '''Winnipeg 4321
Winnipeg 1234
Ste Anne 1234567'''

import io
import pandas as pd

df1 = pd.read_csv(io.StringIO(text1), names=['street name', 'id'], sep='\s(?=\d)', engine='python')
df2 = pd.read_csv(io.StringIO(text2), names=['city name', 'id'], sep='\s(?=\d)', engine='python')

print(df1)
print(df2)

df = df1.merge(df2, on='id')

print(df)

Result:

    street name       id
0  Roseberry st     1234
1   Brooklyn st     4321
2      Wolseley  1234567

  city name       id
0  Winnipeg     4321
1  Winnipeg     1234
2  Ste Anne  1234567

    street name       id city name
0  Roseberry st     1234  Winnipeg
1   Brooklyn st     4321  Winnipeg
2      Wolseley  1234567  Ste Anne

Pandas doc: Merge, join, concatenate and compare

CodePudding user response：

There's nothing that I'm aware of in pandas that does this automatically. Below, I built a script that will merge those addresses (addy st) into a single column, then merges the two data frames into one based on the "id".

I assume your actual text files are significantly larger, so assuming they follow the pattern set in the two examples, this script should work fine.

Basically, this code turns each line of text in the file into a list, then combines lists of length 3 into length 2 by combining the first two list items. After that, it turns the "list of lists" into a dataframe and merges those dataframes on column "id".

Couple caveats:

Make sure you set the correct text file paths
Make sure the first line of the text files contains 2, single string column headers (ie: "address id") or (ie: "city id")
Make sure each text file id column header is named "id"

import pandas as pd
import numpy as np

# set both text file paths (you may need full path i.e. C:\Users\Name\bla\bla\bla\text1.txt)
text_path_1 = r'text1.txt'
text_path_2 = r'text2.txt'

# declares first text file
with open(text_path_1) as f1:
    text_file_1 = f1.readlines()
# declares second text file
with open(text_path_2) as f2:
    text_file_2 = f2.readlines()

# function that massages data into two columns (to put "st" into same column as address name)
def data_massager(text_file_lines):
    data_list = []
    for item in text_file_lines:
        stripped_item = item.strip('\n')
        split_stripped_item = stripped_item.split(' ')

        if len(split_stripped_item) == 3:
            split_stripped_item[0:2] = [' '.join(split_stripped_item[0 : 2])]

        data_list.append(split_stripped_item)
    return data_list

# runs function on both text files
data_list_1 = data_massager(text_file_1)
data_list_2 = data_massager(text_file_2)

# creates dataframes on both text files
df1 = pd.DataFrame(data_list_1[1:], columns = data_list_1[0])
df2 = pd.DataFrame(data_list_2[1:], columns = data_list_2[0])

# merges data based on id (make sure both text files' id is named "id")
merged_df = df1.merge(df2, how='left', on='id')

# prints dataframe (assuming you're using something like jupyter-lab)
merged_df

CodePudding user response：

pandas has strong support for strings. You can make the lines of each file into a Series and then use a regular expression to separate the fields into separate columns. I assume that "id" is the common value that links the two datasets, so it can become the dataframe index and the columns can just be added together.

import pandas as pd

street_series = pd.Series([line.strip() for line in open("text1.txt")])
street_df = street_series.str.extract(r"(.*?) (\d )$")
del street_series
street_df.rename({0:"street", 1:"id"}, axis=1, inplace=True)
street_df.set_index("id", inplace=True)
print(street_df)

city_series = pd.Series([line.strip() for line in open("text2.txt")])
city_df = city_series.str.extract(r"(.*?) (\d )$")
del city_series
city_df.rename({0:"city", 1:"id"}, axis=1, inplace=True)
city_df.set_index("id", inplace=True)
print(city_df)

street_df["city"] = city_df["city"]
print(street_df)