I want to extract the data from the txt file shown below and store it in a pandas DataFrame that has 8 columns.
Lorem | Ipsum | is | simply | dummy
text | of | the | printing | and
typesetting | industry. | Lorem
more | recently | with | desktop | publishing | software | like | Aldus
Ipsum | has | been | the | industry's
standard | dummy | text | ever | since | the | 1500s
took | a | galley | of | type | and
scrambled | it | to | make | a | type | specimen | book
It | has | survived | not | only | five | centuries, | but
the | leap | into | electronic | typesetting
remaining | essentially | unchanged
It | was | popularised | in | the | 1960s | with | the
Lorem | Ipsum | passages, | and
PageMaker | including | versions | of | Lorem | Ipsum
The values on each line are separated by a pipe sign, and each value belongs in one cell of that row. My end goal is to have the data inserted into a DataFrame in the format below.
Column 1 | Column 2 | Column 3 | Column 4 | Column 5 | Column 6 | Column 7 | Column 8
-------------------------------------------------------------------------------------
Lorem | Ipsum | is | simply | dummy |
text | of | the | printing | and |
typesetting| industry. | Lorem |
more | recently | with | desktop | publishing| software | like | Aldus |
and so on.....
I tried the code below, but I am unable to add the data to the DataFrame dynamically.
import pandas as pd
with open(file) as f:
    data = f.read().split('\n')
columns = ['Column 1', 'Column 2', 'Column 3', 'Column 4', 'Column 5', 'Column 6', 'Column 7', 'Column 8']
df = pd.DataFrame(columns=columns)
for i in data:
    row = i.split(' | ')
    df = df.append({'Column 1': f'{row[0]}', 'Column 2': f'{row[1]}', 'Column 3': f'{row[2]}', 'Column 4': f'{row[3]}', 'Column 5': f'{row[4]}'}, ignore_index = True)
The above is a manual way of adding a row's cells to a DataFrame, but I need a dynamic way, i.e. how do I append the rows so that every cell gets added, whatever the number of cells in a row?
CodePudding user response:
Use read_csv to read the txt file:
names = [f"Column {i}" for i in range(1, 9)]
df = pd.read_csv(file, sep=r"\s+\|\s+", names=names, header=None, engine="python")
print (df)
Column 1 Column 2 Column 3 Column 4 Column 5 Column 6 \
0 Lorem Ipsum is simply dummy None
1 text of the printing and None
2 typesetting industry. Lorem None None None
3 more recently with desktop publishing software
4 Ipsum has been the industry's None
5 standard dummy text ever since the
6 took a galley of type and
7 scrambled it to make a type
8 It has survived not only five
9 the leap into electronic typesetting None
10 remaining essentially unchanged None None None
11 It was popularised in the 1960s
12 Lorem Ipsum passages, and None None
13 PageMaker including versions of Lorem Ipsum
Column 7 Column 8
0 None None
1 None None
2 None None
3 like Aldus
4 None None
5 1500s None
6 None None
7 specimen book
8 centuries, but
9 None None
10 None None
11 with the
12 None None
13 None None
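To try the read_csv approach without a file on disk, here is a minimal sketch. It assumes an in-memory sample (the `sample` buffer is a hypothetical stand-in for the txt file) and passes engine="python" explicitly, since a regex separator is only handled by the Python parser engine.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the txt file (three of the lines above).
sample = io.StringIO(
    "Lorem | Ipsum | is | simply | dummy\n"
    "typesetting | industry. | Lorem\n"
    "more | recently | with | desktop | publishing | software | like | Aldus\n"
)

names = [f"Column {i}" for i in range(1, 9)]

# The regex separator matches a pipe surrounded by whitespace.
# Rows with fewer than 8 fields are padded with missing values.
df = pd.read_csv(sample, sep=r"\s+\|\s+", names=names, header=None, engine="python")
print(df)
```

Short rows end up with missing values in the trailing columns, so pd.isna is a convenient way to check for them.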
CodePudding user response:
import pandas as pd
text = """
Lorem | Ipsum | is | simply | dummy
text | of | the | printing | and
typesetting | industry. | Lorem
more | recently | with | desktop | publishing | software | like | Aldus
Ipsum | has | been | the | industry's
standard | dummy | text | ever | since | the | 1500s
took | a | galley | of | type | and
scrambled | it | to | make | a | type | specimen | book
It | has | survived | not | only | five | centuries, | but
the | leap | into | electronic | typesetting
remaining | essentially | unchanged
It | was | popularised | in | the | 1960s | with | the
Lorem | Ipsum | passages, | and
PageMaker | including | versions | of | Lorem | Ipsum
"""
# Create a 'jagged' list of words...
data = [i.split(" | ") for i in text.strip().split("\n")]
# ... which you can pass to pd.DataFrame directly:
columns = [f"Column {i}" for i in range(1, 9)]
df = pd.DataFrame(data, columns=columns)
df:
Column 1 Column 2 Column 3 Column 4 Column 5 Column 6 Column 7 Column 8
0 Lorem Ipsum is simply dummy None None None
1 text of the printing and None None None
2 typesetting industry. Lorem None None None None None
3 more recently with desktop publishing software like Aldus
4 Ipsum has been the industry's None None None
5 standard dummy text ever since the 1500s None
6 took a galley of type and None None
7 scrambled it to make a type specimen book
8 It has survived not only five centuries, but
9 the leap into electronic typesetting None None None
10 remaining essentially unchanged None None None None None
11 It was popularised in the 1960s with the
12 Lorem Ipsum passages, and None None None None
13 PageMaker including versions of Lorem Ipsum None None
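If the maximum number of cells is not known in advance, the same idea can be sketched without hard-coding 8 columns: build the DataFrame first and derive the column names from its width afterwards. The `lines` list below is a hypothetical sample standing in for the file contents.

```python
import pandas as pd

# Hypothetical sample lines standing in for the file contents.
lines = [
    "Lorem | Ipsum | is | simply | dummy",
    "typesetting | industry. | Lorem",
    "more | recently | with | desktop | publishing | software | like | Aldus",
    "",  # blank line, e.g. left over from a trailing newline
]

# Split each non-blank line into its cells; rows may have different lengths.
rows = [line.split(" | ") for line in lines if line.strip()]

# pandas pads the shorter rows with missing values.
df = pd.DataFrame(rows)

# Name the columns after the fact, based on the widest row.
df.columns = [f"Column {i + 1}" for i in range(df.shape[1])]
print(df)
```

Skipping blank lines avoids an extra row that would otherwise hold a single empty string.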
CodePudding user response:
You can do it by creating a series for each line and then creating the dataframe by concatenating those series.
import pandas as pd
with open(file) as f:
    data = f.read().split('\n')
lines = []
for i in data:
    row = i.split(' | ')
    lines.append(pd.Series(row))
df = pd.concat(lines, axis=1).T
You will dynamically get the right number of columns.
The columns will be named just 0, 1, 2, ... but if you need to rename them to Column 1, Column 2, ... you can easily do it via:
df = df.rename(columns={c: f"Column {c + 1}" for c in df.columns})
CodePudding user response:
Are you trying to append a row with 5 columns to a DataFrame with 8 columns? Try reading "Append Dataframe with Different Number of Columns", and also check the documentation page "Ways to Merge Data on Pandas".
That is probably enough to solve this problem.
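As a rough illustration of combining frames with different numbers of columns, here is a minimal sketch using pd.concat, which aligns on column names and fills the gaps with missing values. The frame names `wide` and `narrow` are made up for this example. Note that DataFrame.append, used in the question, was removed in pandas 2.0, so pd.concat is the replacement on current versions.

```python
import pandas as pd

# A row that fills all 8 columns.
wide = pd.DataFrame(
    [["more", "recently", "with", "desktop", "publishing", "software", "like", "Aldus"]],
    columns=[f"Column {i}" for i in range(1, 9)],
)

# A row that only has 5 cells, so it only gets 5 columns.
narrow = pd.DataFrame(
    [["Lorem", "Ipsum", "is", "simply", "dummy"]],
    columns=[f"Column {i}" for i in range(1, 6)],
)

# concat aligns on column names; cells missing from `narrow` become NaN.
df = pd.concat([wide, narrow], ignore_index=True)
print(df)
```

Concatenating row by row in a loop is slow for large files, so collecting all rows first and building the DataFrame once is usually the better choice.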