i've a big database imported from a csv file (using pd.read_csv), here's how it look in csv file:
0 1 2
0 Milan Draw Juventus
1 2.47 3.24 3.03
2 2.45 3.23 3.06
0 Napoli Draw Parma
1 1.45 4.41 7.38
2 1.45 4.40 7.36
3 1.46 4.39 7.33
4 1.47 4.33 7.14
5 1.47 4.33 7.13
6 1.47 4.34 7.10
7 1.43 4.54 7.70
0 Fiorentina Draw Pisa
1 2.86 3.50 2.45
2 2.92 3.51 2.40
3 3.14 3.55 2.25
4 2.79 3.45 2.61
I need the dataframe to look like this:
0 1 2 3 4
0 Milan Juventus 2.47 3.24 3.03
1 Milan Juventus 2.45 3.23 3.06
2 Napoli Parma 1.45 4.41 7.38
3 Napoli Parma 1.45 4.40 7.36
4 Napoli Parma 1.46 4.39 7.33
5 Napoli Parma 1.47 4.33 7.14
6 Napoli Parma 1.47 4.33 7.13
7 Napoli Parma 1.47 4.34 7.10
8 Napoli Parma 1.43 4.54 7.70
9 Fiorentina Pisa 2.86 3.50 2.45
10 Fiorentina Pisa 2.92 3.51 2.40
11 Fiorentina Pisa 3.14 3.55 2.25
12 Fiorentina Pisa 2.79 3.45 2.61
It's very easy to do it in excel with a formula but i would like to be able to do it in python since the csv file is very very big so managing with pandas is way faster but i don't know if it is possible nd how to do it...thanks!
CodePudding user response:
Here's a way to do what your question asks:
df = pd.read_csv('eestlane.txt', sep=r"\s ")
df = df.reset_index().rename(columns={'index':'zero_for_names'})
df[['new1','new2']] = df.loc[df['zero_for_names'] == 0, ['0','1']].reindex(df.index, method='ffill')
df = df[df['zero_for_names'] != 0].drop(columns='zero_for_names').reset_index(drop = True)
df=df[['new1','new2','0','1','2']]
df.columns=[str(i) for i in range(len(df.columns))]
Output:
0 1 2 3 4
0 Milan Draw 2.47 3.24 3.03
1 Milan Draw 2.45 3.23 3.06
2 Napoli Draw 1.45 4.41 7.38
3 Napoli Draw 1.45 4.40 7.36
4 Napoli Draw 1.46 4.39 7.33
5 Napoli Draw 1.47 4.33 7.14
6 Napoli Draw 1.47 4.33 7.13
7 Napoli Draw 1.47 4.34 7.10
8 Napoli Draw 1.43 4.54 7.70
9 Fiorentina Draw 2.86 3.50 2.45
10 Fiorentina Draw 2.92 3.51 2.40
11 Fiorentina Draw 3.14 3.55 2.25
12 Fiorentina Draw 2.79 3.45 2.61
Explanation:
- use
read_csv
to get a 3-column dataframe with an index that contains 0 only for rows with names - use
reset_index
to get an index without duplicates, andrename
to change the original index to a column namedzero_for_names
- create two new columns
new1, new2
and use masking onzero_for_names
together withreindex
and itsffill
method arg to prepare these columns to be the first two columns of the target output specified in the question - use
zero_for_names
to filter out original name rows, then drop this column and usereset_index
to get a new index without gaps - rearrange the columns into the desired order
- update
df.columns
to match the desired column names (integers as strings) shown in the question.
CodePudding user response:
Try:
def isfloat(x):
try:
float(x)
return True
except ValueError:
return False
df = pd.read_csv("your_file.csv", sep=r"\s ") # <-- you may to adjust sep= accordingly
# make sure the columns are of type int
df.columns = map(int, df.columns)
mask = df.applymap(isfloat)
x = df[mask].copy()
df[mask] = np.nan
df[[3, 4, 5]] = x
df[[0, 1, 2]] = df[[0, 1, 2]].ffill()
df = df.dropna().reset_index(drop=True).drop(columns=1)
df.columns = range(len(df.columns))
print(df)
Prints:
0 1 2 3 4
0 Milan Juventus 2.47 3.24 3.03
1 Milan Juventus 2.45 3.23 3.06
2 Napoli Parma 1.45 4.41 7.38
3 Napoli Parma 1.45 4.40 7.36
4 Napoli Parma 1.46 4.39 7.33
5 Napoli Parma 1.47 4.33 7.14
6 Napoli Parma 1.47 4.33 7.13
7 Napoli Parma 1.47 4.34 7.10
8 Napoli Parma 1.43 4.54 7.70
9 Fiorentina Pisa 2.86 3.50 2.45
10 Fiorentina Pisa 2.92 3.51 2.40
11 Fiorentina Pisa 3.14 3.55 2.25
12 Fiorentina Pisa 2.79 3.45 2.61