Modify pandas dataframe imported from csv file-CodePudding

i've a big database imported from a csv file (using pd.read_csv), here's how it look in csv file:

      0       1       2
0   Milan   Draw    Juventus
1    2.47    3.24    3.03
2    2.45    3.23    3.06
0   Napoli  Draw    Parma
1    1.45    4.41    7.38
2    1.45    4.40    7.36
3    1.46    4.39    7.33
4    1.47    4.33    7.14
5    1.47    4.33    7.13
6    1.47    4.34    7.10
7    1.43    4.54    7.70
0   Fiorentina  Draw    Pisa
1    2.86    3.50    2.45
2    2.92    3.51    2.40
3    3.14    3.55    2.25
4    2.79    3.45    2.61

I need the dataframe to look like this:

      0         1         2       3       4
0   Milan   Juventus     2.47    3.24    3.03
1   Milan   Juventus     2.45    3.23    3.06
2   Napoli  Parma       1.45     4.41    7.38
3   Napoli  Parma       1.45     4.40    7.36
4   Napoli  Parma       1.46     4.39    7.33
5   Napoli  Parma       1.47     4.33    7.14
6   Napoli  Parma       1.47     4.33    7.13
7   Napoli  Parma       1.47     4.34    7.10
8   Napoli  Parma       1.43     4.54    7.70
9   Fiorentina  Pisa     2.86    3.50    2.45
10  Fiorentina  Pisa     2.92    3.51    2.40
11  Fiorentina  Pisa     3.14    3.55    2.25
12  Fiorentina  Pisa     2.79    3.45    2.61

It's very easy to do it in excel with a formula but i would like to be able to do it in python since the csv file is very very big so managing with pandas is way faster but i don't know if it is possible nd how to do it...thanks!

CodePudding user response：

Here's a way to do what your question asks:

df = pd.read_csv('eestlane.txt', sep=r"\s ")
df = df.reset_index().rename(columns={'index':'zero_for_names'})
df[['new1','new2']] = df.loc[df['zero_for_names'] == 0, ['0','1']].reindex(df.index, method='ffill')
df = df[df['zero_for_names'] != 0].drop(columns='zero_for_names').reset_index(drop = True)
df=df[['new1','new2','0','1','2']]
df.columns=[str(i) for i in range(len(df.columns))]

Output:

             0     1     2     3     4
0        Milan  Draw  2.47  3.24  3.03
1        Milan  Draw  2.45  3.23  3.06
2       Napoli  Draw  1.45  4.41  7.38
3       Napoli  Draw  1.45  4.40  7.36
4       Napoli  Draw  1.46  4.39  7.33
5       Napoli  Draw  1.47  4.33  7.14
6       Napoli  Draw  1.47  4.33  7.13
7       Napoli  Draw  1.47  4.34  7.10
8       Napoli  Draw  1.43  4.54  7.70
9   Fiorentina  Draw  2.86  3.50  2.45
10  Fiorentina  Draw  2.92  3.51  2.40
11  Fiorentina  Draw  3.14  3.55  2.25
12  Fiorentina  Draw  2.79  3.45  2.61

Explanation:

use read_csv to get a 3-column dataframe with an index that contains 0 only for rows with names
use reset_index to get an index without duplicates, and rename to change the original index to a column named zero_for_names
create two new columns new1, new2 and use masking on zero_for_names together with reindex and its ffill method arg to prepare these columns to be the first two columns of the target output specified in the question
use zero_for_names to filter out original name rows, then drop this column and use reset_index to get a new index without gaps
rearrange the columns into the desired order
update df.columns to match the desired column names (integers as strings) shown in the question.

CodePudding user response：

Try:

def isfloat(x):
    try:
        float(x)
        return True
    except ValueError:
        return False


df = pd.read_csv("your_file.csv", sep=r"\s ")  # <-- you may to adjust sep= accordingly

# make sure the columns are of type int
df.columns = map(int, df.columns)

mask = df.applymap(isfloat)
x = df[mask].copy()
df[mask] = np.nan
df[[3, 4, 5]] = x

df[[0, 1, 2]] = df[[0, 1, 2]].ffill()
df = df.dropna().reset_index(drop=True).drop(columns=1)
df.columns = range(len(df.columns))

print(df)

Prints:

             0         1     2     3     4
0        Milan  Juventus  2.47  3.24  3.03
1        Milan  Juventus  2.45  3.23  3.06
2       Napoli     Parma  1.45  4.41  7.38
3       Napoli     Parma  1.45  4.40  7.36
4       Napoli     Parma  1.46  4.39  7.33
5       Napoli     Parma  1.47  4.33  7.14
6       Napoli     Parma  1.47  4.33  7.13
7       Napoli     Parma  1.47  4.34  7.10
8       Napoli     Parma  1.43  4.54  7.70
9   Fiorentina      Pisa  2.86  3.50  2.45
10  Fiorentina      Pisa  2.92  3.51  2.40
11  Fiorentina      Pisa  3.14  3.55  2.25
12  Fiorentina      Pisa  2.79  3.45  2.61