Home > OS >  python split the column values of a dataframe
python split the column values of a dataframe

Time:09-24

I am working on python and new to it.I have a dataframe as

Date           Emailable    Lost_Fans   New_Fans    Country
12-10-2020      121134                                 JP
06-11-2020      120859         350         75          JP
18-12-2020      101857        19128       126          JP
29-01-2021      105760         5029      8932          JP
16-02-2021      115437         1279      10956         JP
01-03-2021      115512          517       592          JP    
12-07-2021      137546         18074     40108         JP
11-08-2021      134319         23753     20526         JP
02-09-2021      130156          4355      192          JP
  1. First,I want to find the missing month from the dataframe
  2. to fill the missing month data for lost and new split the last month data value of lost and new into these missing month lost and new column. For example: the april, may june is missing month from the above dataframe and the next data available is for july which contains 18074 in lost and 40108 in new. we need to spilt these values into the missing month(april, may, june) including july.so for lost 18074/4 =4518 to each of the missing month including july. so the final output will be like.

output should be:

Date           Emailable    Lost_Fans   New_Fans    Country

12-10-2020      121134                                 JP
06-11-2020      120859         350         75          JP
18-12-2020      101857        19128       126          JP
29-01-2021      105760         5029      8932          JP
16-02-2021      115437         1279      10956         JP
01-03-2021      115512          517       592          JP 
30-04-2021       0              4518     10027         JP
31-05-2021       0              4518     10027         JP
30-06-2021       0              4518     10027         JP
12-07-2021      137546         4518      10027         JP
11-08-2021      134319         23753     20526         JP
02-09-2021      130156          4355      192          JP

CodePudding user response:

  1. To find the missing months, use the following code:

    
         months = []
         miss_months = []
         for i in range(len(df.Date)):
             if df.Date[i].split('-')[1][0] != '0':
                 months.append(df.Date[i].split('-')[1])
             else:
                 months.append(df.Date[i].split('-')[1][1])
         for j in range(1,13):
             if str(j) in months:
                 pass
             else:
                 miss_months.append(j)
         print(miss_months)
     

Here while extracting the month from the dataframe, I noticed that all the retrieved months were 2 character strings. For example: '07','08','11',etc. So I wrote some code to check if the first character of the retrieved is 0 or not.

  1. To get the final output, use the following code:

    
         def Insert_row(row_number, df, row_value):
             start_upper = 0
             end_upper = row_number
             start_lower = row_number
             end_lower = df.shape[0]
             upper_half = [*range(start_upper, end_upper, 1)]
             lower_half = [*range(start_lower, end_lower, 1)]
             lower_half = [x.__add__(1) for x in lower_half]
             index_ = upper_half   lower_half
             df.index = index_
             df.loc[row_number] = row_value
             df = df.sort_index()
             return df
         row_number = [6,7,8]
         the_date = ['30-04-2021','31-05-2021','30-06-2021']
         for k in range(len(row_number)):
             row_value = [the_date[k],0,18074//4,10027,'JP']
             df = Insert_row(row_number[k], df, row_value)
         print(df)
     

Here, I have created a function Insert_row() to make the row adding easier.

CodePudding user response:

The idea of the code is first to create rows of NaNs for the missing months. Then to fill in the rows by using the following value (bfill). And then divide the result either by 1 if there was no filling involved, or by the number of consecutive NaNs 1:

# If Date is a string, convert to datetime:
df.Date = pd.to_datetime(df.Date, format="%d-%m-%Y")

# Drop the days, keep months only
df.Date = df.Date.dt.to_period("M")

# make it the index and add rows for missing months
df.set_index("Date", inplace=True)
df = df.reindex(pd.period_range(df.index[0], df.index[-1], freq='M'))

# group the consecutive NaNs
group_nans = df.Lost_Fans.isna().groupby(df.Lost_Fans.notna().cumsum())
# for each NaN group, broadcast their length & shift it one row lower
consecutive_nans = group_nans.transform("sum").shift(fill_value=0).astype(int)
divisor = consecutive_nans   1

# fill NaNs backwards, and divide by number of consecutive nans   1    
df.Lost_Fans = df.Lost_Fans.fillna(method="bfill").astype(int) // divisor
df.New_Fans = df.New_Fans.fillna(method="bfill").astype(int) // divisor

df.Emailable = df.Emailable.fillna(0).astype(int)
df.Country = df.Country.fillna(method="bfill")

The resulting DataFrame:

         Emailable  Lost_Fans  New_Fans Country
2020-10     121134          0         0      JP
2020-11     120859        350        75      JP
2020-12     101857      19128       126      JP
2021-01     105760       5029      8932      JP
2021-02     115437       1279     10956      JP
2021-03     115512        517       592      JP
2021-04          0       4518     10027      JP
2021-05          0       4518     10027      JP
2021-06          0       4518     10027      JP
2021-07     137546       4518     10027      JP
2021-08     134319      23753     20526      JP
2021-09     130156       4355       192      JP
  • Related