How to create new column from another dataframe based on conditions-CodePudding

I am trying to join two datasets, but they are not the same or have the same criteria.

Currently I have the dataset below, which contains the number of fires based on month and year, but the months are part of the header and the years are a column.

I would like to add this data, using as target data_medicao column from this other dataset, into a new column (let's hypothetically call it nr_total_queimadas).

The date format is YYYY-MM-DD, but the day doesn't really matter here.

I tried to make a loop of this case, but I think I'm doing something wrong and I don't have much idea how to proceed in this case.

Below an example of how I would like the output with the junction of the two datasets:

I used as an example the case where some dates repeat (which should happen) so the number present in the dataset that contains the number of fires, should also repeat.

CodePudding user response：

First, I assume that the first dataframe is in variable a and the second is in variable b.

To make looking up simpler, we set the index of a to year:

a = a.set_index('year')

Then, we take the years from the data_medicao in the dataframe b:

years = b['medicao'].dt.year

To get the month name from the dataframe b, we use strftime. Then we need to make the month name into lower case so that it matches the column names in a. To do that, we use .str.lower():

month_name_lowercase = b['medicao'].dt.strftime("%B").str.lower()

Then using lookup we can list all the values from dataframe a using indices years and month_name_lowercase:

num_fires = a.lookup(years.values, month_name_lowercase.values)

Finally add the new values into the new column in b:

b['nr_total_quimadas'] = num_fires

So the complete code is like this:

a = a.set_index('year')

years = b['medicao'].dt.year
month_name_lowercase = b['medicao'].dt.strftime("%B").str.lower()

num_fires = a.lookup(years.values, month_name_lowercase.values)

b['nr_total_queimadas'] = num_fires

CodePudding user response：

Assume following data for year vs month. Convert month names to numbers:

columns = ["year","jan","feb","mar"]
data = [
    (2001,110,120,130),
    (2002,210,220,230),
    (2003,310,320,330)
    ]

df = pd.DataFrame(data=data, columns=columns)
month_map = {"jan":"1", "feb":"2", "mar":"3"}
df = df.rename(columns=month_map)

[Out]:
   year    1    2    3
0  2001  110  120  130
1  2002  210  220  230
2  2003  310  320  330

Assume following data for datewise transactions. Extract year and month from date:

columns2 = ["date"]
data2 = [
    ("2001-02-15"),
    ("2001-03-15"),
    ("2002-01-15"),
    ("2002-03-15"),
    ("2003-01-15"),
    ("2003-02-15"),
    ]

df2 = pd.DataFrame(data=data2, columns=columns2)
df2["date"] = pd.to_datetime(df2["date"])
df2["year"] = df2["date"].dt.year
df2["month"] = df2["date"].dt.month

[Out]:
        date  year  month
0 2001-02-15  2001      2
1 2001-03-15  2001      3
2 2002-01-15  2002      1
3 2002-03-15  2002      3
4 2003-01-15  2003      1
5 2003-02-15  2003      2

Join on year:

df2 = df2.merge(df, left_on="year", right_on="year", how="left")

[Out]:
        date  year  month    1    2    3
0 2001-02-15  2001      2  110  120  130
1 2001-03-15  2001      3  110  120  130
2 2002-01-15  2002      1  210  220  230
3 2002-03-15  2002      3  210  220  230
4 2003-01-15  2003      1  310  320  330
5 2003-02-15  2003      2  310  320  330

Compute row-wise sum of months:

df2["nr_total_queimadas"] = df2[list(month_map.values())].apply(pd.Series.sum, axis=1)
df2[["date", "nr_total_queimadas"]]

[Out]:
        date  nr_total_queimadas
0 2001-02-15                 360
1 2001-03-15                 360
2 2002-01-15                 660
3 2002-03-15                 660
4 2003-01-15                 960
5 2003-02-15                 960