Home > Software design >  How to create a dataframe column (with repetition allowed!) based on conditional statements involvin
How to create a dataframe column (with repetition allowed!) based on conditional statements involvin

Time:09-17

I have two dataframes that each have datetime columns. df1 has repeated datetimes, df2 does not.

I'm trying to assign values from df2 to a new column in df1 at every place where the datetimes match, including at all repeated values in df1.

Now, I can do this by making nested for loops and iterating over the index values, but this is decidedly slow and apparently considered bad form with dataframes as it doesn't make use of their most excellent methods.

example (non-functional) code:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'datetime': ['3/10/2000 17:03:00', '3/10/2000 17:03:00', '3/10/2000 17:04:00'],
                'value1': [2, 3, 4],
                'value2': ['x', 'y', 'z']})
df1['datetime'] = pd.to_datetime(df1['datetime'])
print('df1',df1)

df2 = pd.DataFrame({'datetime': ['3/10/2000 17:03:00', '3/10/2000 17:04:00', '3/10/2000 17:05:00'],
                'value1': [2, 3, 4],
                'value2': ['x', 'y', 'z'],
                'value3': ['p', 'd', 'q']})
df2['datetime'] = pd.to_datetime(df2['datetime'])
print('df2',df2)

df1.loc[:, 'value3'] = df2.loc[df1.loc[:, 'datetime'] == df2.loc[:, 'datetime'], 'value3']
print('modified df1',df1)

output:

df1              datetime  value1 value2
0 2000-03-10 17:03:00       2      x
1 2000-03-10 17:03:00       3      y
2 2000-03-10 17:04:00       4      z
df2              datetime  value1 value2 value3
0 2000-03-10 17:03:00       2      x      p
1 2000-03-10 17:04:00       3      y      d
2 2000-03-10 17:05:00       4      z      q
modified df1              datetime  value1 value2 value3
0 2000-03-10 17:03:00       2      x      p
1 2000-03-10 17:03:00       3      y    NaN
2 2000-03-10 17:04:00       4      z    NaN

Note the output (printed as "modified df1") has 'p, NaN, NaN' in df1['value3'] when I need it to have 'p, p, d'.

How to make this work? Is there some sort of recursive flag I can set or is this totally the wrong approach using .loc? Again, I'm trying to avoid bad form (and Pandas warnings!) of doing this with nested for loops. Thank you all for the help.

CodePudding user response:

Stef has the right idea in the comment.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'datetime': ['3/10/2000 17:03:00', '3/10/2000 17:03:00', '3/10/2000 17:04:00'],
                'value1': [2, 3, 4],
                'value2': ['x', 'y', 'z']})
df1['datetime'] = pd.to_datetime(df1['datetime'])
print('df1',df1)

df2 = pd.DataFrame({'datetime': ['3/10/2000 17:03:00', '3/10/2000 17:04:00', '3/10/2000 17:05:00'],
                'value1': [2, 3, 4],
                'value2': ['x', 'y', 'z'],
                'value3': ['p', 'd', 'q']})
df2['datetime'] = pd.to_datetime(df2['datetime'])
print('df2',df2)

df1 = df1.merge(df2[["datetime", "value3"]], on = "datetime", suffixes=("", ""))
print('modified df1',df1)
  • Related