I'm new at coding and feel like to really understand it, I have to truly grasp the concepts.
Quality of life edit:
Why do we write df[df['col a'] == x] instead of just df['col a'] == x when searching a dataframe? I understand that the second expression looks for values in the column that equal x, but I'd love to know what wrapping it in df[] adds to the code.
I would love to know the difference between those two and what I am actually doing when I nest the column expression inside df[].
Any help is appreciated, thank you so much!
CodePudding user response:
We use df[df['col a'] == x] instead of just df['col a'] == x because the two expressions do different things. With df['col a'] == x on its own, you are essentially asking pandas for a boolean result: a Series of True/False values, one per row, showing where the condition is met. (You can try this on your df: without the outer df[], the expression only returns that Series of True and False.) So pandas first works out what is being compared, then says "you asked for x, here is a Series of True/False based on what you asked", and finally, because of the outer df[], "you asked for only the rows where the Series is True, here is the dataframe that contains only those rows".
Does that help clear up what it is doing? As you learn more, you can also combine several conditions, for example df[(df['col a'] == x) & (df['col b'] == y)], which would be hard to express if you only wrote df['col a'] == x for your search.
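To make the two steps concrete, here is a minimal sketch. The data is made up for illustration; only the column names 'col a' and 'col b' come from the question, and the names mask, filtered and both are just placeholders.

```python
import pandas as pd

# Hypothetical data; the column names mirror the ones used in the question.
df = pd.DataFrame({"col a": [1, 2, 1, 3], "col b": ["p", "q", "q", "p"]})

# Step 1: the comparison alone gives one True/False value per row.
mask = df["col a"] == 1

# Step 2: passing that mask to df[...] keeps only the rows marked True.
filtered = df[mask]

# Combining two conditions: each comparison needs its own parentheses,
# because & binds more tightly than ==.
both = df[(df["col a"] == 1) & (df["col b"] == "q")]

print(mask)
print(filtered)
print(both)
```

Storing the mask in a variable first is optional, but it makes it easier to inspect what the condition matched before filtering.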
CodePudding user response:
In general, df[index] selects part of a dataframe based on whatever is passed as the index.
Pandas supports several different indexing methods. The expression in your question chains two of them together. First, the inner index df['col a'] selects all values in column col a. These are compared in a boolean expression that returns a Series acting as a "mask": True where the values in the column meet the condition and False elsewhere. The outer part then uses boolean indexing to select all rows in the entire dataframe where the mask is True.
Example:
import pandas as pd
df = pd.DataFrame({'column1': ['a', 'b', 'c', 'd', 'e'], 'column2': ['x', 'x', 'x', 'y', 'y']})
[In] df
[Out]
  column1 column2
0       a       x
1       b       x
2       c       x
3       d       y
4       e       y
Selecting a single column:
[In] df['column2']
[Out]
0    x
1    x
2    x
3    y
4    y
Name: column2, dtype: object
Creating a mask:
[In] df['column2'] == 'x'
[Out]
0     True
1     True
2     True
3    False
4    False
Name: column2, dtype: bool
Selecting all rows that have value x in column column2:
[In] df[df['column2'] == 'x']
[Out]
  column1 column2
0       a       x
1       b       x
2       c       x
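For anyone who wants to run the example as a script rather than in an interactive session, here is a rough sketch. The variable names (mask, selected, selected_loc, rest) are my own. It also shows two related forms that are not in the walkthrough above: df.loc[mask], which selects the same rows when given a boolean mask, and ~mask, which inverts the mask.

```python
import pandas as pd

df = pd.DataFrame({
    "column1": ["a", "b", "c", "d", "e"],
    "column2": ["x", "x", "x", "y", "y"],
})

# Boolean mask: True where column2 equals 'x'.
mask = df["column2"] == "x"

# Boolean indexing with plain brackets, and the same selection via .loc.
selected = df[mask]
selected_loc = df.loc[mask]

# ~ flips True and False, so this keeps the rows that did NOT match.
rest = df[~mask]

print(selected)
print(selected.equals(selected_loc))
print(rest)
```

Both forms return the matching rows with their original index labels; .loc becomes useful when you also want to pick specific columns in the same step.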