Why does indexing and slicing seem to change the way the dataframe looks in pandas

Time:06-26

I'm trying to extract information from certain rows in this big dataframe.

When I use slicing to subset the table (e.g. blast_output_scored.iloc[10:11,:]), the output looks like this:

qseqid   sseqid %_identity alignment_length mismatch gapopen qstart  qend sstart  send evalue bitscore subject_strand line_in_og_BLAST Needle_score
IDgene.1      1      100.0             1073        0       0      1  1073   7704  6632    0.0   1982.0          minus               10       5360.0

When I check the number of rows in this table, I get the correct number with slicing:

len(blast_output_scored.iloc[10:11,:].index)

#Output
1

However, when I use indexing (blast_output_scored.iloc[10,:]), the output looks completely different, even though it selects the same row as the slice.

sseqid                   1
%_identity           100.0
alignment_length      1073
mismatch                 0
gapopen                  0
qstart                   1
qend                  1073
sstart                7704
send                  6632
evalue                 0.0
bitscore            1982.0
subject_strand       minus
line_in_og_BLAST        10
Needle_score        5360.0
Name: IDgene.1, dtype: object

The number of rows in the table now seems to change to the number of columns minus the first column (the first column is also set as the index, so rows can be selected by name):

len(blast_output_scored.iloc[10,:].index)

#Output
14

My biggest problem is that I'm using the row names (the values in that first column) to index, and I have to check which names subset the table to a length of 1, so I can't just use the slicing method to bypass this.

e.g. blast_output_scored.loc["IDgene.1"] outputs

sseqid                   1
%_identity           100.0
alignment_length      1073
mismatch                 0
gapopen                  0
qstart                   1
qend                  1073
sstart                7704
send                  6632
evalue                 0.0
bitscore            1982.0
subject_strand       minus
line_in_og_BLAST        10
Needle_score        5360.0
Name: IDgene.1, dtype: object

and will say I have 14 rows when I should only have 1.

Is there any way to ensure the output looks like the slicing output in pandas?

CodePudding user response:

When you slice using an interval, you get a DataFrame back because the result (formally, at least) can have multiple rows and multiple columns.

Take a look at

type(blast_output_scored.iloc[10:11,:])

It's a pandas.DataFrame.

Now let's look at:

type(blast_output_scored.iloc[10,:])

It's a pandas.Series.

A DataFrame and a Series are displayed quite differently in a notebook. The two types behave similarly in many ways, but they are not the same thing, so the different display is a useful reminder.

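When indexing with 10, you get the single row at position 10 (.iloc is position-based). You get this as a Series, a one-dimensional data structure that has an index and a sequence of values. The index of that Series holds the original column names, which is why len(...index) reports 14 instead of 1.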

Since it has an index, it can work very similarly to a DataFrame but with fewer degrees of freedom. Since it has one index and one value per unit of length, it's also vaguely similar to a dictionary or a mapping if you squint: keys (index) and values.
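To make the difference concrete, here is a minimal sketch. The small `df` below is a hypothetical stand-in for blast_output_scored with only three of its columns; the names and values are borrowed from the question for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for blast_output_scored, indexed by qseqid.
df = pd.DataFrame(
    {
        "sseqid": [1, 2, 3],
        "bitscore": [1982.0, 950.0, 410.5],
        "subject_strand": ["minus", "plus", "plus"],
    },
    index=pd.Index(["IDgene.1", "IDgene.2", "IDgene.3"], name="qseqid"),
)

as_frame = df.iloc[0:1, :]   # slice -> DataFrame with one row
as_series = df.iloc[0, :]    # single position -> Series

print(type(as_frame).__name__, as_frame.shape)    # DataFrame (1, 3)
print(type(as_series).__name__, as_series.shape)  # Series (3,)

# .index means "row labels" for the DataFrame, but for the Series it
# holds the original column names, so its length is the column count.
print(len(as_frame.index))   # 1
print(len(as_series.index))  # 3

# The row label survives as the name of the Series.
print(as_series.name)        # IDgene.1
```

So the data is the same in both cases; only the container (and therefore what `.index` refers to) changes.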


There are exceptions and maybe you'd be happier if you didn't know about them.

If the index of the original dataframe is non-unique, and you index using .loc[], there might be several rows that have the same index label 10(!). In that case, indexing with 10 gives you a DataFrame instead, since the result suddenly has two dimensions: multiple rows and multiple columns.
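A small sketch of that exception, using a hypothetical frame in which the label "IDgene.1" appears twice. It also shows two options that may help with the original problem of checking which names match exactly one row: passing the label inside a list, which returns a DataFrame regardless of how many rows match, and counting the labels directly on the index.

```python
import pandas as pd

# Hypothetical data: "IDgene.1" is a duplicated row label.
df = pd.DataFrame(
    {"sseqid": [1, 1, 2], "bitscore": [1982.0, 640.0, 950.0]},
    index=pd.Index(["IDgene.1", "IDgene.1", "IDgene.2"], name="qseqid"),
)

dup = df.loc["IDgene.1"]      # label matches two rows -> DataFrame
single = df.loc["IDgene.2"]   # label matches one row  -> Series
print(type(dup).__name__, type(single).__name__)  # DataFrame Series

# A list of labels always gives a DataFrame, so len() counts rows.
dup_frame = df.loc[["IDgene.1"]]
single_frame = df.loc[["IDgene.2"]]
print(len(dup_frame), len(single_frame))  # 2 1

# Names that select exactly one row, without subsetting at all.
counts = df.index.value_counts()
unique_names = sorted(counts[counts == 1].index)
print(unique_names)  # ['IDgene.2']
```

With the list form, `len(df.loc[[name]])` is the number of matching rows in every case, so the check does not depend on which type comes back.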

CodePudding user response:

I believe this is why df.squeeze() is a thing. That way you can easily force a single-row result to a Series, and design your program to always expect a Series.

Example:

df.iloc[0,:].squeeze()
df.iloc[0:1,:].squeeze()

# Both output:

sseqid                   1
%_identity           100.0
alignment_length      1073
mismatch                 0
gapopen                  0
qstart                   1
qend                  1073
sstart                7704
send                  6632
evalue                 0.0
bitscore            1982.0
subject_strand       minus
line_in_og_BLAST        10
Needle_score        5360.0
Name: IDgene.1, dtype: object
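One caveat worth keeping in mind: squeeze() only removes axes of length 1, so what comes back depends on the shape going in. The sketch below uses a hypothetical three-column frame to show the cases; it is an illustration, not a description of the asker's data.

```python
import pandas as pd

# Hypothetical frame, indexed by qseqid.
df = pd.DataFrame(
    {
        "sseqid": [1, 2],
        "bitscore": [1982.0, 950.0],
        "subject_strand": ["minus", "plus"],
    },
    index=pd.Index(["IDgene.1", "IDgene.2"], name="qseqid"),
)

one_row = df.iloc[0:1, :].squeeze()      # 1 x 3 -> Series
two_rows = df.iloc[0:2, :].squeeze()     # 2 x 3 -> unchanged DataFrame
one_cell = df.iloc[0:1, 0:1].squeeze()   # 1 x 1 -> plain scalar

# Restricting squeeze to the row axis keeps a 1 x 1 result as a Series.
one_cell_series = df.iloc[0:1, 0:1].squeeze(axis="index")

print(type(one_row).__name__)          # Series
print(type(two_rows).__name__)         # DataFrame
print(type(one_cell_series).__name__)  # Series
```

If the subset can contain more than one row, the code after squeeze() still has to handle a DataFrame.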

If we want a DataFrame, we can force that as well, but it's a bit more complicated:

x = df.iloc[0, :].squeeze()
y = df.iloc[0:1,:].squeeze()
for d in [x, y]:
    print(pd.DataFrame(d).T)

# output:

         sseqid %_identity alignment_length mismatch gapopen qstart  qend sstart  send evalue bitscore subject_strand line_in_og_BLAST Needle_score
IDgene.1      1      100.0             1073        0       0      1  1073   7704  6632    0.0   1982.0          minus               10       5360.0
         sseqid %_identity alignment_length mismatch gapopen qstart  qend sstart  send evalue bitscore subject_strand line_in_og_BLAST Needle_score
IDgene.1      1      100.0             1073        0       0      1  1073   7704  6632    0.0   1982.0          minus               10       5360.0
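A possible alternative, sketched on a hypothetical frame: Series.to_frame().T does the same job as pd.DataFrame(d).T, and selecting with a one-element list skips the round trip through a Series entirely. The second option also keeps the original column dtypes, whereas a row pulled out as a Series of mixed types is stored as object.

```python
import pandas as pd

# Hypothetical frame with mixed column types, indexed by qseqid.
df = pd.DataFrame(
    {
        "sseqid": [1, 2],
        "bitscore": [1982.0, 950.0],
        "subject_strand": ["minus", "plus"],
    },
    index=pd.Index(["IDgene.1", "IDgene.2"], name="qseqid"),
)

row = df.iloc[0, :]               # Series (object dtype: mixed values)

via_transpose = row.to_frame().T  # Series -> one-column frame -> one-row frame
via_list = df.iloc[[0], :]        # list of positions -> one-row DataFrame

print(via_transpose.shape, via_list.shape)  # (1, 3) (1, 3)

# Compare the column dtypes of the two results.
print(via_transpose.dtypes)
print(via_list.dtypes)
```

Both results display as a one-row table labelled IDgene.1; the list form is usually the simpler choice when the dtypes matter for later calculations.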