I'm trying to extract information from certain rows in this big dataframe.
When I use slicing to subset the table (e.g. blast_output_scored.iloc[10:11,:]), the output looks like this:
qseqid | sseqid | %_identity | alignment_length | mismatch | gapopen | qstart | qend | sstart | send | evalue | bitscore | subject_strand | line_in_og_BLAST | Needle_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IDgene.1 | 1 | 100.0 | 1073 | 0 | 0 | 1 | 1073 | 7704 | 6632 | 0.0 | 1982.0 | minus | 10 | 5360.0 |
When I check the number of rows in this table, I get the correct number with slicing:
len(blast_output_scored.iloc[10:11,:].index)
#Output
1
However, when I use indexing (blast_output_scored.iloc[10,:]), the output looks completely different, even though the index points to the same row as the slice.
sseqid 1
%_identity 100.0
alignment_length 1073
mismatch 0
gapopen 0
qstart 1
qend 1073
sstart 7704
send 6632
evalue 0.0
bitscore 1982.0
subject_strand minus
line_in_og_BLAST 10
Needle_score 5360.0
Name: IDgene.1, dtype: object
The number of rows now seems to change to the number of columns minus the first column (the first column is also set as the index for selecting rows by name):
len(blast_output_scored.iloc[10,:].index)
#Output
14
My biggest problem is that I'm using the names in the first column to index, and I have to check which names subset the table to a length of 1, so I can't just use the slicing method to bypass this.
e.g. blast_output_scored.loc["IDgene.1"]
outputs
sseqid 1
%_identity 100.0
alignment_length 1073
mismatch 0
gapopen 0
qstart 1
qend 1073
sstart 7704
send 6632
evalue 0.0
bitscore 1982.0
subject_strand minus
line_in_og_BLAST 10
Needle_score 5360.0
Name: IDgene.1, dtype: object
and will say I have 14 rows when I should only have 1.
Is there any way to ensure the output looks like the slicing output in pandas?
CodePudding user response:
When you slice, i.e. index with an interval, you get a DataFrame back because the result (formally, at least) has multiple rows and multiple columns.
Take a look at
type(blast_output_scored.iloc[10:11,:])
It's a pandas.DataFrame.
Now let's look at:
type(blast_output_scored.iloc[10,:])
It's a pandas.Series.
A DataFrame and a Series are displayed quite differently in a notebook. The two types are closely related but not identical, so it's good that the display reminds us they are not the same thing.
When indexing with 10, you get the single row at position 10.
You get this as a Series. It's a one-dimensional data structure that has an index and a sequence of values.
Since it has an index, it can work very similarly to a DataFrame but with fewer degrees of freedom. Since it has one index and one value per unit of length, it's also vaguely similar to a dictionary or a mapping if you squint: keys (index) and values.
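To make the difference concrete, here is a minimal sketch on a made-up three-row table. The name df and all of the values are assumptions for illustration, not the asker's actual data. It shows why len() reports the number of columns once the row has collapsed into a Series.

```python
import pandas as pd

# Hypothetical miniature of the table from the question; the values are invented.
df = pd.DataFrame(
    {
        "sseqid": [1, 2, 3],
        "bitscore": [1982.0, 150.5, 88.0],
        "subject_strand": ["minus", "plus", "plus"],
    },
    index=pd.Index(["IDgene.1", "IDgene.2", "IDgene.3"], name="qseqid"),
)

as_frame = df.iloc[0:1, :]   # slice: the result stays two-dimensional
as_series = df.iloc[0, :]    # scalar position: the row collapses to one dimension

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)

# len() counts entries along the first axis, so the Series has one entry per column.
print(len(as_frame), len(as_series))

# Dictionary-like access: the former column names are now the keys.
print(as_series["bitscore"], as_series.to_dict()["subject_strand"])
```

The row label survives as the name of the Series (as_series.name), which is what the "Name: IDgene.1" line in the question's output shows.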
There are exceptions, and maybe you'd be happier if you didn't know about them.
If the index of the original dataframe is non-unique and you index using .loc[], there might be several rows that have the same index label 10(!). In that case, indexing with 10 gives you a DataFrame, since the result now has two dimensions: multiple rows and multiple columns.
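For the asker's case (checking which names match exactly one row), one possible approach is to pass the label inside a list, since .loc with a list of labels returns a DataFrame regardless of how many rows match. A small sketch follows; the table and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical table with a repeated row label; the values are invented.
df = pd.DataFrame(
    {"sseqid": [1, 1, 2], "bitscore": [1982.0, 310.0, 150.5]},
    index=pd.Index(["IDgene.1", "IDgene.1", "IDgene.2"], name="qseqid"),
)

dup = df.loc["IDgene.1"]    # label occurs twice: a DataFrame with 2 rows
uniq = df.loc["IDgene.2"]   # label occurs once: a Series

# Wrapping the label in a list keeps the result a DataFrame either way,
# so len() is always the number of matching rows.
dup_rows = df.loc[["IDgene.1"]]
uniq_rows = df.loc[["IDgene.2"]]
print(len(dup_rows), len(uniq_rows))

# Alternative: count how often each label occurs, without subsetting at all.
counts = df.index.value_counts()
single_hit_labels = counts[counts == 1].index.tolist()
print(single_hit_labels)
```

Counting labels with df.index.value_counts() avoids building a subset per name, which may matter on a large table.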
CodePudding user response:
I believe this is why df.squeeze() is a thing. That way you can easily force a single-row result to a Series, and design your program to always expect a Series.
Example:
df.iloc[0,:].squeeze()
df.iloc[0:1,:].squeeze()
# Both output:
sseqid 1
%_identity 100.0
alignment_length 1073
mismatch 0
gapopen 0
qstart 1
qend 1073
sstart 7704
send 6632
evalue 0.0
bitscore 1982.0
subject_strand minus
line_in_og_BLAST 10
Needle_score 5360.0
Name: IDgene.1, dtype: object
If we want a dataframe, we can force that as well, but it's a bit more complicated:
x = df.iloc[0, :].squeeze()
y = df.iloc[0:1,:].squeeze()
for d in [x, y]:
    print(pd.DataFrame(d).T)
# output:
sseqid %_identity alignment_length mismatch gapopen qstart qend sstart send evalue bitscore subject_strand line_in_og_BLAST Needle_score
IDgene.1 1 100.0 1073 0 0 1 1073 7704 6632 0.0 1982.0 minus 10 5360.0
sseqid %_identity alignment_length mismatch gapopen qstart qend sstart send evalue bitscore subject_strand line_in_og_BLAST Needle_score
IDgene.1 1 100.0 1073 0 0 1 1073 7704 6632 0.0 1982.0 minus 10 5360.0
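Two caveats are worth sketching here, again on a made-up table with invented names and values. First, squeeze() only collapses axes of length 1, so a selection with several rows comes back unchanged as a DataFrame. Second, Series.to_frame().T is another way to turn a single row back into a one-row DataFrame.

```python
import pandas as pd

# Hypothetical two-row table; the values are invented.
df = pd.DataFrame(
    {"sseqid": [1, 2], "bitscore": [1982.0, 150.5]},
    index=pd.Index(["IDgene.1", "IDgene.2"], name="qseqid"),
)

# axis=0 squeezes only the row axis, so a 1x1 selection would not turn into a scalar.
one_row = df.iloc[0:1, :].squeeze(axis=0)    # one row: becomes a Series
two_rows = df.iloc[0:2, :].squeeze(axis=0)   # two rows: still a DataFrame

print(type(one_row).__name__, type(two_rows).__name__)

# Going the other way: a row Series back to a one-row DataFrame.
row = df.iloc[0, :]
back = row.to_frame().T
print(back.shape, back.index.tolist())
```

So code that calls squeeze() and then assumes a Series still needs a check for the multi-row case.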