Run time and reported cycle counts in linux perf

I have run a single-thread matrix multiplication on a 4-core Intel CPU (1 thread per core), but the numbers from perf don't make sense.

 Performance counter stats for 'system wide':

    31,728,397,287      cpu-cycles                #    0.462 GHz                    
   131,661,730,104      ref-cycles                # 1916.425 M/sec                  
         68,701.58 msec cpu-clock                 #    4.000 CPUs utilized          
         68,701.90 msec task-clock                #    4.000 CPUs utilized          
    31,728,553,882      cpu/cpu-cycles/           #  461.830 M/sec                  

      17.176244725 seconds time elapsed

I set the CPU frequency to the minimum and verified that all cores were running at 800 MHz, so one cycle takes 1.25 ns. With 31,728,397,287 total cpu-cycles, the execution time should be 31,728,397,287 × 1.25 ns ≈ 39.66 seconds, but the measured run time is only 17.1 seconds.

I also don't know why perf prints 0.462 GHz next to cpu-cycles.

More information about the processor:

Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           94
Model name:                      Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz
Stepping:                        3
CPU MHz:                         800.022
CPU max MHz:                     3900,0000
CPU min MHz:                     800,0000

Any thoughts about that?

UPDATE:

I reran the experiment with root access, this time also counting user-space-only cycles (cycles:u).

# perf stat -a -e cycles:u,cycles,cpu-clock  ./mm_double_omp 1
Using 1 threads
Total execution Time in seconds: 15.4839418610
MM execution Time in seconds: 15.3758427450

 Performance counter stats for 'system wide':

    14,237,521,876      cycles:u                  #    0.230 GHz                    
    17,470,220,108      cycles                    #    0.282 GHz                    
         61,974.41 msec cpu-clock                 #    4.000 CPUs utilized          

      15.494002570 seconds time elapsed

As you can see, the frequency is still not 800 MHz. However, if I don't specify -a, the result makes sense: cycles:u * (1/800 MHz) is nearly the same as the elapsed time.

# perf stat -e cycles:u,cycles,cpu-clock  ./mm_double_omp 1
Using 1 threads
Total execution Time in seconds: 16.5347361100
MM execution Time in seconds: 16.4267430900
 Performance counter stats for './mm_double_omp 1':

    13.135.516.694      cycles:u                  #    0,794 GHz                    
    13.201.778.987      cycles                    #    0,798 GHz                    
         16.541,22 msec cpu-clock                 #    1,000 CPUs utilized          

      16,544487905 seconds time elapsed

      16,522146000 seconds user
       0,019997000 seconds sys
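
That checks out: 13,135,516,694 cycles:u × 1.25 ns/cycle ≈ 16.42 s, close to the 16.54 s elapsed time.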

CodePudding user response:

I have run a single thread matrix multiplication

But then you counted system-wide across all 4 cores for the time when it was running. That's what perf stat -a does, and it's why you got Performance counter stats for 'system wide': in your output.

So the kernel's task-clock "event" was counting time across all 4 cores whenever your process was active:

68,701.90 msec task-clock # 4.000 CPUs utilized

That's 68.7 CPU-seconds, almost exactly 4x 17.17 s, which makes sense.


I also don't know why 0.462 GHz is written in front of cpu-cycles.

That's the average cycles/time across all 4 cores during the time your code was active. The clock doesn't tick for CPUs that are asleep, so it seems the load average was 4 * 462/800 = 2.31 CPUs not asleep during the time your code was keeping 1 core busy.

So you were benchmarking while your system was running another 1.3 threads on average (including CPU time spent handling interrupts).
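
If you want to see where that extra activity came from, perf stat can print per-CPU counts instead of aggregating them: with -A (--no-aggr), each core gets its own line, so you can spot cores that were busy with something other than your benchmark. For example (reusing your command; the event list here is just illustrative):

# perf stat -a -A -e cycles,task-clock ./mm_double_omp 1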

I'm assuming your measurements were not limited to user-space-only (excluding ticks spent in kernel mode); that's another possible effect.
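
One way to check is to count user-mode and kernel-mode cycles separately with event modifiers (counting kernel cycles needs root or a permissive perf_event_paranoid setting; see below):

# perf stat -e cycles:u,cycles:k,cycles ./mm_double_omp 1

If cycles:k is a noticeable fraction of cycles, the gap between cycles:u and your wall-clock expectation is kernel time.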


If you hadn't used -a, a lower frequency than the hardware was actually running at can still happen if you count only user-space cycles, not cycles spent in the kernel. That happens with perf stat --all-user, or with a kernel.perf_event_paranoid setting high enough to stop you profiling HW events in the kernel, which effectively applies :u to hardware events without printing them as cycles:u.
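
Checking and relaxing that setting is a one-liner each (standard sysctl usage):

$ cat /proc/sys/kernel/perf_event_paranoid
# sysctl -w kernel.perf_event_paranoid=0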

If you change your kernel.perf_event_paranoid sysctl setting to 0, or run perf as root, you should see exactly 800 MHz as cycles / task-clock, since that's the speed the CPU was running when your threads were running.
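
Your update already shows this: in the run without -a, 13,201,778,987 cycles / 16.54122 s of cpu-clock ≈ 798 MHz, i.e. the 0,798 GHz perf printed, essentially your pinned 800 MHz.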

A lot of cycles outside user-space would indicate that your process spends significant time in the kernel, probably handling page faults if you're using a big array.
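
You can test that theory directly; page faults are standard perf software events:

# perf stat -e page-faults,minor-faults,major-faults ./mm_double_omp 1

A large minor-faults count would fit first-touch page faults on a big, freshly allocated matrix.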
