I am trying to evaluate the speed of simultaneous threads, and I don't understand the result. It's as if there is a lock somewhere. I am running the following on a Dell 3571 with a 20 core/thread i9:
unit Unit1;
interface

uses
  Winapi.Windows, Winapi.Messages, System.SysUtils, System.Variants, System.Classes, Vcl.Graphics,
  Vcl.Controls, Vcl.Forms, Vcl.Dialogs, Vcl.StdCtrls;

type
  TMyThread = class(TTHread)
  public
    procedure Execute; override;
  end;

  TForm1 = class(TForm)
    Memo1: TMemo;
    procedure FormCreate(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
    procedure Log(Sender: TMyThread; Log: string);
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.Log(Sender: TMyThread; Log: string);
begin
  Memo1.Lines.add(Log);
end;

procedure TForm1.FormCreate(Sender: TObject);
var
  Thr: array[0..19] of TMyThread;
begin
  for var t := 0 to 10 do
  begin
    var Thread := TMyThread.Create(True);
    Thr[t] := Thread;
    Thread.Priority := TPHigher;
  end;
  for var t := 0 to 10 do
    Thr[t].Resume;
end;

{ MyThread }

procedure TMyThread.Execute;
begin
  sleep(500);
  try
    var ii: nativeint;
    var Start := GetTickCount;
    for var i := 0 to 750000000 do
      inc(ii);
    var Delta := (GetTickCount - Start);
    Synchronize(
      procedure
      begin
        Form1.Log(Self, Format( 'Done Loading : %dms', [Delta]) );
      end
    );
  except
    asm nop; end;
  end;
end;

end.
When I run this with 1 thread, one calculation takes 320 ms.
When I run it with 10 threads, I get:
Done Loading : 344ms
Done Loading : 375ms
Done Loading : 391ms
Done Loading : 422ms
Done Loading : 438ms
Done Loading : 469ms
Done Loading : 469ms
Done Loading : 469ms
Done Loading : 516ms
Done Loading : 531ms
Shouldn't all the results be almost the same, at about 320 ms?
PS: I have also tried the Windows CreateThread function and ITask, with the same result whatever the number of threads.
Any idea? Thank you.
CodePudding user response:
You are spinning up 10 threads (in addition to the main thread running the application), but the only thing you tell Windows about scheduling them is that their priority is "Higher". All that priority establishes is that your 10 threads compete at the same priority level when the Windows scheduler allocates time slices on a CPU/core.
Unless told otherwise, the scheduler will look at many factors to determine which core/CPU to schedule any thread on at any given point in time.
As a result, each thread could find itself being switched from one core to another on each timeslice, incurring a relatively expensive "context switch" each time. Or it may be scheduled on the same core it is already on. Threads that are consistently scheduled on the same core will perform "better" than any threads doing the same work with the overhead of numerous context switches.
To ensure that threads consistently run on the same core, avoiding potentially costly context switching, you need to set the processor affinity of each thread. This is done with the Windows API function SetThreadAffinityMask.
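As a minimal sketch of that idea, assuming a small console test program rather than the VCL form from the question: TPinnedThread and PinToCore are hypothetical names used only for illustration, while SetThreadAffinityMask, TThread.Handle and TThread.ProcessorCount are existing API. The mask is a bit field in which bit N selects logical processor N, so this simple form only addresses the logical processors of a single processor group.

```pascal
program AffinityDemo;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils, System.Classes;

type
  // Hypothetical worker, used only for this illustration.
  TPinnedThread = class(TThread)
  protected
    procedure Execute; override;
  public
    ElapsedMs: Cardinal;
  end;

procedure TPinnedThread.Execute;
var
  i, Counter: NativeInt;
  Start: Cardinal;
begin
  // Same synthetic workload as in the question.
  Counter := 0;
  Start := GetTickCount;
  for i := 0 to 750000000 do
    Inc(Counter);
  ElapsedMs := GetTickCount - Start;
end;

// Hypothetical helper: restrict AThread to the single logical processor ACore.
procedure PinToCore(AThread: TThread; ACore: Integer);
var
  Mask: DWORD_PTR;
begin
  Mask := DWORD_PTR(1) shl ACore;
  // SetThreadAffinityMask returns the previous mask, or 0 on failure.
  if SetThreadAffinityMask(AThread.Handle, Mask) = 0 then
    RaiseLastOSError;
end;

var
  Workers: array of TPinnedThread;
  i, Count: Integer;
begin
  // Assumption: at most 10 workers, and never more than there are logical processors.
  Count := TThread.ProcessorCount;
  if Count > 10 then
    Count := 10;
  SetLength(Workers, Count);

  // Create every thread suspended and pin it before any of them starts running.
  for i := 0 to Count - 1 do
  begin
    Workers[i] := TPinnedThread.Create(True);
    PinToCore(Workers[i], i);
  end;

  for i := 0 to Count - 1 do
    Workers[i].Start;

  for i := 0 to Count - 1 do
  begin
    Workers[i].WaitFor;
    Writeln(Format('Logical processor %d: %d ms', [i, Workers[i].ElapsedMs]));
    Workers[i].Free;
  end;
end.
```

Pinning before calling Start means no thread ever runs on a processor other than the one chosen for it. Logical processor numbers do not necessarily map one-to-one to physical cores, so timings may still differ between threads.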
But It's More Complicated Than That
Whilst you can contrive to schedule each thread on a separate, consistent core, what you can't (so easily) control is what else Windows decides to schedule on each core. There will therefore still be some variability in performance between threads performing ostensibly the same work, depending on what else the core they are assigned to is doing.
If you are embarking on a project intended to extract maximum performance from a system via threading on a range of different hardware configurations (dual/quad/hexa/octa/more-core systems), be aware that the ideal configuration of your threads will vary across those different configurations.
This is particularly true when you develop real workloads to be performed by your threads rather than synthetic metrics-gathering workloads.
You will either need to devise heuristic techniques to adapt the configuration dynamically or provide some mechanism for the software to be manually configured to "tune" performance (or both).
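A minimal sketch of the "manual configuration with a sensible default" approach, under stated assumptions: ChooseWorkerCount is a hypothetical helper, and treating the first command-line parameter as the worker count is a convention invented for this example. TThread.ProcessorCount, ParamStr and StrToIntDef are existing RTL members.

```pascal
program ThreadCountDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

// Hypothetical helper: a manual override wins, otherwise fall back to the
// number of logical processors reported by the RTL.
function ChooseWorkerCount: Integer;
begin
  // Assumption for this sketch: the first command-line parameter, if present
  // and a positive integer, is the desired worker count.
  // ParamStr(1) is an empty string when no parameter was passed.
  Result := StrToIntDef(ParamStr(1), 0);
  if Result <= 0 then
    Result := TThread.ProcessorCount;
end;

begin
  Writeln(Format('Logical processors: %d', [TThread.ProcessorCount]));
  Writeln(Format('Workers to start:   %d', [ChooseWorkerCount]));
end.
```

Running the program with no parameter uses the detected processor count; running it with, say, 8 as the first parameter forces 8 workers, which makes it easy to compare configurations on the same machine.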
CodePudding user response:
Your problem is the Synchronize call. If you remove it and store the timing results in a buffer instead, like this:
interface

uses
  Winapi.Windows, Winapi.Messages, System.SysUtils, System.Variants, System.Classes, Vcl.Graphics,
  Vcl.Controls, Vcl.Forms, Vcl.Dialogs, Vcl.StdCtrls;

type
  TMyThread = class(TThread)
  public
    ThrNumber: NativeInt; // index of this thread's slot in ThrResults
    procedure Execute; override;
  end;

  TForm1 = class(TForm)
    Memo1: TMemo;
    procedure FormCreate(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
    procedure Log(Log: string);
  end;

const
  CNThr = 15; // highest thread index, so CNThr + 1 = 16 threads run

var
  Form1: TForm1;
  ThrCounter: NativeInt = 0;                 // number of threads that have entered Execute
  ThrResults: array [0..CNThr] of NativeInt; // elapsed time per thread, in ms

implementation

{$R *.dfm}

procedure TForm1.Log(Log: string);
begin
  Memo1.Lines.Add(Log);
end;

procedure TForm1.FormCreate(Sender: TObject);
var
  Thr: array[0..CNThr] of TMyThread;
begin
  // Create all threads suspended so that they can be released together.
  for var t := 0 to CNThr do
  begin
    var Thread := TMyThread.Create(True);
    Thr[t] := Thread;
    Thread.ThrNumber := t;
    Thread.Priority := tpHigher;
  end;
  for var t := 0 to CNThr do
    Thr[t].Start; // Start replaces the deprecated Resume
  // Crude wait: blocks the main thread for 10 seconds while the workers run.
  Sleep(10000);
  for var t := 0 to CNThr do
    Log(Format('Done Loading : %dms', [ThrResults[t]]));
end;
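
{ MyThread }

procedure TMyThread.Execute;
begin
  // Start barrier: register this thread, then spin until all CNThr + 1 threads
  // have registered, so that every thread starts timing at about the same moment.
  AtomicIncrement(ThrCounter);
  while ThrCounter <= CNThr do ;
  try
    var ii: NativeInt := 0;
    var Start := GetTickCount;
    for var i := 0 to 750000000 do
      Inc(ii);
    var Delta := (GetTickCount - Start);
    // Each thread writes only to its own slot, so no locking is needed here.
    ThrResults[ThrNumber] := Delta;
  except
    asm nop; end; // placeholder for a breakpoint; exceptions are swallowed
  end;
end;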
After 10 seconds you will get results like these for 16 threads (my CPU is slower):
Done Loading : 1313ms
Done Loading : 1266ms
Done Loading : 1313ms
Done Loading : 1297ms
Done Loading : 1328ms
Done Loading : 1344ms
Done Loading : 1297ms
Done Loading : 1313ms
Done Loading : 1282ms
Done Loading : 1235ms
Done Loading : 1328ms
Done Loading : 1375ms
Done Loading : 1297ms
Done Loading : 1266ms
Done Loading : 1360ms
Done Loading : 1391ms