Understanding CPU pipeline stages vs. instruction throughput


I'm missing something fundamental about CPU pipelines: at a basic level, why do instructions take differing numbers of clock cycles to complete, and how can some instructions take only 1 cycle in a multi-stage CPU?

Besides the obvious answer of "different instructions require different amounts of work to complete", hear me out...

Consider an i7 with its approximately 14-stage pipeline. That takes 14 clock cycles to complete one run-through. AFAIK, that should mean the entire pipeline has a latency of 14 clocks. Yet this isn't the case.

An XOR completes in 1 cycle and has a latency of 1 cycle, indicating it doesn't go through all 14 stages. BSR has a latency of 3 cycles, but a throughput of 1 per cycle. AAM has a latency of 20 cycles (more than the stage count) and a throughput of 8 (on an Ivy Bridge).

Some instructions cannot be issued every clock, yet take fewer than 14 clocks to complete.

I know about multiple execution units. What I don't understand is how the length of instructions, in terms of latency and throughput, relates to the number of pipeline stages.

> I'm missing something fundamental about CPU pipelines: at a basic level, why do instructions take differing numbers of clock cycles to complete, and how can some instructions take only 1 cycle in a multi-stage CPU?

Because what we're interested in is the speed between instructions, not the start-to-end time of a single instruction.

> Besides the obvious answer of "different instructions require different amounts of work to complete", hear me out...

Well, that's the key answer to why different instructions have different latencies.

> Consider an i7 with its approximately 14-stage pipeline. That takes 14 clock cycles to complete one run-through. AFAIK, that should mean the entire pipeline has a latency of 14 clocks. Yet this isn't the case.

That is correct, though it's not a particularly meaningful number. For example, why would you care how long it takes before the CPU is entirely done with an instruction? That has essentially no effect.

> An XOR completes in 1 cycle and has a latency of 1 cycle, indicating it doesn't go through all 14 stages. BSR has a latency of 3 cycles, but a throughput of 1 per cycle. AAM has a latency of 20 cycles (more than the stage count) and a throughput of 8 (on an Ivy Bridge).

This is a bunch of misunderstandings. An XOR introduces 1 cycle of latency into a dependency chain. That is, if 12 instructions each modify the previous instruction's value, and you then add an XOR as the 13th instruction, the sequence will take 1 cycle more. That's what latency means.
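The distinction can be sketched with a toy scheduling model (an illustrative assumption, not a real CPU simulator): a dependent chain pays the full latency of each instruction, while independent instructions are limited only by the issue rate. Using the BSR numbers quoted above (latency 3, throughput 1 per cycle):

```python
def dependent_chain_time(latencies):
    """Cycles for a chain where each op consumes the previous op's result:
    every op must wait for its input, so latencies add up."""
    done = 0
    for lat in latencies:
        done += lat  # starts the moment its input is ready
    return done

def independent_time(latencies):
    """Cycles for independent ops issued one per cycle: limited by
    issue rate, with latencies overlapping in the pipeline."""
    done = 0
    for issue_cycle, lat in enumerate(latencies):
        done = max(done, issue_cycle + lat)
    return done

# Adding a 1-cycle XOR to a 12-op dependent chain costs exactly 1 cycle:
print(dependent_chain_time([1] * 12))       # 12
print(dependent_chain_time([1] * 12 + [1])) # 13

# Four BSR-like ops (latency 3, throughput 1/cycle):
print(dependent_chain_time([3] * 4))  # 12 -- latency-bound
print(independent_time([3] * 4))      # 6  -- throughput-bound
```

The same instruction mix finishes in half the time when the ops are independent, which is why latency and throughput are listed as two separate numbers.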

> Some instructions cannot be issued every clock, yet take fewer than 14 clocks to complete.

Right. So?
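Those two numbers really are independent, which a toy model (again, an illustration rather than a real CPU model) makes concrete. Using the AAM figures quoted above, latency 20 and a reciprocal throughput of 8 cycles, independent copies issue every 8 cycles and each finishes 20 cycles after it issues:

```python
def completion_times(n, issue_interval=8, latency=20):
    """Completion cycle of n independent copies of an instruction that
    can issue only every `issue_interval` cycles and produces its
    result `latency` cycles after issue."""
    return [i * issue_interval + latency for i in range(n)]

print(completion_times(4))  # [20, 28, 36, 44]
# Steady-state rate: one result every 8 cycles, even though each
# individual instruction takes 20 cycles from issue to result.
```

Neither number is the pipeline depth; they describe the issue rate and the result delay, and the 14 stages constrain neither.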

> I know about multiple execution units. What I don't understand is how the length of instructions, in terms of latency and throughput, relates to the number of pipeline stages.

They don't. Why should there be a connection? It's as if there were 14 extra stages at the beginning of the pipeline. Why would that affect latency or throughput at all? It just means everything happens 14 clock cycles later, but still happens at the same rate. (Though it does impact the cost of a mispredicted branch and a few other things.)
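The point that pipeline depth delays everything without slowing the rate can be shown with a minimal model of an in-order 14-stage pipeline (a sketch, assuming one instruction enters per cycle and each advances one stage per cycle):

```python
def completion_cycle(n, stages=14):
    """Cycle on which the n-th instruction (1-based) leaves a pipeline
    of the given depth: instruction i enters on cycle i and exits
    `stages` cycles later."""
    return n + stages - 1

first = completion_cycle(1)      # 14 -- the pipeline-fill delay
hundredth = completion_cycle(100)  # 113
print(first, hundredth)
# After the pipeline fills, one instruction retires per cycle:
print((hundredth - first) / 99)  # 1.0
```

Doubling the stage count would only shift every completion later by a constant; the steady-state rate of one retirement per cycle, and the latency each instruction contributes to a dependency chain, would be unchanged.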

