Speeding up the graphics on Pentium Pro / Pentium II computers

Technical description

The Pentium Pro and Pentium II processors contain cache memory that speeds up access to frequently used data. The first-level cache resides on the processor die. The second-level cache is in the same package as the processor (Pentium Pro), on the processor cartridge (Pentium II), or absent altogether (older Celerons).

The basic principle of a cache is to keep copies of data from external memory. These redundant copies must be carefully synchronised; otherwise, for example, a DMA peripheral could read a stale copy from main memory because the new data exists only in the cache.

Several strategies are possible for this synchronisation. Either every write goes to the cache and to main memory at the same time (write-through), or the write to main memory is delayed until a more convenient time (write-back).

In some cases the memory is located on a device and is accessed over a device bus. A graphics card on a PCI or AGP bus is a good example. These buses achieve higher throughput when the data arrives in larger chunks that are transferred in one transaction; collecting individual writes into such chunks is called write combining. Operations that transfer large contiguous blocks (and are not performed locally by the accelerator on the device itself) can benefit from this setting.

Pentium Pro and Pentium II processors contain registers (MTRRs, Memory Type Range Registers) that specify the strategy for communication with external memory for a number of physical address ranges. The Linux operating system provides access to these registers from user space through the /proc/mtrr file.

Configuration

For this feature to be exploited you need the following:

For an example configuration (96 MB of main memory, a linear buffer at 0xE0000000 and 2 MB, i.e. 0x200000 bytes, of graphics memory), write-combining mode can be enabled using

  echo "base=0xe0000000 size=0x200000 type=write-combining" > /proc/mtrr

(naturally, only root can set this) and checked by reading the file back with cat /proc/mtrr:

  reg00: base=0x00000000 (   0MB), size=  64MB: write-back, count=1
  reg01: base=0x04000000 (  64MB), size=  32MB: write-back, count=1
  reg02: base=0xe0000000 (3584MB), size=   2MB: write-combining, count=1
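
The numbers in the command follow from simple arithmetic: 2 MB is 2 x 1024 x 1024 = 0x200000 bytes, and the base 0xe0000000 corresponds to 3584 MB, as the listing shows. The following is a minimal shell sketch of that calculation. The helper name mtrr_line is hypothetical, and the checks that the size is a power of two and that the base is a multiple of the size are assumptions about what the hardware requires; they are not stated in the text above.

```shell
#!/bin/sh
# Sketch: build the /proc/mtrr request line from a base address and a size in MB.
# mtrr_line is a hypothetical helper, not part of any standard tool.
mtrr_line() {
    base=$1        # physical base address, e.g. 0xe0000000
    size_mb=$2     # size of the region in megabytes, e.g. 2

    # 1 MB = 1024 * 1024 bytes, so 2 MB = 2097152 = 0x200000
    size=$(( $size_mb * 1024 * 1024 ))

    if [ "$size" -le 0 ]; then
        echo "size must be positive" >&2
        return 1
    fi
    # Assumption: the region size must be a power of two ...
    if [ $(( $size & ($size - 1) )) -ne 0 ]; then
        echo "size is not a power of two" >&2
        return 1
    fi
    # ... and the base must be a multiple of the size.
    if [ $(( $base % $size )) -ne 0 ]; then
        echo "base is not aligned to size" >&2
        return 1
    fi

    # Print the line in the format accepted by /proc/mtrr.
    printf 'base=%s size=0x%x type=write-combining\n' "$base" "$size"
}
```

Calling mtrr_line 0xe0000000 2 prints the same line that is written to /proc/mtrr above; the function only prints the line and does not touch the registers itself.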

If everything works normally, you can put the echo command shown above into a script that is run at boot time.
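
Such a boot script might look roughly like the sketch below. The names enable_wc and MTRR_FILE are hypothetical; the check before the write is a design choice of this sketch, so that running the script a second time does not request the same region again.

```shell
#!/bin/sh
# Sketch of a boot-time helper; enable_wc and MTRR_FILE are hypothetical names.
# MTRR_FILE can be overridden to try the function on an ordinary file.
MTRR_FILE=${MTRR_FILE:-/proc/mtrr}

enable_wc() {
    base=$1     # e.g. 0xe0000000
    size=$2     # e.g. 0x200000

    # Skip the write if a write-combining entry with this base is already
    # listed, so that the region is not requested a second time.
    if grep -qi "base=$base .*write-combining" "$MTRR_FILE"; then
        return 0
    fi

    # Same request line as in the manual command; this needs root.
    echo "base=$base size=$size type=write-combining" > "$MTRR_FILE"
}
```

With the values used in this text, the script would end with the call enable_wc 0xe0000000 0x200000.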

Results

I have used the described setting for more than a year on a 166 MHz Pentium Pro computer with a PCI S3 Trio64V+ graphics card. The X server is XF86_S3 version 3.3.3, and the graphics mode is 1440x1080 with 256 colors. The x11perf performance testing tool shows a speed-up of more than 50% for the following operations:

Ratio Operation
3.09 500x500 tiled rectangle (161x145 tile)
3.00 500x500 tiled rectangle (216x208 tile)
2.86 100x100 tiled rectangle (216x208 tile)
2.78 100x100 tiled rectangle (161x145 tile)
2.74 Copy 500x500 from pixmap to window
2.68 ShmPutImage 500x500 square
2.56 Copy 100x100 from pixmap to window
2.44 ShmPutImage 100x100 square
2.17 Fill 300x300 tiled trapezoid (4x4 tile)
2.09 Fill 300x300 tiled trapezoid (161x145 tile)
2.06 Fill 300x300 tiled trapezoid (216x208 tile)
1.82 Fill 100x100 tiled trapezoid (4x4 tile)
1.76 Fill 300x300 tiled trapezoid (17x15 tile)
1.66 Destroy window via parent (200 kids)
1.51 Unmap window via parent (16 kids)
1.51 PutImage 100x100 square

The following operations were at least 10% slower:

Ratio Operation
0.89 Create and map subwindows (4 kids)
0.88 Fill 10x10 tiled trapezoid (161x145 tile)
0.87 Fill 10x10 tiled trapezoid (216x208 tile)
0.84 Create unmapped window (200 kids)
0.82 Destroy window via parent (16 kids)
0.80 1x1 tiled rectangle (216x208 tile)
0.80 1x1 tiled rectangle (161x145 tile)
0.78 Unmap window via parent (4 kids)

The other operations (x11perf tests more than 300 of them) were neither noticeably faster nor noticeably slower. The statistical error is hard to estimate, but the most accelerated operations were indeed the ones that transfer large contiguous blocks between main memory and the graphics card.
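
The two lists can be picked out of a longer table of ratios with a small filter. This is only a sketch: classify_ratios is a hypothetical helper, and it assumes input lines of the form "ratio operation" as in the tables above, with 1.5 and 0.9 as the thresholds for "more than 50% faster" and "at least 10% slower".

```shell
#!/bin/sh
# Sketch: classify "ratio operation" lines by the thresholds used in the text.
# classify_ratios is a hypothetical helper; it reads lines such as
# "2.74 Copy 500x500 from pixmap to window" on standard input.
classify_ratios() {
    awk '
        # Skip header lines, blank lines and anything without a leading ratio.
        $1 !~ /^[0-9]+\.[0-9]+$/ { next }
        # Ratio of 1.5 or more: at least 50% faster.
        $1 >= 1.5 { print "faster: " $0; next }
        # Ratio of 0.9 or less: at least 10% slower.
        $1 <= 0.9 { print "slower: " $0; next }
    '
}
```

Lines with a ratio between the two thresholds produce no output.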

Bratislava, 14. 5. 1999

Stanislav Meduna
stano (AT) meduna.org