Author Topic: Snes9x 1.53 and Git: slow, as I expected  (Read 2960 times)

Nebuleon

  • Guest
Snes9x 1.53 and Git: slow, as I expected
« on: February 27, 2014, 09:01:50 pm »
So, as I wrote in the Releases/PocketSNES based on Snes9x 1.43-dev thread, I have been trying a port of the Git version of Snes9x onto PocketSNES.

It does fix issues with some games, but only by virtue of being more accurate; there's no single "fix" that I could backport for particular games, for example for the audio troubles in Secret of Mana when it is selected as the second ROM, or in ActRaiser and Earthworm Jim. So more games run, and more games run well. But because it's much more accurate, it also runs every game more slowly.

For example:

[screenshots not preserved]
I also ran the emulator under a profiler while getting through the first level of Yoshi's Island with automatic frameskip (to get the pictures above). This is what I got:
Code:
6.7471  PocketSNES               S9xAPUExecute()
 6.2105  PocketSNES               _ZL18S9xMainLoop_65C816v.8190
 6.0511  PocketSNES               S9xSuperFXExec()
 4.6273  PocketSNES               S9xCheckInterrupts()
 4.5370  PocketSNES               SNES::SPC_DSP::run(int)
 4.4307  PocketSNES               SNES::SPC_DSP::voice_V3c(SNES::SPC_DSP::voice_t*)
 4.2236  PocketSNES               _ZN4SNES3SMP4tickEv.local.3.constprop.491
 4.1864  PocketSNES               _ZL20DrawTile16_Normal1x1jjjj.28359
 3.7773  PocketSNES               S9xDoHEventProcessing()
 2.6510  PocketSNES               SNES::SPC_DSP::voice_V4(SNES::SPC_DSP::voice_t*)
 2.1516  PocketSNES               _ZL27DrawBackdrop16Add_Normal1x1jjj.28182
 1.7851  PocketSNES               _ZL24DrawBackdrop16_Normal1x1jjj.28177
 1.6629  PocketSNES               _Z15S9xDeinitUpdateii.part.1.32095
 1.4610  PocketSNES               SNES::SPC_DSP::voice_V8_V5_V2(SNES::SPC_DSP::voice_t*)
 1.4132  PocketSNES               _Z10S9xGetWordj9s9xwrap_t.constprop.463
 1.3654  PocketSNES               S9xSetPPU(unsigned char, unsigned short)
 1.3122  PocketSNES               _ZL6Op2CM0v.9323
 1.3122  libuClibc-0.9.33.2.so    /lib/libuClibc-0.9.33.2.so
 1.2060  PocketSNES               _ZL14DrawBackgroundihh.15565
 1.1794  PocketSNES               S9xGetPPU(unsigned short)
 1.1157  PocketSNES               S9xGetByte(unsigned int)
 1.0413  PocketSNES               _ZN4SNES7SPC_DSP3runEi.constprop.496
 0.9988  PocketSNES               S9xDoDMA(unsigned char)
 0.9828  PocketSNES               _ZL14addCyclesInDMAh.11571.1955
 0.9616  PocketSNES               _ZL20REGISTER_2118_linearh.11587
 0.8872  PocketSNES               _ZL6OpADM1v.9509
 0.8660  PocketSNES               _ZL20REGISTER_2119_linearh.11592
 0.8181  PocketSNES               _ZL12RenderScreenh.15597.1577
 0.7597  PocketSNES               _ZL6Op30E0v.9032
 0.6747  PocketSNES               _ZL8SetupOBJv.15595
 0.6269  PocketSNES               _ZL12fx_plot_4bitv.14013.1350
 0.5738  PocketSNES               _ZL6OpD0E0v.9014
 0.5100  PocketSNES               _ZN4SNES3SMP4tickEj.local.0.constprop.476
 0.4781  PocketSNES               _Z10S9xSetWordtj9s9xwrap_t15s9xwriteorder_t.constprop.364
 0.3772  PocketSNES               SNES::SPC_DSP::voice_V7_V4_V1(SNES::SPC_DSP::voice_t*)

And then I ran it with frameskip 0 to see if the rendering was a bottleneck:
Code:
9.7537  PocketSNES               _ZL20DrawTile16_Normal1x1jjjj.28359
 4.8813  PocketSNES               S9xAPUExecute()
 4.8365  PocketSNES               _ZL27DrawBackdrop16Add_Normal1x1jjj.28182
 4.6121  PocketSNES               _ZL14DrawBackgroundihh.15565
 4.5942  PocketSNES               S9xSuperFXExec()
 4.2084  PocketSNES               _ZL18S9xMainLoop_65C816v.8190
 3.5129  PocketSNES               SNES::SPC_DSP::run(int)
 3.4412  PocketSNES               _Z15S9xDeinitUpdateii.part.1.32095
 3.2527  PocketSNES               S9xCheckInterrupts()
 3.1271  PocketSNES               SNES::SPC_DSP::voice_V3c(SNES::SPC_DSP::voice_t*)
 2.9790  PocketSNES               _ZN4SNES3SMP4tickEv.local.3.constprop.491
 2.7592  PocketSNES               _ZL24DrawBackdrop16_Normal1x1jjj.28177
 2.7188  PocketSNES               S9xDoHEventProcessing()
 2.1849  PocketSNES               _ZL12RenderScreenh.15597.1577
 1.7408  libuClibc-0.9.33.2.so    /lib/libuClibc-0.9.33.2.so
 1.7183  PocketSNES               SNES::SPC_DSP::voice_V4(SNES::SPC_DSP::voice_t*)
 1.1844  PocketSNES               S9xSetPPU(unsigned char, unsigned short)
 1.1844  PocketSNES               _Z10S9xGetWordj9s9xwrap_t.constprop.463
 1.0588  PocketSNES               _ZL6Op2CM0v.9323
 0.9422  PocketSNES               SNES::SPC_DSP::voice_V8_V5_V2(SNES::SPC_DSP::voice_t*)
 0.8749  PocketSNES               S9xGetPPU(unsigned short)
 0.8300  PocketSNES               S9xGetByte(unsigned int)
 0.6864  PocketSNES               _ZN4SNES7SPC_DSP3runEi.constprop.496
 0.6326  PocketSNES               _ZL20REGISTER_2119_linearh.11592
 0.6281  PocketSNES               _ZL27DrawClippedTile16_Normal1x1jjjjjj.28333
 0.6102  PocketSNES               S9xUpdateScreen()
 0.6102  PocketSNES               _ZL20REGISTER_2118_linearh.11587
 0.5967  PocketSNES               S9xDoDMA(unsigned char)
 0.5922  PocketSNES               _ZL14addCyclesInDMAh.11571.1955
 0.5294  PocketSNES               S9xSelectTileRenderers(int, unsigned char, unsigned char)
 0.5025  PocketSNES               _ZL8SetupOBJv.15595
 0.4576  PocketSNES               _ZL6OpADM1v.9509
 0.4487  PocketSNES               _ZL6Op30E0v.9032
 0.4352  PocketSNES               _ZL23DrawTile16Add_Normal1x1jjjj.28353
 0.4217  PocketSNES               _ZN4SNES3SMP4tickEj.local.0.constprop.476
 0.3903  PocketSNES               _ZL6OpD0E0v.9014
 0.3544  PocketSNES               SNES::SPC_DSP::voice_V7_V4_V1(SNES::SPC_DSP::voice_t*)
 0.3544  PocketSNES               _ZL12fx_plot_4bitv.14013.1350

I know that some things that were quick in Snes9x 1.43 became more accuracy-focused in Snes9x 1.5x, such as tile rendering and audio chip emulation and synthesis.

On auto frameskip:
Audio chip emulation = 25.98% of total GCW CPU

On frameskip 0:
Tile and screen rendering = 26.34% of total GCW CPU
Audio chip emulation = 18.27% of total GCW CPU
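
As a cross-check, the auto-frameskip audio figure can be reproduced by adding up the rows of the first listing that belong to the S-SMP and S-DSP. This is only a sketch in C++: which rows count as "audio" is an assumption (everything under S9xAPUExecute, SNES::SMP and SNES::SPC_DSP), and kAudioAutoFrameskip and SumPercent are hypothetical names that do not exist in Snes9x.

```cpp
#include <numeric>
#include <vector>

// Percentages copied from the automatic-frameskip listing.
// The choice of rows is an assumption about what "audio chip emulation" covers.
static const std::vector<double> kAudioAutoFrameskip = {
    6.7471,  // S9xAPUExecute()
    4.5370,  // SNES::SPC_DSP::run(int)
    4.4307,  // SNES::SPC_DSP::voice_V3c
    4.2236,  // SNES::SMP::tick()            (.local.3.constprop.491)
    2.6510,  // SNES::SPC_DSP::voice_V4
    1.4610,  // SNES::SPC_DSP::voice_V8_V5_V2
    1.0413,  // SNES::SPC_DSP::run(int)      (.constprop.496)
    0.5100,  // SNES::SMP::tick(unsigned)    (.local.0.constprop.476)
    0.3772,  // SNES::SPC_DSP::voice_V7_V4_V1
};

// Hypothetical helper: total share of CPU time for a group of profile rows.
double SumPercent(const std::vector<double>& rows) {
    return std::accumulate(rows.begin(), rows.end(), 0.0);
}
```

SumPercent(kAudioAutoFrameskip) comes to 25.9789, which rounds to the 25.98% quoted for auto frameskip.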

SNES S-CPU interrupt checks are twice as slow as in 1.43; opcodes are 2x to 3x slower; the S-SMP and S-DSP (which together form the audio chip, or APU) are anywhere from 6x to 15x slower than in 1.43, which could handle all sound in at most 4% of total GCW CPU; and tile rendering is 3x slower, mainly due to backdrops.

But here's the kicker: that's after I loosened the timings on S-CPU and SA-1 emulation and made other optimisations in a lot of places in the code. The vanilla Snes9x Git code ran at half this speed, getting 35 FPS in a simple no-chip game like Super Mario World.

I would like to know if other people are interested in trying to optimise Snes9x Git with me for the GCW Zero. You can see my work thus far at https://github.com/Nebuleon/PocketSNES/commits/snes9x-git-experimental . Make yourself a fork if interested. :)

To be clear: I would like this to become PocketSNES in the future, but its current performance is unacceptable. I will stay with 1.43 until Snes9x Git's performance is acceptable, as defined by the community.
« Last Edit: February 27, 2014, 09:04:59 pm by Nebuleon »

ker

  • **
  • Posts: 601
Re: Snes9x 1.53 and Git: slow, as I expected
« Reply #1 on: February 28, 2014, 10:58:19 pm »
Thank you very much for your effort and for sharing all this information with us.

Nebuleon

  • Guest
Re: Snes9x 1.53 and Git: slow, as I expected
« Reply #2 on: March 30, 2014, 12:32:39 am »
Tried something on the Snes9x 1.43 branch after getting those memory timings (ding!), and it is not much of a speedup.

Basically, I wanted to initiate loads of 8 things at once, and wait only on the 1st. The functions in Snes9x 1.43 are able to draw 4 pixels at once, so I tried loading the pixel palette entries (4 of them) and the depth of the last layer that contributed to each pixel (also 4 of them), and interleave all the logic.

It didn't work.
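
For anyone who wants to picture the idea, here is a rough C++ sketch of "start all eight loads, then do the logic". It is not the code from tile.cpp: Write4PixelsHoisted, z1 and z2 are hypothetical names, and the depth test is only loosely modelled on a depth-tested tile write. Whether the compiler and the CPU really overlap the loads depends on the target.

```cpp
#include <cstdint>

// Hypothetical 4-pixel writer; names and parameters are illustrative only.
// z1 is the depth the new layer must beat, z2 the depth recorded on a write.
void Write4PixelsHoisted(const uint8_t* pixels, const uint16_t* palette,
                         uint16_t* screen, uint8_t* depth,
                         uint8_t z1, uint8_t z2) {
    // Phase 1: issue all eight loads (4 palette indices, 4 depths) before
    // any of them is used, so a stall on the first can overlap the rest.
    const uint8_t p0 = pixels[0], p1 = pixels[1], p2 = pixels[2], p3 = pixels[3];
    const uint8_t d0 = depth[0], d1 = depth[1], d2 = depth[2], d3 = depth[3];

    // Phase 2: depth test and write. Palette index 0 is treated as transparent.
    if (z1 > d0 && p0) { screen[0] = palette[p0]; depth[0] = z2; }
    if (z1 > d1 && p1) { screen[1] = palette[p1]; depth[1] = z2; }
    if (z1 > d2 && p2) { screen[2] = palette[p2]; depth[2] = z2; }
    if (z1 > d3 && p3) { screen[3] = palette[p3]; depth[3] = z2; }
}
```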

Then I tried to load 4 pixels at once with a LW (Load Word, 32-bit) instruction, and it slowed down Yoshi's Island by 8-10%. It seems that the benefit of minimising the number of loads is outweighed by the shifts and masks needed to get at each byte, as in uint32_t In32 = *(uint32_t *) Pixels; uint8_t Pixel_B = (In32 >> 8) & 0xFF;
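
To make the trade-off concrete, here is a small C++ sketch of the two ways to fetch 4 pixels. The names (FourPixels, LoadBytes, LoadWord) are hypothetical and not taken from tile.cpp. It assumes a little-endian target, and it uses memcpy rather than a pointer cast so that alignment and aliasing are not an issue; compilers normally turn that into a single word load where the target allows it.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical container for four 8-bit palette indices.
struct FourPixels { uint8_t p[4]; };

// Variant A: four separate byte loads, no further arithmetic.
FourPixels LoadBytes(const uint8_t* pixels) {
    return FourPixels{{pixels[0], pixels[1], pixels[2], pixels[3]}};
}

// Variant B: one 32-bit load, then a shift and a mask per byte.
// Byte order here assumes a little-endian CPU.
FourPixels LoadWord(const uint8_t* pixels) {
    uint32_t in32;
    std::memcpy(&in32, pixels, sizeof in32);  // the single 32-bit load
    return FourPixels{{
        static_cast<uint8_t>(in32 & 0xFF),
        static_cast<uint8_t>((in32 >> 8) & 0xFF),
        static_cast<uint8_t>((in32 >> 16) & 0xFF),
        static_cast<uint8_t>((in32 >> 24) & 0xFF),
    }};
}
```

Variant B saves three loads but pays for up to three shifts and four masks, which is the extra work blamed for the slowdown above.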

I can't do much more with this, it seems.

See:
commit 1d6f942: tile.cpp: Use MIPS assembly for WRITE_4PIXELS16[_FLIPPED], part of DrawTile16. (minor speedup)
commit 1a793f4: tile.cpp: Use 32-bit IO to read pixels and depths in WRITE_4PIXELS16. (slowdown)

 
