Author Topic: MIPS optimizations?  (Read 6046 times)

pinkeen

  • Guest
MIPS optimizations?
« on: July 09, 2010, 09:31:04 pm »
Hello, I started writing my own blitter because I suspect that SDL performs poorly at alpha-blended blitting.

So I wrote a basic, unoptimized reference implementation to see how it goes.

On my laptop I get (quite nice result for an unoptimized version):
~250fps with SDL
~210fps with my blitter (even though it memcpy's the final image to sdl's screen surface every frame)

On dingoo (dingux) on the other hand I get*:
~90-100fps with SDL
~20fps with my blitter

*No SDL video here: the output goes straight to the mmap'ed framebuffer. That is not the problem, since the memcpy-to-SDL-screen version runs at similar speed.

This difference is baffling since, from what I've heard, SDL is highly optimized with hand-crafted assembly on x86 but lacks such optimizations for the architectures common in embedded devices, e.g. ARM and MIPS.

I use only integer math. Is there something I should know about MIPS? Are any operations particularly slow? Any types I should avoid? Any compiler switches I should use?

I tried to compile with various levels of optimization and compiler flags, to no avail.

The blitting function: http://pastebin.com/47kZ5yvb
The macros: http://pastebin.com/6vWWiwq6

Could somebody look at this and tell me what's wrong with it? 50% of SDL's speed would be pretty slow but acceptable given no optimizations, but 20% of SDL (taking into account that it's ~85% on the PC)? Come on.

clach04

  • Posts: 256
Re: MIPS optimizations?
« Reply #1 on: July 10, 2010, 08:10:55 am »
This is sort of a wild guess... is it possible there is an alignment issue? Is the buffer you blit from word aligned? My guess is that the SDL API ensures things are aligned.

see http://en.wikipedia.org/wiki/Data_structure_alignment

Some processors/operating systems simply crash when an unaligned access takes place, others take a performance hit, and others (e.g. x86) don't care. I don't know how MIPS behaves though, so I could be leading you astray...

pinkeen

  • Guest
Re: MIPS optimizations?
« Reply #2 on: July 10, 2010, 11:43:40 am »
It's a 16-bit image, so 16 bits + alpha is 3 bytes per pixel, and that data is packed. So pixels are accessed at addresses that are multiples of 3 bytes. I'll try to align every pixel (of the alpha-channeled image) to 4 bytes and see how it goes.

Thanks.

It's a pity nobody ported valgrind for dingux, its profiler is great.

PS: I don't think SDL aligns every pixel to a 2^n-byte boundary. That would be quite a waste of memory. Or maybe it reads more than one pixel at once. Maybe I should look at SDL's sources...

---

I confirmed that SDL aligns 24-bit images to 4 bytes...
« Last Edit: July 10, 2010, 12:08:18 pm by pinkeen »
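
To make the alignment point concrete, here is a minimal sketch of the two layouts discussed above: a packed 3-byte pixel (RGB565 plus alpha) and a 4-byte padded one. All names (PACKED_BPP, packed_color, PaddedPixel, ...) are made up for illustration and are not taken from the blitter. It assumes the row itself starts at an even address. On MIPS a misaligned 16-bit load cannot be done with a single plain load instruction, so the packed layout forces either byte-wise access or a slow fix-up path. Whether that fully explains the numbers in this thread is not something the sketch can show.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layouts for illustration; not the actual blitter's types. */

/* Packed: 2 bytes of RGB565 followed by 1 byte of alpha, 3 bytes per pixel. */
enum { PACKED_BPP = 3 };

/* Every odd-numbered pixel in a packed row starts at an odd byte offset,
   so its 16-bit colour is not 2-byte aligned. */
static int packed_is_misaligned(size_t i) {
    return ((i * PACKED_BPP) & 1) != 0;
}

/* Portable read of the colour: memcpy lets the compiler pick a safe
   access sequence instead of a misaligned 16-bit load. */
static uint16_t packed_color(const uint8_t *row, size_t i) {
    uint16_t c;
    memcpy(&c, row + i * PACKED_BPP, sizeof c);
    return c;
}

static uint8_t packed_alpha(const uint8_t *row, size_t i) {
    return row[i * PACKED_BPP + 2];
}

/* Padded: one wasted byte per pixel, but every field is naturally aligned
   and a whole pixel can be fetched with a single 32-bit load. */
typedef struct {
    uint16_t color;  /* RGB565 */
    uint8_t  alpha;
    uint8_t  pad;    /* unused */
} PaddedPixel;
```

With the padded layout an array of PaddedPixel can simply be indexed, at the cost of one third more memory per image.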

flatmush

  • Posts: 288
Re: MIPS optimizations?
« Reply #3 on: July 10, 2010, 02:20:31 pm »
I've been thinking about this for a while for sml. For the alpha channel, just store the alpha image separately from the color image so that both images are correctly aligned. Your blending is slow because of the massive number of operations it does; you might want to try lookup tables for some of it.

The best accurate way I can think of without using SIMD would be something like this:
Code:
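#include <stdint.h>

/* Lookup tables mapping an RGB565 pixel to its 8-bit channel values.
   They must be filled in once at startup. */
static uint8_t red_lut[65536], green_lut[65536], blue_lut[65536];

static inline uint16_t blend(uint16_t src, uint16_t dst, uint8_t a) {
    uint32_t src_r = red_lut[src];
    uint32_t src_g = green_lut[src];
    uint32_t src_b = blue_lut[src];

    uint32_t dst_r = red_lut[dst];
    uint32_t dst_g = green_lut[dst];
    uint32_t dst_b = blue_lut[dst];

    uint32_t d = 255 - a;

    src_r *= a;
    src_g *= a;
    src_b *= a;

    dst_r *= d;
    dst_g *= d;
    dst_b *= d;

    /* Each sum is an 8-bit channel scaled by 255, i.e. a 16-bit value. */
    src_r += dst_r;
    src_g += dst_g;
    src_b += dst_b;

    src_r = src_r & 0xF800;         /* top 5 bits are already in place */
    src_g = (src_g >> 5) & 0x07E0;  /* top 6 bits moved to bits 5..10 */
    src_b = src_b >> 11;            /* top 5 bits moved to bits 0..4 */

    return (uint16_t)(src_r | src_g | src_b);
}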

That is 22 operations per pixel (pretty heavy), yours is probably quite a few more.

Remember, those macros you're using are hiding a lot of the expensive operations that you're doing. Since you're trying to reduce the number of operations with this then you might consider not using the macros or being very careful.

Using SIMD you could perform all those multiplies and adds in parallel but you'd still suffer when trying to convert between rgb565 and rgb888.

I'd find it odd that SDL wouldn't be optimized for MIPS and ARM though since these are the major platforms that would benefit from SDL optimization. I can't imagine this being the first attempt at making it optimal on these platforms.

EDIT: Looking at the SIMD instruction set it won't help that much :(. If anyone from Ingenic is listening: we need a ld565 and a st565 instruction.
« Last Edit: July 10, 2010, 02:33:17 pm by flatmush »

pinkeen

  • Guest
Re: MIPS optimizations?
« Reply #4 on: July 10, 2010, 06:12:57 pm »
Quote from: flatmush
I've been thinking about this for a while for sml, for the alpha channel just store the alpha image separately to the color image so that both images are correctly aligned. Your blending is slow because of the massive number of operations it does, you might wanna try using lookup tables for some stuff.

I've emphasized that this is a first ("do not optimize early") unoptimized version. I am aware of all the shortcomings. The thing is that this version runs at 85% of SDL's speed on my PC and 20% of SDL's speed on dingoo. That's pretty weird.

I will try to use SIMD to optimize this (fetching 4 pixels at a time). If that doesn't work well I can try this approach, but that way we're limited to 32 "grades" of blend.

flatmush

  • Posts: 288
Re: MIPS optimizations?
« Reply #5 on: July 10, 2010, 06:43:19 pm »
Quote from: pinkeen
I've emphasized that this is a first ("do not optimize early") unoptimized version. I am aware of all the shortcomings. The thing is that this version runs 85% of SDL on my PC and 20% of SDL on dingoo. That's pretty weird.

Not mocking your attempt, just trying to show you how to optimize it fully.

Quote from: pinkeen
I will try to use simd for optimizing this (fetch 4 pixels a time). If this doesn't play well I can try this approach - but this way were limited to 32 "grades" of blend.

You can only load 32-bit words with SIMD instructions too (though you can auto-increment). I looked at the SIMD instructions and there's nothing there that can really accelerate this, unless you've seen some clever use of those operations that I haven't. For a start, Q8MUL only works once you convert to RGB888, which is quite an operation-heavy task, and then it doesn't handle overflows as far as I can see, which is a killer because you need the upper 8 bits of the result, not the lower.

pinkeen

  • Guest
Re: MIPS optimizations?
« Reply #6 on: July 10, 2010, 08:13:33 pm »
Sorry flatmush, I may have overreacted a bit.

So I aligned every pixel to 4 bytes and it went from 20fps to 50fps on dingoo! I'll try to push it further without optimizations, because I think it's still realigning somewhere.

I have a few ideas. Since we have to align every pixel to 4 bytes, this space is lost anyway. I was thinking that we could use it and leave 8 bits between red and blue. That would allow us to do this without losing alpha resolution.

Do you think array lookups (especially with such a small CPU cache) are really much faster than a bitwise AND and a shift? Have you tested it?
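
For comparison with the lookup-table idea, here is a minimal sketch of the shift-and-mask alternative raised in this post: expanding each RGB565 channel to 8 bits with a couple of ALU operations and no memory access. The function names are hypothetical. Replicating the top bits into the low bits is one common way to map 31 (or 63) to 255 and is an assumption here, not something specified in the thread. Which variant is faster on the Dingoo would have to be measured.

```c
#include <stdint.h>

/* Expand the 5-bit red channel of an RGB565 pixel to 8 bits. */
static inline uint8_t r8_from_565(uint16_t p) {
    uint8_t r = (uint8_t)((p >> 11) & 0x1F);
    return (uint8_t)((r << 3) | (r >> 2));  /* replicate top bits into low bits */
}

/* Expand the 6-bit green channel of an RGB565 pixel to 8 bits. */
static inline uint8_t g8_from_565(uint16_t p) {
    uint8_t g = (uint8_t)((p >> 5) & 0x3F);
    return (uint8_t)((g << 2) | (g >> 4));
}

/* Expand the 5-bit blue channel of an RGB565 pixel to 8 bits. */
static inline uint8_t b8_from_565(uint16_t p) {
    uint8_t b = (uint8_t)(p & 0x1F);
    return (uint8_t)((b << 3) | (b >> 2));
}
```

Each function is about four simple operations, whereas each LUT version is one load from a 64 KB table that may or may not be in cache.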

clach04

  • Posts: 256
Re: MIPS optimizations?
« Reply #7 on: July 10, 2010, 09:23:12 pm »
Quote from: pinkeen
So i aligned every pixel to 4 bytes and it went from 20fps to 50 fps on dingoo! I'll try to push it more without optimizations, cause I think it's still realigning somewhere.

Awesome progress! It will be interesting to see the figures you get once the code is optimized.

pinkeen

  • Guest
Re: MIPS optimizations?
« Reply #8 on: July 10, 2010, 10:35:00 pm »
Quote from: pinkeen
So i aligned every pixel to 4 bytes and it went from 20fps to 50 fps on dingoo! I'll try to push it more without optimizations, cause I think it's still realigning somewhere.

Quote from: clach04
Awesome progress! It will be interesting to see the figures you get once the code is optimized.

Yep, there's progress, but it's still 50% relative to SDL. I wonder why that is. Something's still wrong. I'll look at how SDL does it. I glanced through the sources and it appears to be optimized for MMX, SSE and AltiVec, but there is no RISC asm so far... But I haven't read all of them yet because a friend came by... And now it's beer time ;)

EDIT:

I found no ARM or MIPS optimizations in alpha-blended blit routines. What's more, there's only one routine which blits 565 to 565 surface with alpha and it's pure software. Either way it's cleverly written. I think that without all the bloat (that is needed in a cross-platform library but not here) it can be pushed further. When I've got that figured out, I'll start working on color tinting, rotation, scaling and maybe linear filtering as an option for these.

After a closer look, it seems that this SDL routine uses a modified version of the approach described by phaeron, sacrificing accuracy for speed, with a manually unrolled loop.

Does gcc unroll loops with pointer arithmetic inside?
« Last Edit: July 11, 2010, 01:15:35 am by pinkeen »

pinkeen

  • Guest
Re: MIPS optimizations?
« Reply #9 on: July 16, 2010, 02:35:40 am »
It turned out that SDL stores 16-bit + alpha as 4 bytes per pixel, so I looked at the function void BlitARGBto565PixelAlpha(SDL_BlitInfo *info). Basically, it converts the source ARGB8888 pixel and the destination RGB565 pixel to 0G0R0B565565 and blends all components at once. So I thought that, since I'm writing a 16-bit optimized library and that space is already wasted, I can store the source pixels in that format from the beginning. Since we have only 5 bits for R and B, there will be no loss of precision on these channels with only 5 bits of alpha; the only (negligible) loss could occur on the G channel. It's a tradeoff we can live with.
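
The "blend all components at once" idea can be sketched as follows. This is not the code from ufb_blitter.c or a copy of SDL's routine: blend565 and SPREAD_MASK are made-up names, both inputs are plain RGB565 to keep it short, and the mask 0x07E0F81F is the spread layout I believe SDL 1.2 uses (green moved to the upper half-word, leaving 5 spare bits above each channel). Alpha is 5 bits (0..31), so alpha = 31 is close to, but not exactly, opaque.

```c
#include <stdint.h>

/* Spread layout: green in bits 21..26, red in bits 11..15, blue in bits 0..4.
   Each channel has at least 5 free bits above it for the multiply. */
#define SPREAD_MASK 0x07E0F81Fu

/* Blend src over dst (both RGB565) with a 5-bit alpha in the range 0..31.
   Per channel this computes floor((s*alpha + d*(32 - alpha)) / 32). */
static uint16_t blend565(uint16_t src, uint16_t dst, uint32_t alpha) {
    uint32_t s = ((uint32_t)src | ((uint32_t)src << 16)) & SPREAD_MASK;
    uint32_t d = ((uint32_t)dst | ((uint32_t)dst << 16)) & SPREAD_MASK;

    /* One multiply handles R, G and B together. The subtraction may wrap
       around, but the stray high bit is removed by the mask below. */
    d += ((s - d) * alpha) >> 5;
    d &= SPREAD_MASK;

    /* Fold green back down next to red and blue. */
    return (uint16_t)(d | (d >> 16));
}
```

Storing source pixels pre-spread, as proposed in this post, would remove the first line of the function from the inner loop; only the destination pixel would still need spreading.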

Since the previous benchmark wasn't good enough, I wrote a more synthetic one: it just blits a specified number of images without displaying anything and prints the time it took. This version easily outperformed SDL on my x86 Intel Core box! I don't know what optimizations my SDL was compiled with since I use ArchLinux, but I'm pretty sure it's at least MMX since the packages are i686. That was impressive, so I rushed to test it on dingoo. To my astonishment SDL is still faster there! WTF?
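
A benchmark of that shape can be sketched in a few lines. This is an assumed reconstruction, not the actual harness: blit_fn, time_blits, megapixels and dummy_blit are hypothetical names, and clock() is just one possible timer. The Mpx figure is simply blits x width x height / 10^6, which matches the 2.88 Mpx shown in the results for 5000 blits of 24x24.

```c
#include <time.h>

/* Hypothetical callback type: performs one blit using caller-supplied state. */
typedef void (*blit_fn)(void *ctx);

/* Total pixels touched, in megapixels. */
static double megapixels(unsigned blits, unsigned w, unsigned h) {
    return (double)blits * w * h / 1e6;
}

/* Run the blit the given number of times and return elapsed CPU seconds. */
static double time_blits(blit_fn blit, void *ctx, unsigned blits) {
    clock_t t0 = clock();
    for (unsigned i = 0; i < blits; i++)
        blit(ctx);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Stand-in workload for trying the harness: just counts calls. */
static void dummy_blit(void *ctx) {
    ++*(unsigned long *)ctx;
}
```

Throughput is then megapixels(...) divided by the time returned. With short runs the timer resolution matters, so a larger blit count gives steadier numbers.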

How is this possible? I basically have almost the same code as SDL but with fewer instructions, since I don't have to reformat the source pixel. I also tried Duff's loop unrolling, but it only speeds up bigger blits a bit... I tried different compiler options and optimization levels and nothing made a difference, but I learned that -O1 produces exactly the same code as -O2 and -O3 with this compiler...
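
For readers unfamiliar with the term, this is roughly what a 4x Duff's loop looks like. It is a generic sketch that copies pixels rather than blending them, and copy_duff is a made-up name; the loop body in the real blitter would be the per-pixel blend.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n pixels using a loop unrolled four times. The switch jumps into
   the middle of the loop body to handle the n % 4 leftover pixels first. */
static void copy_duff(uint16_t *dst, const uint16_t *src, size_t n) {
    if (n == 0)
        return;
    size_t rounds = (n + 3) / 4;
    switch (n & 3) {
    case 0: do { *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--rounds > 0);
    }
}
```

The unrolling only removes loop-counter overhead, so a small gain that shows up mainly on larger blits is consistent with what is described above.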

So here are the bench results:
x86 - Intel Core Duo 1.6 GHz running stock SDL from ArchLinux repos
Code:
Starting bench with 5000 blits.
***UFB***
        Testing 5000 random position blits of image of size 24x24
                2.880000 Mpx / 0.020000 s == 144.000000 Mpx/s
        Testing 5000 random position blits of image of size 128x128
                81.920000 Mpx / 0.650000 s == 126.030769 Mpx/s

***SDL***
        Testing 5000 random position blits of image of size 24x24
                2.880000 Mpx / 0.030000 s == 96.000000 Mpx/s
        Testing 5000 random position blits of image of size 128x128
                81.920000 Mpx / 0.720000 s == 113.777778 Mpx/s

x86 with Duff's loop
Code:
Starting bench with 5000 blits.
***UFB***
        Testing 5000 random position blits of image of size 24x24
                2.880000 Mpx / 0.020000 s == 144.000000 Mpx/s
        Testing 5000 random position blits of image of size 128x128
                81.920000 Mpx / 0.580000 s == 141.241379 Mpx/s

***SDL***
        Testing 5000 random position blits of image of size 24x24
                2.880000 Mpx / 0.030000 s == 96.000000 Mpx/s
        Testing 5000 random position blits of image of size 128x128
                81.920000 Mpx / 0.710000 s == 115.380282 Mpx/s

Dingoo A320 - without Duff's loop - compiled with -O3 -finline-functions -fomit-frame-pointer -fno-exceptions -mips32 -funroll-loops
Code:
Starting bench with 5000 blits.
***UFB***
        Testing 5000 random position blits of image of size 24x24
                2.880000 Mpx / 0.260000 s == 11.076923 Mpx/s
        Testing 5000 random position blits of image of size 128x128
                81.919998 Mpx / 8.160000 s == 10.039216 Mpx/s

***SDL***
        Testing 5000 random position blits of image of size 24x24
                2.880000 Mpx / 0.230000 s == 12.521739 Mpx/s
        Testing 5000 random position blits of image of size 128x128
                81.919998 Mpx / 6.010000 s == 13.630615 Mpx/s

The second strange thing is that SDL gets faster with bigger blits and my code gets slower. Since they're almost exactly the same how's this possible?

If you have some spare time, take a look at the code:
ufb_blitter.c
sdl's blitting function
[ASM] ufb_blitter.c compiled for dingoo with debugging symbols
[ASM] ufb_blitter.c compiled for dingoo without debugging symbols

The possible explanations that come to my mind:
- SDL on dingoo is a patched version with some crazy mips optimizations (checked it - not true)
- The compiler is really broken: unlikely, since BooBoo used the same compiler/toolchain (which he also prepared) to compile the SDL that is used on dingoo right now
- I missed some crucial compiler options, or the build process is screwed in some other way
- The most likely: I have a big plain ol' BUG

Does anyone have a clue? This is driving me crazy and I've got other projects needing some love.

mth

  • Posts: 317
Re: MIPS optimizations?
« Reply #10 on: July 21, 2010, 03:21:53 pm »
I was wondering if it would be useful to add an 18-bit video mode to OpenDingux. The extra bit of precision for red and blue probably won't matter much, but it would have one pixel spread over 4 bytes: xRGB, where x is unused except the upper bit which must be 0. This would make the SIMD instructions more usable.

mth

  • Posts: 317
Re: MIPS optimizations?
« Reply #11 on: July 21, 2010, 03:26:49 pm »
Have you tried the SDL version with both a hardware and a software surface? I don't know how the caching of the frame buffer is configured, but it's possible it is different from the caching of malloc-ed memory.

Allw

  • Posts: 9
Re: MIPS optimizations?
« Reply #12 on: July 26, 2010, 10:13:16 pm »
I was pretty interested in this thread about SDL, so even though I don't know much about it...
I wanted to ask a few questions about your work:

1) Are the images we are working on RGB565 or RGB8888? Are you just using RGB8888 as a calculation trick?


2) In your first header file you used a macro,
#define RGB888TO565(r, g, b) (((b) >> 3) | (((g) >> 2) << 5) | (((r) >> 3) << 11))

Are you still using it in the last benchmark? I don't see it in your code... do you preprocess your images?
I read somewhere that macros are slower than inline functions... so that might help
( http://yosefk.com/c++fqa/inline.html )

3) I didn't have time to dig into the code, but it seems that, if we are on RGB565, it is possible to work on 2 pixels at once
( http://www.gamedev.net/reference/articles/article1487.asp with additional explanation on http://www.gamedev.net/reference/programming/features/mmxblend/default.asp ) without any SIMD instructions, and then we should be able to use them.

I hope that helps... but I think you already know all that!!


pinkeen

  • Guest
Re: MIPS optimizations?
« Reply #13 on: July 26, 2010, 10:57:24 pm »
Quote from: Allw
1) Are the images we are working on RGB565 or RGB8888? Are you just using RGB8888 as a calculation trick?

They are RGB565, but the source image is in something like AG0R0B565565.

Quote from: Allw
2) In your first header file you used a macro,
#define RGB888TO565(r, g, b) (((b) >> 3) | (((g) >> 2) << 5) | (((r) >> 3) << 11))

Are you still using it in the last benchmark? I don't see it in your code... do you preprocess your images?
I read somewhere that macros are slower than inline functions... so that might help
( http://yosefk.com/c++fqa/inline.html )

I don't use that macro anymore. As for which is faster, it depends on many things. From what I know, they will usually be equally fast, and with inline functions you have no guarantee that the compiler will actually inline the code.
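
To illustrate the macro-versus-inline point, here is the macro quoted above next to an equivalent static inline function. The function name rgb888_to_565 is made up for this sketch. With optimization enabled a compiler will typically generate the same code for both, though that is not guaranteed. The function version evaluates each argument exactly once and type-checks them.

```c
#include <stdint.h>

/* The macro from the first header file, as quoted in the thread. */
#define RGB888TO565(r, g, b) (((b) >> 3) | (((g) >> 2) << 5) | (((r) >> 3) << 11))

/* Hypothetical inline-function equivalent: keep the top 5/6/5 bits of each
   8-bit channel and pack them as RRRRRGGGGGGBBBBB. */
static inline uint16_t rgb888_to_565(uint8_t r, uint8_t g, uint8_t b) {
    return (uint16_t)((b >> 3) | ((g >> 2) << 5) | ((r >> 3) << 11));
}
```

For example, pure red (255, 0, 0) becomes 0xF800 with either form.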

Quote from: Allw
3) I didn't have time to dig into the code, but it seems that, if we are on RGB565, it is possible to work on 2 pixels at once
( http://www.gamedev.net/reference/articles/article1487.asp with additional explanation on http://www.gamedev.net/reference/programming/features/mmxblend/default.asp ) without any SIMD instructions, and then we should be able to use them.

I already work on all three color components at once. I can't do more than one pixel at a time since I need the whole 32 bits for this (there must be enough headroom for the multiply result).


Recent developments and stuff here:
http://www.gp32x.com/board/index.php?/topic/55158-mips-specific-optimizations/

Sorry, I'm kinda lost trying to keep track of both forums and post updates on both (I always have to reread the whole topic to know what to post, so here's just a link).

 
