Topic: TCmalloc (Google's gperftools) built with GCW MIPS32r2 uClibc toolchain

Senor Quack (OP)

  • Posts: 223
I recently cross-compiled Google's gperftools, which includes TCmalloc, a drop-in replacement for malloc/free, and figured I'd provide a download link here to save others the trouble. TCmalloc is a high-performance, thread-friendly heap allocator that can do lock-less allocations on some platforms, though I am not sure whether that is the case on our MIPS/uClibc platform. It can also sometimes lower overall heap usage and fragmentation. gperftools/tcmalloc also offers profiling and debugging capabilities.

GCW build tarball download link

Extract it into the folder that contains your gcw0-toolchain/ subfolder.

Basic usage: Link against whichever of the provided tcmalloc implementations best suits your purposes. If linking against the standard tcmalloc lib, add the following to your final link command:
Code:
-Wl,-R -Wl,. -ltcmalloc
NOTE: -ltcmalloc should come after all the other libraries in the link command, or else tcmalloc's debugger/profiler capabilities might not detect everything they should. The -R option adds "." to the dynamic loader's library search path. Strictly speaking, "." is the current working directory rather than the binary's own folder, so the library is found only when the program is launched from the folder it lives in. This means, of course, that when copying your newly built program over to the GCW, you must also place a copy of the correct .so dynamic library in the same folder. For example, if linking against standard tcmalloc (-ltcmalloc), you would include a copy of gcw0-toolchain/usr/mipsel-gcw0-linux-uclibc/sysroot/usr/lib/libtcmalloc.so.4 next to your executable. You can also link against the other versions of tcmalloc that gperftools provides by replacing -ltcmalloc with -ltcmalloc_debug, -ltcmalloc_and_profiler, and so on.

Add the following to your GCC/G++ compilation flags for maximum compatibility with gperftools: 
Code:
-fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free

Quick and dirty usage: If you want, you can try the LD_PRELOAD trick to force-load any of the provided libraries. With tcmalloc, this should override every use of malloc/free in your program, provided the program was linked dynamically against libc. It's not as effective and not recommended by Google, so I won't say more about it here.

Building gperftools yourself: This is tricky if you are using the 2014-08-20 OpenDingux GCW toolchain, because the build fails on the profiling code, complaining about missing symbols in siginfo.h. Until a newer GCW toolchain comes out, you will need a newer, patched version of siginfo.h from uClibc's official repo, which I am providing. You'll also need to install autoconf, automake, and libtool on your host system. In the gperftools dir, run ./autogen.sh, which creates a configure script along with some other files. The configure script requires the GCW toolchain to be in your path, so add the following to your ~/.bash_profile file, creating it if it doesn't exist:
Code:
export PATH=/opt/gcw0-toolchain/usr/bin/:$PATH
Open a new shell or run 'source ~/.bash_profile' in your current shell so the path change takes effect. Next, run the configure command with the following parameters, assuming your toolchain is in /opt/gcw0-toolchain/:
Code:
./configure CC=/opt/gcw0-toolchain/usr/bin/mipsel-gcw0-linux-uclibc-gcc CFLAGS="-O2" CXX=/opt/gcw0-toolchain/usr/bin/mipsel-gcw0-linux-uclibc-g++ CXXFLAGS="-O2" CPP=/opt/gcw0-toolchain/usr/bin/mipsel-gcw0-linux-uclibc-cpp CXXCPP=/opt/gcw0-toolchain/usr/bin/mipsel-gcw0-linux-uclibc-cpp --exec-prefix="/opt/gcw0-toolchain/usr/mipsel-gcw0-linux-uclibc/sysroot/usr" --prefix="/opt/gcw0-toolchain/usr/mipsel-gcw0-linux-uclibc/sysroot/usr" --host=mipsel-gcw0-linux-uclibc
After that, run 'make', then 'make install'.
« Last Edit: April 09, 2016, 05:35:19 pm by Senor Quack »

David Knight

  • Posts: 577
Thank you for sharing. Some comparisons suggest a several-fold improvement in malloc performance. I will try it out.

Also, I have just enrolled on a computing degree course, looking forward to studying computing in earnest  :)

Senor Quack (OP)

  • Posts: 223
That's great. I'm also trying to go back and finish up a CS degree after quite a few years away from school. I'm currently trying to focus on a lot of the newer C++ features and multithreading.

A more effective way to improve performance in parts of a game that make a lot of calls to new/delete/malloc/free is to use a pooled memory manager, using multiple specialized pools for different sizes of objects. Boost offers a general purpose one, but I wrote one for my ongoing port of Hurrican that is simpler to use while still being templatized. I'll post a thread about it once I release the game.
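To make the pooled approach concrete, here is a minimal sketch of a fixed-capacity pool for a single object type. It is not the Hurrican pool mentioned above, and the names ObjectPool, create, destroy, and in_use are hypothetical. It assumes single-threaded use and that T's constructor does not throw.

```cpp
#include <cstddef>
#include <new>
#include <utility>

// Hypothetical fixed-capacity pool for objects of one type T.
// All storage lives inside the pool object, so create()/destroy()
// never call malloc/free. Not thread-safe.
template <typename T, std::size_t Capacity>
class ObjectPool {
public:
    ObjectPool() {
        // Thread every slot onto the free list.
        for (std::size_t i = 0; i < Capacity; ++i) {
            slots_[i].next = free_head_;
            free_head_ = &slots_[i];
        }
    }
    ObjectPool(const ObjectPool&) = delete;
    ObjectPool& operator=(const ObjectPool&) = delete;

    // Constructs a T in a free slot; returns nullptr if the pool is exhausted.
    // Assumes T's constructor does not throw.
    template <typename... Args>
    T* create(Args&&... args) {
        if (free_head_ == nullptr) return nullptr;
        Slot* slot = free_head_;
        free_head_ = slot->next;  // read the link before the slot is overwritten
        ++in_use_;
        return ::new (static_cast<void*>(slot->storage))
            T(std::forward<Args>(args)...);
    }

    // Destroys an object obtained from create() and recycles its slot.
    void destroy(T* obj) {
        if (obj == nullptr) return;
        obj->~T();
        Slot* slot = reinterpret_cast<Slot*>(obj);
        slot->next = free_head_;
        free_head_ = slot;
        --in_use_;
    }

    std::size_t in_use() const { return in_use_; }

private:
    // A slot holds either a free-list link or the bytes of a live T.
    union Slot {
        Slot* next;
        alignas(T) unsigned char storage[sizeof(T)];
    };

    Slot slots_[Capacity];
    Slot* free_head_ = nullptr;
    std::size_t in_use_ = 0;
};
```

A game would typically keep one such pool per frequently created type (bullets, particles, and so on), for example a static ObjectPool<Particle, 512>, and treat a nullptr from create() as "too many live objects".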
« Last Edit: April 06, 2016, 05:10:08 am by Senor Quack »

howie_k

  • Posts: 157
Great to hear someone is working on porting Hurrican! :)

Senor Quack (OP)

  • Posts: 223
Pickle deserves the larger part of the credit, as he did the SDL/OpenGL and GLES port from the original DirectX code, which was a ton of hard work. I added the 320x240 low-res support and rewrote all the rendering code so that it batches draw calls. I've also rewritten large portions of the game, further optimizing it for speed and fixing many bugs in the original code. Look for it soon!

Quickman

  • Posts: 220
Curious, is Hurrican that Turrican fan game? Because if it is, it looks pretty dang incredible!

howie_k

  • Posts: 157
@Quickman, sure is! See here:
http://www.winterworks.de/project/hurrican/

Senor Quack (OP)

  • Posts: 223
Yep, it's indeed the same Hurrican homebrew game, from 'Eiswuxe' and 'Turri' of the Poke53280 group. Eiswuxe has a new game out on Steam and Android, and I think it's also coming to Nintendo platforms. Check it out:

https://play.google.com/store/apps/details?id=de.eiswuxe.blookid2&hl=en
http://store.steampowered.com/app/379640/   (Steam version is Windows-only for now)
http://www.winterworks.de/project/bloo-kid-2/
« Last Edit: April 08, 2016, 05:43:39 am by Senor Quack »

Quickman

  • Posts: 220
Awesome, thanks guys! Looks fantastic!

toto

  • Posts: 147
Guys, sorry for my ignorance, but what is TCmalloc about? In simple words, what improvement can we expect from it on the GCW? Thanks for your hard work!

Senor Quack (OP)

  • Posts: 223
Re: TCmalloc (Google's gperftools) built with GCW MIPS32r2 uClibc toolchain
« Reply #10 on: April 09, 2016, 05:31:01 pm »
It wasn't hard work at all. I just cross-compiled the library for my own use and experimentation, and since I had to fix one issue that took an hour or so to figure out, I figured I might as well provide the package for other programmers here to play with if they like.

TL;DR version:
As for benefits to end-users, I don't think you can expect much. TCmalloc's speed benefits show up when a program has many threads trying to allocate memory at once, as in a database or HTTP server. Unless a game is poorly coded and somehow makes several hundred or more calls to malloc/new per frame, or launches many threads, there's little TCmalloc can do to speed things up. We don't have multiple CPU cores anyway, so a game that relies on multiple threads for performance is already going to be slow by design. There is a possibility that TCmalloc might help speed up our 3D graphics library if a game makes many separate draw calls per frame, but I haven't experimented with that yet. I also haven't determined how many malloc() calls our 3D library makes for a given set of draw calls, or whether that number varies at all with the complexity of the incoming requests.
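One rough way to check whether a C++ game really makes hundreds of allocations per frame is to replace the global operator new and count calls. The sketch below is only an illustration: g_new_calls and take_new_call_count are hypothetical names, the counter assumes single-threaded use, and it sees only C++ new/new[] in the program itself, not direct malloc() calls made inside C libraries such as a 3D driver.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical counter, bumped on every call to the global operator new.
// Not atomic: assumes allocations happen on a single thread.
static std::size_t g_new_calls = 0;

// Replacement for the global allocation function. The default
// operator new[] forwards here, so array allocations are counted too.
void* operator new(std::size_t size) {
    ++g_new_calls;
    if (size == 0) size = 1;  // malloc(0) may legally return nullptr
    if (void* p = std::malloc(size)) return p;
    throw std::bad_alloc();
}

// Matching replacement so that memory obtained above is released with free().
void operator delete(void* p) noexcept { std::free(p); }

// Returns the number of operator new calls since the previous call,
// then resets the counter. Intended to be called once per frame.
std::size_t take_new_call_count() {
    std::size_t n = g_new_calls;
    g_new_calls = 0;
    return n;
}
```

Calling take_new_call_count() at the end of each frame and logging the result gives a quick first impression of whether the allocator is worth worrying about at all.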

Longer version:
TCMalloc replaces the default malloc/free implementation when you link it into a program. Malloc handles dynamic memory allocation for a program, i.e., whenever a running program needs a new chunk of memory, it calls malloc() to ask for it. Malloc will either give the program a chunk of free memory it's already holding, or, if necessary, ask the OS for a large chunk of memory, take control of it, and give the program some of it. When the program is done with the chunk of memory, it calls free(), and the malloc implementation will either hold onto that chunk for future use by the program or decide to return it to the OS. Keeping track of all these memory requests can take the CPU quite some time, stealing time away from what the program needs to be doing. Further adding to the overhead is the fact that malloc implementations must support concurrent access from multiple threads of execution.

This implementation of malloc has debugging and profiling features that some programmers here might find useful. TCMalloc is also known to be much faster for programs with many threads contending for access to the allocator. It might even be faster for single-threaded programs if it avoids taking a mutex on each allocation, since our system likely doesn't have the fastest mutex support. TCMalloc's internals are lock-less where possible, and where they are not, it likely prefers lightweight locks built on atomic operations over more costly mutexes. It is also more sophisticated than older malloc implementations in how it handles requests for differently sized chunks of memory, which can reduce a program's RAM usage by reducing fragmentation and the size of internal metadata. But we'd have to benchmark it to find out how much benefit, if any, it has over our default malloc implementation. I have not done that (yet).

Usually, a well-coded program like a good emulator or game will already be mindful of the overhead of calling malloc/free and will instead use statically allocated memory, such as arrays or memory pools, to satisfy its needs. So you likely won't see benefits from something like TCMalloc. When a program is not very well-coded and makes lots of malloc/free requests, it's better to first try to increase performance by eliminating as many calls to malloc/free as possible, through the use of mempools/arrays, rather than by switching to a different malloc implementation.
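As an illustration of the statically allocated approach, here is a minimal sketch of a scratch arena backed by a fixed array: allocations just bump an offset, and the whole arena is reset in one step, for example once per frame. FrameArena and its members are hypothetical names. The sketch assumes single-threaded use and that the caller only stores trivially destructible data in it, since no destructors are ever run.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scratch arena backed by a fixed-size array.
// alloc() bumps an offset; reset() discards everything at once.
// Not thread-safe, and it never runs destructors.
template <std::size_t Bytes>
class FrameArena {
public:
    // Returns `size` bytes aligned to `align` (which must be a power of two),
    // or nullptr if the arena does not have enough room left.
    void* alloc(std::size_t size,
                std::size_t align = alignof(std::max_align_t)) {
        const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(buffer_);
        const std::uintptr_t mask = static_cast<std::uintptr_t>(align) - 1;
        // Round the current position up to the requested alignment.
        const std::uintptr_t aligned = (base + used_ + mask) & ~mask;
        const std::size_t offset = static_cast<std::size_t>(aligned - base);
        if (offset > Bytes || size > Bytes - offset) return nullptr;
        used_ = offset + size;
        return buffer_ + offset;
    }

    // Makes the whole buffer available again; call once per frame.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    alignas(std::max_align_t) unsigned char buffer_[Bytes];
    std::size_t used_ = 0;
};
```

Declared as a static object, such as static FrameArena<64 * 1024> scratch, the buffer is part of the program image and temporary per-frame data never touches malloc/free.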
« Last Edit: April 09, 2016, 09:24:09 pm by Senor Quack »

 
