Optimizing Pixomatic for x86 Processors part II
http://www.ddj.com/architect/184405765
http://www.ddj.com/184405807
http://www.ddj.com/184405848
http://www.home.comcast.net/~tom_forsyth/papers/pixomatic_gdc2004.ppt
Fast way to add a null after each char of a string - http://groups.google.com/group/comp.lang.c++/browse_thread/thread/51d0f84dd22ad734?hl=en
Copy 80 bytes as fast as possible - http://codereview.stackexchange.com/questions/5520/copying-80-bytes-as-fast-as-possible
Pass by value may be faster than pass by reference - http://cpp-next.com/archive/2009/08/want-speed-pass-by-value/
Why is the Java consumer/producer so much faster than C++ (a lot of analysis and optimization tips for C++ coding) - https://groups.google.com/forum/?hl=en&fromgroups#!topic/comp.lang.c++/7aNw3PzPvMI
Case study of optimization with asm output - http://roartindon.blogspot.hk/2016/04/boosting-zopfli-performance.html
Thread-Local Storage - http://david-grs.github.io/tls_performance_overhead_cost_linux/
Removing branches can sometimes make code run faster - http://www.infoq.com/cn/articles/x86-high-performance-programming
Other articles in the same series:
http://www.infoq.com/cn/articles/x86-high-performance-programming-pipeline
http://www.infoq.com/cn/articles/x86-high-performance-programming-optimization
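
As a small illustration of the branch-removal note above, here is a minimal sketch rather than code taken from any of the linked articles. The function names count_below_branchy and count_below_branchless are hypothetical. Both functions count the elements smaller than a limit. The first uses an if inside the loop, the second adds the comparison result (0 or 1) directly, so the loop body has no data-dependent jump.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical example: count elements smaller than `limit`.
// Branchy version: the `if` may compile to one conditional jump per element.
std::size_t count_below_branchy(const std::vector<std::int32_t>& values,
                                std::int32_t limit) {
    std::size_t count = 0;
    for (std::int32_t v : values) {
        if (v < limit) {
            ++count;
        }
    }
    return count;
}

// Branch-free version: the comparison yields false/true, which converts
// to 0 or 1 and is added unconditionally.
std::size_t count_below_branchless(const std::vector<std::int32_t>& values,
                                   std::int32_t limit) {
    std::size_t count = 0;
    for (std::int32_t v : values) {
        count += static_cast<std::size_t>(v < limit);
    }
    return count;
}
```

Whether the second form is actually faster depends on the compiler, the optimization level, and how predictable the data is. An optimizing compiler may already turn the first loop into branch-free or vectorized code, so it is worth checking the asm output and measuring before committing to either version.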