The Best Android Mobile Benchmarks

Although there is strong evidence that higher benchmark scores do not always translate into real-world performance gains, the best benchmarks do serve a purpose. They are objective and can expose real differences in hardware performance. Unfortunately, there are over fifty mobile benchmarks to choose from, and picking the best ones is not easy. Some benchmarks have serious problems and do not produce meaningful results. Others haven't been updated in years and should be taken down. A few are bankrolled by companies with a long history of cheating. In the end, my research showed there are four or five benchmarks that really stand out and deserve a "Best" rating. A few others are very good, but not quite as good as those with a "Best" rating. Quite a few benchmarks are good for certain things, so I placed them in their own category. Lastly, I found more than a few apps that have serious flaws and shouldn't be used. Here are rankings of mobile benchmarks that will help you determine which to use and which to avoid. This article is focused on Android benchmarks, but quite a few of these are available for iOS as well.


Best

  • 3DMark (Sling Shot) – One of the best GPU benchmarks. It incorporates volumetric lighting and particle illumination, as well as depth of field and bloom post-processing effects. Expect very low frame rates on the graphics tests. Although Sling Shot includes several good tests in which physics (simulated worlds and particle systems) is computed on the CPU, it isn't the best benchmark for overall CPU performance. A useful graph is displayed after the test is complete, which plots the CPU frequency, temperature and frame rate for each of the tests. Scores vary depending on which of the modules you run. Even though the tests appear similar, scores from ES 3.1 mode should not be compared to scores from ES 3.0 mode. Requires Android 5.0 (or later).
  • GFXBench (formerly GL Benchmark) – This suite of 14 different tests is one of the best GPU benchmarks. Its "Car Chase" test was the first to test devices with hardware tessellation support. It also includes HDR tone mapping, bloom, lens flares, particles, motion blur and more. Issue: temperature and clock speed are not reported on devices like the Nexus 6.
  • PCMark for Android – This benchmark measures the performance and battery life of an Android device while browsing the web, editing photos, watching videos and working with documents. Real applications are used, so the results are supposed to reflect real-world performance. The "Work battery life" test measures the time required to drain the battery from a full charge to 20%. This benchmark is useful, but it isn't a true test of processor efficiency, because the end result has a lot to do with the capacity of the battery in the device (a rough way to normalize for battery capacity is sketched after this list). Still, it's one of the best battery tests.
  • Vellamo – One of the best mobile web benchmarks. Although it's known for its HTML5 and JavaScript browser performance tests, the Browser Chapter also includes SunSpider and Google's Octane benchmark, as well as page load, text reflow, scrolling and crypto tests. Vellamo also has a good collection of multi-core benchmarks (Multicore Chapter), which include Linpack, Sysbench and Threadbench. Lastly, Vellamo's Metal Chapter includes the Dhrystone and Linpack benchmarks, as well as storage and RAM memory tests. It should be mentioned that the person who created and maintains Vellamo is an employee of Qualcomm, although I've never seen any evidence that Vellamo's browser tests favor Snapdragon processors.
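As mentioned in the PCMark entry above, a battery-life result tells you little about processor efficiency unless you account for battery capacity. Here is a minimal sketch of one way to normalize such a result; the capacities, voltage and runtimes are made-up illustrative numbers, not measurements from any particular device.

```python
# Rough normalization sketch: convert a "full charge to 20%" runtime into an
# average power draw so devices with different battery capacities can be compared.
# All numbers below are illustrative assumptions, not real test results.
def avg_power_mw(capacity_mah, runtime_hours, nominal_voltage=3.85):
    energy_mwh = capacity_mah * nominal_voltage * 0.80   # 80% of the pack is used
    return energy_mwh / runtime_hours

# e.g. a 3,000 mAh phone lasting 8.0 h vs. a 2,300 mAh phone lasting 7.0 h
print(avg_power_mw(3000, 8.0))   # ~1155 mW
print(avg_power_mw(2300, 7.0))   # ~1012 mW -> lower average draw despite the shorter runtime
```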


Very Good

  • Androbench – A good way to measure the storage performance of an Android device. It measures sequential reads/writes and random reads/writes (a simplified sketch of these two access patterns follows this list).
  • Geekbench 4 – One of the better single-core CPU benchmarks. It also tests memory and multi-core performance, and adds new GPU compute tests, although it's too soon to say how good those are. Requires Android 5.0 or later.
  • JetStream – A relatively new JavaScript benchmark that is similar to Vellamo and PCMark's Web Browsing test. It effectively replaces SunSpider and Octane because it includes SunSpider 1.0.2 and Octane 2. Its makers claim it is better because "each benchmark measures a distinct workload, and no single optimization technique is sufficient to speed up all benchmarks." Latency tests check that a web application can start quickly, ramp up to peak performance, and run smoothly without interruptions. Throughput tests measure the sustained peak performance of a web application. It's also supposed to be harder to game, because aggressive optimizations for one benchmark could make another benchmark slower.
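To make the Androbench entry above more concrete, here is a minimal Python sketch of the difference between sequential and random reads. It is only an illustration of the access patterns: it reads a freshly written file, so most reads will be served from the page cache, and a real storage benchmark would bypass the cache (for example with O_DIRECT) and also test writes.

```python
# Illustrative sketch of sequential vs. random reads (not a trustworthy benchmark:
# the freshly written file will largely be served from the page cache).
import os, random, time

PATH, SIZE, BLOCK = "testfile.bin", 64 * 1024 * 1024, 4096   # 64 MB file, 4 KB blocks

with open(PATH, "wb") as f:
    f.write(os.urandom(SIZE))

def read_mb_per_s(shuffle):
    offsets = list(range(0, SIZE, BLOCK))
    if shuffle:
        random.shuffle(offsets)            # random access pattern
    start = time.time()
    with open(PATH, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
    return SIZE / (1024 * 1024) / (time.time() - start)

print(f"sequential: {read_mb_per_s(False):.0f} MB/s, random: {read_mb_per_s(True):.0f} MB/s")
os.remove(PATH)
```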

Useful in Some Cases

  • 4GMark – A speed and quality-of-service benchmark for 2G/3G/4G cellular and Wi-Fi networks. After testing, you can compare your results against other users in your country or area, or against users with the same device.
  • AndEBench-Pro 2015 – A suite of tests measuring CPU, GPU, memory and storage performance. It also gauges XML parsing, GUI rendering, image manipulation, data compression and cryptography tasks embedded in actual workloads. This benchmark is a product of EEMBC, which is led by Intel. The app is based on AndEBench, which gets only 3 stars in Google Play. It was last updated in 2015 for Lollipop and is overdue for an update.
  • AnTuTu 6.0 – Better as a CPU test than a GPU test. AnTuTu is also not a good indicator of performance changes over time, because its scores sometimes change dramatically as new versions are released. For example, bloggers benchmarking the Snapdragon 820 with AnTuTu 6.0 saw scores over 130,000. At the same event, on the same hardware, AnTuTu 5.7 reported scores around 70,000. That's almost a 2x increase, which makes this benchmark very misleading. Although some of the best tech bloggers (e.g. AnandTech, Engadget and Ars Technica) no longer use AnTuTu, it's still one of the most popular Android benchmarks and the one that handset manufacturers like Samsung value the most. It also has more users than any other benchmark. For these reasons, I'm not moving it to the 'Not Recommended' section of this article, even though it probably deserves to be there.
  • Basemark ES 3.1 – Measures the OpenGL ES 3.1 graphics performance of your device. It provides four sub-scores: Lighting, Compute, Instancing and Post-Processing. It's part of the Basemark GPU Mobile test suite, which has a Pro version that reports FPS and other stats. I considered moving this to the "Very Good" section of this article, but after reading its mixed reviews and seeing that it doesn't run on most mobile devices, I'm leaving it here for now.
  • DiscoMark – This little-known benchmark measures the launch-times of applications that you select. On the plus side, this test reflects the real-world performance of your phone. On the negative side, comparisons are meaningless, unless the same apps are selected.
  • Basemark GUI Free – Performs vertex streaming and blending performance measurements. Its vertex test is good, although the blend test is not great. It also hasn’t been updated since 2014.
  • Basemark X – A decent cross-platform graphics benchmark based on the Unity 4.2 game engine. This used to be one of the more demanding graphic benchmarks, but it hasn’t been updated since 2014, so it’s showing signs of age. Its off-screen test is also not completely resolution independent.
  • CF-Bench – A CPU and memory benchmark designed for multi-core devices. Although it produces a “final” score, its creators say you should take those with a grain of salt. Hasn’t been updated since 2013.
  • CompuBench RS – A RenderScript benchmark that tests compute performance of Android mobile devices. Still being updated, but not very popular.
  • Dhrystone – An older synthetic computing benchmark that provides an indication of CPU integer performance. This benchmark isn't a good indicator of overall performance, but it is still used by some chip manufacturers as a load to determine peak power consumption. Dhrystone 2.1 is part of Vellamo's Metal Chapter.
  • Epic Citadel isn’t a traditional benchmark, but it does have a “benchmark mode,” which reports an average frame rate after a game loop runs. I feel this app is useful because its graphics are representative of the real world games.
  • GameBench – GameBench is one of the more popular FPS testing apps. However, it must run for 10-15 minutes in order to get a frame rate reading, and there is evidence the results are not always accurate. I wrote an article that compares GameBench with other apps that report frame rates; GameBench is "App 2" in those tests. Note: this app also has privacy issues. It sends your email address, test scores and other personal data to the cloud, where paid users can access it.
  • Google Octane – A good test of JavaScript performance in browsers. This test is part of JetStream and the Vellamo Browser Chapter, so most users won't need to run it.
  • Kraken – Yet another JavaScript benchmark. Still used by Ars Technica and some other bloggers.
  • SPECint 2006 – This benchmark is used by chip manufacturers and OEMs to measure CPU performance. It contains 12 different benchmark tests that stress a system’s processor and memory subsystem. The reason this benchmark doesn’t appear in the above sections is because it’s not available in Google Play and costs $800.
  • TabletMark – An automated tool that evaluates system performance on a range of activities, including web browsing, email, photo and video sharing, and video playback. It also includes a day-in-the-life battery test that includes idle time. While this benchmark sounds interesting, it's worth mentioning that the app has fewer than 1,000 downloads and a 3.6-star rating.
  • Trepn Profiler – This app isn't a benchmark, but it reports accurate power readings and displays the processor frequencies as an overlay on any app. This is a good way to see whether your processor is throttling under a heavy load: when a processor is overworked and gets too hot, its frequency is reduced, which causes a drop in performance (a minimal polling sketch follows this list). The only reason this app is not in a higher category is that newer mobile processors (including the Snapdragon 808, 810, 820 and 821) have a PMIC that only reports power readings every 30 seconds, which can affect the accuracy of average power readings. [Disclosure: I was involved in the creation of this product]
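If you just want to watch for throttling without installing anything, here is a minimal sketch that polls the kernel's cpufreq interface. It assumes a shell with Python where the sysfs node is readable (for example adb shell or Termux on many devices); the path, core number and polling interval are illustrative and vary by device.

```python
# Minimal throttling-watch sketch: poll the current frequency of cpu0 and print
# how far it has moved from the first reading. Path and interval are assumptions.
import time

FREQ_NODE = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"

def read_freq_khz():
    with open(FREQ_NODE) as f:
        return int(f.read().strip())

baseline = read_freq_khz()
for _ in range(60):                                   # watch for about a minute
    freq = read_freq_khz()
    change = 100.0 * (freq - baseline) / baseline     # negative when throttled down
    print(f"{freq / 1000:.0f} MHz ({change:+.1f}% vs. first reading)")
    time.sleep(1)
```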

Not Recommended

  • 3DMark (Ice Storm) – Not the best test of advanced GPU performance. 3DMark Sling Shot has effectively replaced this test.
  • AnTuTu 4.0 – This app has heavy vertex shader complexity that is unlike real-world games, and it takes no account of tiled rendering architectures.
  • AnTuTu 5.0 – This app's 2D tests are not representative of real-world games. This version was replaced by AnTuTu 6.0, which is better.
  • Basemark OS II – A system-level benchmark designed to measure overall performance. In addition to its overall score, four different areas are evaluated: system, memory, graphics and web browsing. Not recommended because the rankings on its Powerboard web site aren't credible and the free version is missing several features promoted on the product page. The battery and external memory tests are available in the full version, but I can't find a camera test anywhere. Also, this benchmark hasn't been updated since 2014 and it is one of the lower-ranked popular benchmarks on Google Play (3.9 stars).
  • BenchmarkPi – One of several benchmarks that measures performance by calculating Pi. This benchmark isn’t recommended because it only tests the CPU and is no longer used by most bloggers. It also hasn’t been updated since 2009.
  • BenchmarkXPRT – A collection of different benchmarks. You won’t find the word “Intel” on the BenchmarkXPRT website, but if you check the small print on some Intel websites you’ll find they admit “Intel is a sponsor and member of the BenchmarkXPRT Development Community, and was the major developer of the XPRT family of benchmarks.” Intel also says “Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.” Bottom line: Intel made these benchmarks to make Intel processors look good and other processors look bad. This benchmark should not be used.
  • BrowserMark – A cross-platform browser benchmark with issues that make cross-platform comparisons questionable.
  • CaffeineMark – A series of online tests that measure the speed of Java programs. CaffeineMark scores roughly correlate with the number of Java instructions executed per second, and are not supposed to depend on the amount of memory available or the speed of the Internet connection. Not recommended, because this test was created in 1997 for PCs and the Android app hasn't been updated since 2011. Much better benchmarks now exist.
  • CompuBench CL Mobile – Tests the compute performance of Android mobile devices supporting OpenCL. Tests include face detection, particle simulation, fractal rendering, ambient occlusion, raycast, gaussian blur and histogram normalization. It crashes on many devices and cannot even be installed on many others, so it has a poor rating.
  • Google V8 – Another browser benchmark focused on JavaScript performance. It was effectively replaced by Google Octane, which adds five tests on top of the ones already in V8.
  • Linpack – Measures the floating-point performance of the CPU. Linpack is part of Vellamo's Multicore and Metal chapters, so it's not really needed on its own. It's also no longer used by most bloggers who benchmark and hasn't been updated since 2011.
  • MobileXPRT – Another member of the BenchmarkXPRT family (see BenchmarkXPRT above). Intel sponsored and was the major developer of these benchmarks, and admits that "Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors." Bottom line: Intel made these benchmarks to make Intel processors look good and other processors look bad. MobileXPRT should not be used.
  • Nenamark 1 – An OpenGL ES 2.0 graphic benchmark that is meaningless, because all modern devices hit its 60fps framerate limit. Hasn’t been updated since 2011.
  • Nenamark 2 – An OpenGL ES 2.0 graphic benchmark that is supposed to have more advanced effects and higher resolution graphics than NenaMark1. Hasn’t been updated since 2012.
  • Nenamark 3 – Another OpenGL ES benchmark that is supposed to continuously grow more complex until the system cannot handle it any more. However, it doesn't allow you to change the resolution, so a phone with a very high-resolution screen is likely to perform worse than a budget phone with a low-resolution screen. This is also why it favors iPhones over Android flagships like the Nexus 6P.
  • Passmark – Tests CPU, 2D graphics, 3D graphics, storage and memory performance. Hasn't been updated since 2013.
  • Pi – Measures how long it takes to calculate Pi to 10 million digits. Not a useful benchmark because it only measures one thing.
  • Quadrant Standard Edition – Mostly a CPU benchmark, although it claims to also test memory, I/O and graphics. Hasn't been updated since 2012.
  • Smartbench – A multi-core-friendly benchmark that measures overall performance. Tests productivity and gaming. Last updated in 2012. Poorly rated on Google Play (3.8 stars).
  • SunSpider – The JavaScript benchmark SunSpider is no longer being updated. Its creators recommend JetStream. Even when it was still popular, the data that SunSpider used was so small that it was more of a cache test than a JavaScript benchmark.
  • WebXPRT – Another member of the BenchmarkXPRT family (see BenchmarkXPRT above). Intel sponsored and was the major developer of these benchmarks, and admits that "Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors." Bottom line: Intel made these benchmarks to make Intel processors look good and others look bad. WebXPRT should not be used.

I hope you find this article to be of use. If you have any comments please enter them below.

– Rick

Copyright 2016 Rick Schwartz. All rights reserved. Linking to this article is encouraged. All of the comments in this blog are Rick’s alone, and do not reflect the views of his employer.

The Dirty Little Secret About Mobile Benchmarks

 

This article has had almost 30,000 views. Thanks for reading it.

When I wrote this article over a year ago, most people believed mobile benchmarks were a strong indicator of device performance. Since then a lot has happened: both Samsung and Intel were caught cheating, and some of the most popular benchmarks are no longer used by leading bloggers because they are too easy to game. By now almost every mobile OEM has figured out how to "game" popular benchmarks, including 3DMark, AnTuTu, Vellamo 2 and others. Details. The iPhone hasn't been called out yet, but Apple has been caught cheating on benchmarks before, so there is a high probability they are employing one or more of the techniques described below, such as driver tricks. Although Samsung and the Galaxy Note 3 have received a bad rap over this, the actual impact on their benchmark results was fairly small, because none of the GPU frequency optimizations that helped the Exynos 5410 scores exist on Snapdragon processors. Even when it comes to the Samsung CPU cheats, this time around the performance deltas were only 0-5%.

11/26/13 Update: 3DMark just delisted mobile devices with suspicious benchmark scores. More info.

2/1/17 Update: XDA just accused Chinese phone manufacturers of cheating on benchmarks. You can read the full article here.

Mobile benchmarks are supposed to make it easier to compare smartphones and tablets. In theory, the higher the score, the better the performance. You might have heard the iPhone 5 beats the Samsung Galaxy S III in some benchmarks. That’s true. It’s also true the Galaxy S III beats the iPhone 5 in other benchmarks, but what does this really mean? And more importantly, can benchmarks really tell us which phone is better than another?

Why Mobile Benchmarks Are Almost Meaningless

    1. Benchmarks can easily be gamed – Manufacturers want the highest possible benchmark scores and are willing to cheat to get them. Sometimes this is done by optimizing code so it favors a certain benchmark. In this case, the optimization results in a higher benchmark score, but has no impact on real-world performance. Other times, manufacturers cheat by tweaking drivers to ignore certain things, lower quality to improve performance, or offload processing to other areas. The bottom line is that almost all benchmarks can be gamed. Computer graphics card makers found this out a long time ago, and there are many well-documented accounts of Nvidia, AMD and Intel cheating to improve their scores.

    Here's an example of this type of cheating: Samsung created a white list for Exynos 5-based Galaxy S4 phones which allows some of the most popular benchmarking apps to shift into a high-performance mode not available to most applications. These apps run the GPU at 532MHz, while other apps cannot exceed 480MHz. This cheat was confirmed by AnandTech, the most respected name in both PC and mobile benchmarking. Samsung claims "the maximum GPU frequency is lowered to 480MHz for certain gaming apps that may cause an overload, when they are used for a prolonged period of time in full-screen mode," but it doesn't make sense that S Browser, Gallery, Camera and the Video Player apps can all run with the GPU wide open while all games are forced to run at a much lower speed.

    Samsung isn't the only manufacturer accused of cheating. Back in June, Intel shouted at the top of their lungs about the results of an ABI Research report that claimed their Atom processor outperformed ARM chips by Nvidia, Qualcomm and Samsung. This raised quite a few eyebrows, and further research showed the Intel processor was not completely executing all of the instructions. After AnTuTu released an updated version of the benchmark, Intel's scores dropped overnight by 20% to 50%. Was this really cheating? You can decide for yourself, but it's hard to believe Intel didn't know their chip was bypassing large portions of the tests AnTuTu was running. It's also possible to fake benchmark scores, as in this example.

    Intel has even gone so far as to create their own suite of benchmarks that they admit favor Intel processors. You won't find the word "Intel" anywhere on the BenchmarkXPRT website, but if you check the small print on some Intel websites you'll find they admit "Intel is a sponsor and member of the BenchmarkXPRT Development Community, and was the major developer of the XPRT family of benchmarks." Intel also says "Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors." Bottom line: Intel made these benchmarks to make Intel processors look good and others look bad.
    2. Benchmarks measure performance without considering power consumption – Benchmarks were first created for desktop PCs. Those PCs were always plugged into the wall and had multiple fans and large heat sinks to dissipate the massive amounts of power they consumed. The mobile world couldn't be more different. Your phone is rarely plugged into the wall, even when you are gaming. Your mobile device is also very limited in the amount of heat it can dissipate, and battery life drops as heat increases. It doesn't matter if your mobile device is capable of incredible benchmark scores if your battery dies in only an hour or two. Mobile benchmarks don't factor in the power needed to achieve a certain level of performance. That's a huge oversight, because the best chip manufacturers spend incredible amounts of time optimizing power usage. Even though one processor might slightly underperform another in a benchmark, it could be far superior because it consumed half the power of the other chip (see the short performance-per-watt sketch after this list). You'd have no way to know this without expensive hardware capable of performing this type of measurement.
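To make the performance-per-watt point above concrete, here is a minimal sketch with made-up numbers; the scores and power figures are purely illustrative, not measurements of any real chip.

```python
# Performance-per-watt sketch with illustrative numbers: chip B scores 5% lower
# than chip A but draws half the power, so it is roughly twice as efficient.
def perf_per_watt(score, watts):
    return score / watts

chip_a = perf_per_watt(1000, 4.0)   # 250 points per watt
chip_b = perf_per_watt(950, 2.0)    # 475 points per watt
print(f"chip B delivers {chip_b / chip_a:.1f}x the performance per watt")   # ~1.9x
```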

 

  • Benchmarks rarely predict real-world performance — Many benchmarks favor graphics performance and have little bearing on the things real consumers do with their phones. For example, no one watches hundreds of polygons draw on their screens, but that's exactly the kind of thing benchmarks do. Even mobile gamers are unlikely to see increased performance on devices which score higher, because most popular games don't stress the CPU and GPU the same way benchmarks do. Benchmarks like GLBenchmark 2.5 focus on things like high-level 3D animations. One reviewer recently said, "Apple's A6 has an edge in polygon performance and that may be important for ultra-high resolution games, but I have yet to see many of those. Most games that I've tried on both platforms run in lower resolution with an up-scaling." For more on this topic, scroll down to the section titled "Case Study 2: Is the iPhone 5 Really Twice as Fast?" This video shows that the iPhone 5s is only slightly faster than the iPhone 5 when it comes to real-world tests. For example, the iPhone 5s starts up only 1 second faster than the iPhone 5 (23 seconds vs. 24 seconds), and it loads the Reddit.com site only 0.1 seconds faster. These differences are so small it's unlikely anyone would even notice them. Would you believe the iPhone 4 shuts down five times faster than the iPhone 5s? It's true (4 seconds vs. 21.6 seconds). Another video shows that even though the iPhone 5s does better on most graphics benchmarks, when it comes to real-world things like scrolling a webpage in the Chrome browser, Android devices scroll significantly faster than an iPhone 5s running iOS 7. See for yourself in this video.

 

The iPhone 5s appears to do well on graphics benchmarks until you realize that Android phones have almost 3x the pixels

  • Some benchmarks penalize devices with more pixels — Most graphics benchmarks measure performance in terms of frames per second. GFXBench (formerly GLBenchmark) is the most popular graphics benchmark, and Apple has dominated its scores for one simple reason: the displays in Apple's iPhone 4, 4S, 5 and 5s all have a fraction of the pixels flagship Android devices have. For example, in the chart above, the iPhone 5s gets a score of 53 fps, while the LG G2 gets a score of 47 fps. Most people would be impressed that the iPhone 5s scored 12.7% higher than the LG G2, but when you consider that the LG G2 is pushing almost 3x the pixels (2,073,600 pixels vs. 727,040 pixels), it's clear the Adreno 330 GPU in the LG G2 is actually killing the GPU in the iPhone 5s (a short pixel-throughput calculation follows the chart below). The GFXBench scores on the 720p Moto X (shown above) are further proof of this. This bias against devices with more pixels isn't unique to GFXBench; you can see the same behavior with graphics benchmarks like Basemark X, shown below (where the Moto X beats the Nexus 4).
More proof that graphics benchmarks favor devices with lower-res displays
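Using the numbers quoted above, the same comparison can be expressed as pixel throughput (frames per second multiplied by pixels per frame). This is only a back-of-the-envelope calculation, not something the benchmark itself reports:

```python
# Pixel throughput from the scores and resolutions quoted above.
IPHONE_5S = (53, 1136 * 640)      # 53 fps at 727,040 pixels
LG_G2     = (47, 1920 * 1080)     # 47 fps at 2,073,600 pixels

for name, (fps, pixels) in [("iPhone 5s", IPHONE_5S), ("LG G2", LG_G2)]:
    print(f"{name}: {fps * pixels / 1e6:.1f} million pixels per second")
# iPhone 5s: ~38.5 Mpixels/s, LG G2: ~97.5 Mpixels/s -> roughly 2.5x the pixel throughput
```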

  • Some popular benchmarks are no longer relevant — SunSpider is a popular JavaScript benchmark that was designed to compare different browsers. However, according to at least one expert, the data that SunSpider uses is small enough that it has become more of a cache test. That's one reason why Google came out with their V8 and Octane benchmark suites, both of which are better JavaScript tests than SunSpider. According to Google, Octane is based on a set of well-known web applications and libraries, which means "a high score in the new benchmark directly translates to better and smoother performance in similar web applications." Even though it may no longer be relevant as an indicator of JavaScript browsing performance, SunSpider is still quoted by many bloggers. SunSpider isn't the only popular benchmark with issues; this blogger says BrowserMark also has problems.
SunSpider is a good example of a benchmark which may no longer be relevant — yet people continue to use it

  • Benchmark scores are not always repeatable – In theory, you should be able to run the same benchmark on the same phone and get the same results over and over, but this doesn't always happen. If you run a benchmark immediately after a reboot and then run the same benchmark during heavy use, you'll get different results. Even if you reboot every time before you benchmark, you'll still get different scores due to memory allocation, caching, memory fragmentation, OS housekeeping and other factors like throttling.

    Another reason you'll get different scores on devices running exactly the same mobile processor and operating system is that different devices have different apps running in the background. For example, Nexus devices have far fewer apps running in the background than non-Nexus, carrier-issued devices. Even after you close all running apps, there are still apps running in the background that you can't see, yet these apps consume system resources and can have an effect on benchmark scores. Some apps run automatically to perform housekeeping for a short period and then close. The number and types of apps vary greatly from phone to phone and platform to platform, and this makes objective testing of one phone against another difficult.

    Benchmark scores also sometimes change after you upgrade a device to a new operating system, which makes it difficult to compare two devices running different versions of the same OS. For example, the Samsung Galaxy S III running Android 4.0 gets a Geekbench score of 1560, while the same exact phone running Android 4.1 gets a Geekbench score of 1781. That's a 14% increase. Android 4.4 causes many benchmark scores to increase, but not in all cases. For example, after moving to Android 4.4, Vellamo 2 scores drop significantly on some devices because it can't make use of some aspects of hardware acceleration due to Google's changes.

    Perhaps the biggest reason benchmark scores change over time is that they stress the processor, increasing its temperature. When the processor temperature reaches a certain level, the device starts to throttle, or reduce power. This is one of the reasons scores on benchmarks like AnTuTu change when they are run several times in a row. Other benchmarks have the same problem. In this video, the person testing several phones gets a Quadrant Standard score on the Nexus 4 of 4569 on the first run and 4826 on a second run (skip to 14:25 to view).
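One simple way to deal with this run-to-run spread is to run a benchmark several times and report the median and the spread rather than a single score. A minimal sketch, using the two Quadrant runs quoted above as example data:

```python
# Summarize repeated benchmark runs instead of trusting a single score.
# The two scores below are just the example runs quoted above; in practice you
# would collect 5-10 runs, letting the device cool down between them.
from statistics import mean, median, pstdev

scores = [4569, 4826]
spread_pct = 100.0 * pstdev(scores) / mean(scores)
print(f"median = {median(scores):.0f}, run-to-run spread = {spread_pct:.1f}% of the mean")
```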

  • Not all mobile benchmarks are cross-platform — Many mobile benchmarks are Android-only and can't help you compare an Android phone to the iPhone 5. Here are just a few popular mobile benchmarks which are not available for iOS and other mobile platforms: AnTuTu Benchmark, Octane, Neocore, NenaMark, Quadrant Standard and Vellamo.
  • Some benchmarks are not yet 64-bit — Android 5.0 supports 64-bit apps, but most benchmarks do not run in 64-bit mode yet. There are a few exceptions: some Java-based benchmarks (Linpack, Quadrant) run in 64-bit mode and do see performance benefits on systems with a 64-bit OS and processor. AnTuTu also supports 64-bit.
  • Mobile benchmarks are not time-tested — Most mobile benchmarks are relatively new and not as mature as the benchmarks used to test Macs and PCs. The best computer benchmarks are real-world, relevant and produce repeatable scores. There is some encouraging news in this area, however, now that 3DMark is available for mobile devices. It would be nice if someone ported other time-tested benchmarks like SPECint to iOS as well.
Existing benchmarks don't accurately measure storage performance on things like video playback

  • Inaccurate measurement of memory and storage performance — There is evidence that existing mobile benchmarks do not accurately measure the impact of faster memory or storage performance. Examples above and below. MobileBench is supposed to address this issue, but it would be better if there were a reliable benchmark that was not partially created by memory suppliers like Samsung.
Existing benchmarks don't accurately measure the impact of memory speed or throughput

  • Inaccurate measurement of the heterogeneous nature of mobile devices — Only 15% of a mobile processor is the CPU. Modern mobile processors also have DSPs, image processing cores, sensor cores, audio and video decoding cores, and more, but not one of today's mobile benchmarks can measure any of this. This is a big problem.

Case Study 1: Is the New iPad Air Really 2-5x as Fast As Other iPads?

There have been a lot of articles lately about the benchmark performance of the new iPad Air. The writers of these articles truly believe that the iPad Air is dramatically faster than any other iPad, but most real-world tests don't show this to be true. This video compares 5 generations of iPads.

Benchmark tests suggest the iPad Air should be much faster than previous iPads

Results of side-by-side video comparisons between the iPad Air and other iPads:

  • Test 1 – Start Up – The iPad Air started up 5.73 seconds faster than the iPad 1. That's 23% faster, yet the Geekbench 3 benchmark suggests the iPad Air should be over 500% faster than an iPad 2. I would expect the iPad Air to be more than 23% faster than a product that came out 3 years and 6 months ago. Wouldn't you?
  • Test 2 – Page load times – The narrator claims the iPad Air's new MIMO antennas are part of the reason it loads webpages so much faster. First off, MIMO antennas are not new in mobile devices; they were in the Kindle Fire HD two generations ago. Second, Apple's MIMO implementation apparently isn't very effective, because if you freeze-frame the video just before 1:00, you'll see the iPad 4 clearly loads all of the text on the page before the iPad Air. All of the images on the webpage load on the iPad 4 and the iPad Air at exactly the same time, even though browser-based benchmarks suggest the iPad Air should load web pages much faster.
  • Test 3 – Video Playback – On the video playback test, the iPad Air was no more than 15.3% faster than the iPad 4 (3.65s vs. 4.31s).

Reality: Although most benchmarks suggest the iPad Air should be 2-5x faster than older iPads, at best the iPad Air is only 15-25% faster than the iPad 4 in real-world usage, and in some cases it is no faster.

Final Thoughts

You should never make a purchasing decision based on benchmarks alone. Most popular benchmarks are flawed because they don't predict real-world performance and they don't take power consumption into consideration. They measure your mobile device in a way that you never use it: running all-out while it's plugged into the wall. It doesn't matter how fast your mobile device can operate if your battery only lasts an hour. For this reason, top benchmarking bloggers like AnandTech have stopped using the AnTuTu, BenchmarkPi, Linpack and Quadrant benchmarks, but they still continue to propagate the myth that benchmarks are an indicator of real-world performance. They claim they use them because they aren't subjective, but then they mislead their readers about their often meaningless nature.

Some benchmarks do have their place, however. Even though they are far from perfect, they can be useful if you understand their limitations. But you shouldn't read too much into them. They are just one indicator, along with product specs and side-by-side real-world comparisons between different mobile devices.

Bloggers should spend more time measuring things that actually matter, like start-up and shutdown times, Wi-Fi and mobile network speeds in controlled, reproducible environments, game responsiveness, app launch times, browser page load times, task switching times, actual power consumption on standardized tasks, touch-panel response times, camera response times, audio playback quality (S/N, distortion, etc.), video frame rates and other things that are related to the ways you use your device.
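Some of these real-world measurements are easy to automate. As one example, here is a minimal sketch that times a cold app launch over adb using "am start -W", which prints a TotalTime value in milliseconds; the package and activity names below are placeholders you would replace with the app you want to measure.

```python
# Sketch: measure cold app launch time via "adb shell am start -W".
# The component name is a placeholder; a device must be connected over adb.
import re
import subprocess

COMPONENT = "com.example.app/.MainActivity"   # hypothetical package/activity

def launch_time_ms():
    package = COMPONENT.split("/")[0]
    subprocess.run(["adb", "shell", "am", "force-stop", package])          # force a cold start
    out = subprocess.run(["adb", "shell", "am", "start", "-W", "-n", COMPONENT],
                         capture_output=True, text=True).stdout
    match = re.search(r"TotalTime:\s*(\d+)", out)
    return int(match.group(1)) if match else None

print(launch_time_ms(), "ms")
```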

Although most of today’s mobile benchmarks are flawed, there is some hope for the future. Broadcom, Huawei, OPPO, Samsung Electronics and Spreadtrum recently announced the formation of MobileBench, a new industry consortium formed to provide more effective hardware and system-level performance assessment of mobile devices. They have a proposal for a new benchmark that is supposed to address some of the issues I’ve highlighted above. You can read more about this here.

A Mobile Benchmark Primer

If you are wondering which benchmarks are the best, and which should not be used, this article should be of use.

Benchmarks like this one suggest the iPhone 5 is twice as fast as the iPhone 4S.

Case Study 2: Is the iPhone 5 Really Twice as Fast?

Note: Although this section was written about the iPhone 5, it applies equally to the iPhone 5s. Like the iPhone 5, experts say the iPhone 5s is twice as fast in some areas, yet most users will notice little if any difference that is related to hardware alone. The biggest differences are related to changes in iOS 7 and the new registers in the A7.

Apple and most tech writers believe the iPhone 5’s A6 processor is twice as fast as the chip in the iPhone 4S. Benchmarks like the one in the above chart support these claims. This video tests these claims.

In tests like this one, the iPhone 4S beats the iPhone 5 when benchmarks suggest it should be twice as slow.

Results of side-by-side comparisons between the iPhone 5 and the iPhone 4S:

  • Opening the Facebook app is faster on the iPhone 4S (skip to 7:49 to see this).
  • The iPhone 4S also recognizes speech much faster, although the iPhone 5 returns the results of a query faster (skip to 8:43 to see this). In a second test, the iPhone 4S once again beats the iPhone 5 in speech recognition and almost ties it in returning the answer to a math problem (skip to 9:01 to see this).
  • App launch times vary; in some cases the iPhone 5 wins, in others the iPhone 4S wins.
  • The iPhone 4S beats the iPhone 5 easily when SpeedTest is run (skip to 10:32 to see this).
  • The iPhone 5 does load web pages and games faster than the iPhone 4S, but it's nowhere near twice as fast (skip to 12:56 on the video to see this).

I found a few other comparison videos like this one, which show similar results. As the video says, "Even with games like "Wild Blood" (shown in the video at 5:01) which are optimized for the iPhone 5's screen size, looking closely doesn't really reveal anything significant in terms of improved detail, highlighting, aliasing or smoother frame-rates." He goes on to say, "the real gains seem to be in the system RAM which does contribute to improved day to day performance of the OS and apps."

So the bottom line is: although benchmarks predict the iPhone 5 should be twice as fast as the iPhone 4S, in real-world tests the difference between the two is not that large, and it is partially due to the fact that the iPhone 5 has twice as much memory. In some cases, the iPhone 4S is actually faster, because it has fewer pixels to display on the screen. The same is true for tests of the iPad 4, which reviewers say "performs at least twice as fast as the iPad 3." However, when it comes to actual game play, the same reviewer says, "I couldn't detect any difference at all. Slices, parries and stabs against the monstrous rivals in Infinity Blade II were fast and responsive on both iPads. Blasting pirates in Galaxy on Fire HD 2 was a pixel-perfect exercise on the two tablets, even at maximum resolution. And zombie brains from The Walking Dead spattered just as well on the iPad 3 as the iPad 4."

– Rick

Copyright 2012-2014 Rick Schwartz. All rights reserved. This article includes the opinions of the author and does not reflect the views of his employer. Linking to this article is encouraged.

Follow me on Twitter @mostlytech1