My parcel has just arrived. To say that I am excited would be an understatement; I’ve been waiting for this, to the point where I was almost unhappy about enjoying a public holiday, because the courier wouldn’t come. Even worse, my contact at NVIDIA sent me an email letting me know that the parcel was on its way, and included a photo. I was salivating. That was on a Friday. Now, Monday morning, I can finally open the box. But wait… Before I do, there is a lot I need to talk about. What exactly is the NVIDIA Jetson TK1? This isn’t an ordinary board, far from it. NVIDIA sells the Tegra series as mobile superchips, if not supercomputers, and that will take quite some explaining.
This review will be in two parts: first, I’ll explain a little about NVIDIA and their graphics cards, and how a high-end graphics card suddenly becomes a chip used in tablets. The second part will be about the Jetson TK1 itself, including more technical details.
To know where we are, we need to know where we come from, and the Tegra has a fascinating history.
In the wonderful world of computer manufacturers, if you ask most people who designs microchips, the automatic response will be Intel or AMD. These are the heart of the computer; without the CPU, the computer simply won’t function. That is a particularly unfair statement, since a computer is far more than simply a CPU, and besides, it also means that graphics card manufacturers are forgotten.
You might go out and buy a CPU, but no-one buys a graphics chip; they buy graphics cards, and this might be why people never think about companies like NVIDIA. Yet NVIDIA has been making chips for a long time now – in 1999, they designed and manufactured the NV10, a graphics chip that blew the competition out of the water. It was the very first step in a change that would transform computer graphics forever, and so the GeForce series was born. NVIDIA has been designing and producing chips ever since, anticipating market needs and providing some of the fastest hardware available.
From past to present
The role of a graphics chip is a simple one – put stuff onto the screen. If I want a window to be displayed, I tell the graphics card that I want a rectangle from here to here, and to do that as fast as it can. That was the world of graphics cards over a decade ago, but things have changed quite a bit. At the time, graphics cards were all about pixels; the CPU would calculate what should be put on the screen, and transmit that information to the graphics card. The graphics card would read its memory, and print each pixel to the monitor, over and over again. The limits were quickly reached; some graphics cards were faster than others, but all the hard work was being done by the CPU. Graphics cards were about to change.
Rendering a 3D scene, while mathematically complex, is logically simple. The same calculation is done over and over again, until the final scene is finished. In essence, each pixel is calculated using complex formulae, and then the process is repeated. These devices were no longer simple chips; they were processors in their own right, and were now called GPUs, short for Graphics Processing Units.
GPUs were designed to be able to run parallel calculations, after all, if rendering each pixel is done the same way, then why not render 10 at the same time? Or 100? And so it began.
This happened at a time when our computing needs largely outstripped our real computing power; projects like SETI@home were looking for volunteers to help them scour the sky looking for friends. You could download a small client onto your computer, and it would run calculations on the CPU when your computer wasn’t in use. A few people thought about this, and looked at their systems. While the CPU was running at 100%, the graphics card would be idle, or at most rendering a small 3D scene as a screensaver – why couldn’t the graphics card be used to help calculate? Some tinkerers got to work, and companies like NVIDIA listened closely. Some GPUs were suddenly called GPGPUs: General-Purpose GPUs, capable of doing more than render 3D objects.
Supercomputing on a card
NVIDIA developed the CUDA technology, short for Compute Unified Device Architecture. Designed specifically for parallel computing, it has an impressive number of functions for graphical processing, but can also be used very effectively in numerous other fields. It does require special code, and it does require special compilers, but the result is well worth the effort. Those who are old enough to remember what life was like back in the days of single cores will know what I’m talking about; my Pentium 3 computer had a motherboard that could accept two CPUs, both running in parallel. In theory, it meant that the computer could do twice as much work, since both CPUs would be busy with their own tasks. My next computer was based on the Pentium 4, which had a technique known as Hyper-Threading; it was possible to run two threads at the same time, giving something similar to my previous two-CPU computer. Today, my CPU has 8 cores, theoretically able to run 8 calculations in parallel. I say theoretically, since the operating system is also running, but what if that weren’t true? What if the processor really did do 8 calculations in parallel? Imagine applying a filter to a digital image. My wallpaper is in Full HD, or 1920×1080 pixels, for a total of just over two million pixels. Having to calculate each pixel one at a time is time consuming; calculating 8 at the same time is faster, since it is the same mathematical formula that is applied each time. In reality, the CPU has some advanced functions that allow it to do far more than 8 pixels at a time using arrays, or digital signal processing, but despite its speed, it is blown out of the water by my graphics card.
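To make the idea concrete, here is a small sketch of that image-filter example (my own illustration in Python with NumPy, not anything NVIDIA ships): the same luminance formula applied one pixel at a time versus all pixels in one step. The second form is exactly the shape of work a GPU spreads across its cores.

```python
import numpy as np

# A hypothetical Full-HD wallpaper: 1080 rows x 1920 columns of RGB pixels.
image = np.random.randint(0, 256, size=(1080, 1920, 3), dtype=np.uint8)

def brightness_serial(img):
    """One pixel at a time - the way a lone CPU core would do it."""
    out = np.empty(img.shape[:2], dtype=np.float64)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            r, g, b = img[y, x]
            out[y, x] = 0.299 * r + 0.587 * g + 0.114 * b
    return out

def brightness_parallel(img):
    """The same formula applied to all two million pixels in one operation."""
    return img @ np.array([0.299, 0.587, 0.114])
```

Both functions compute exactly the same result; the only difference is how many pixels are in flight at once, which is the whole point of having hundreds of CUDA cores.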
I have an NVIDIA GTX 970. While most people immediately ask “Hey, that’s awesome, what games do you play?” they are often disappointed by my answer. I don’t play that much, but I’ll admit that when I do, I have a lot of fun with this card. Remember what I said about CUDA? This card supports CUDA, and has a total of 1664 CUDA cores, on a single card. For my previous example, that means 1664 pixels calculated in parallel, practically one line per calculation. Of course, this is a consumer card, albeit a high-end one, but much more advanced systems exist. NVIDIA’s Tesla K80 has 4992 CUDA cores, and it is designed solely as a calculator; you won’t find a DVI or HDMI connector on it, despite its origins deep in the graphics card field. These cards are known as accelerators, and can run some calculations at 20 times the speed of the CPU alone.
The technology itself is awesome, but it comes at a price. Not just a price tag; the technology itself. In order to make use of CUDA cores, the system must have a PCI Express slot, and a power supply beefy enough to power this beast. For some applications, this will be fine as it is; a single modern computer can have more performance than some of the best supercomputers from only a decade ago. The problem is, heavy calculation is needed everywhere, not just on the desktop. The technology is being used in self-driving cars, but do I really need an ATX case in the car just to get access to it? No. Not anymore.
Supercomputing on a chip
Mobile devices need to be at least two things – small, and efficient. A mobile telephone cannot include a PCI-Express slot, and most certainly cannot run at 150W. The GPU is itself a chip, but it can’t be used like a normal processor; you won’t find an operating system for your graphics card, and just taking out the GPU and putting it on an electronics board won’t do the trick, it simply isn’t designed for that. It is designed to receive instructions, to make horrendously complex calculations look like a breeze, and to give the results back to another processor. So how do you create a CPU with CUDA cores? The answer comes from Cambridge, in the United Kingdom.
ARM is a British company that has just celebrated a milestone: over 50 billion processors have been sold. Ironically, ARM doesn’t make processors, not one. They design processor IP, and license that IP out to companies. You will find ARM processors being made by Atmel, Silicon Labs, STMicroelectronics and Samsung, to name but a few. They have over 1000 partners. ARM chips can be found everywhere; if you have a mobile telephone, the chances are it is powered by an ARM chip. ARM’s strategy is interesting; they design the computational core, and let silicon manufacturers add the hardware that they want. This is normally memory controllers, communication peripherals or other such hardware, but in the case of NVIDIA, they put their Kepler GPU with CUDA cores right onto an ARM chip, and called it the Tegra series. The ARM processor has full support for operating systems, and drivers can be made to access the computation cores. The resulting Tegra is small enough and efficient enough to be placed on mobile devices like mobile phones and tablets, and if you can get your hands on an NVIDIA Shield Tablet, have a look at the graphics performance.
NVIDIA Jetson TK1 Review
When you design an electronic component, you need to show the world how good it is. There are several ways to do this. One is by reputation alone; NVIDIA is an industry heavyweight, so everyone is watching. Documentation helps, and again, NVIDIA has provided lots of it, but that still isn’t enough. To let engineers get a feel for your component, you need to provide a solution that works, one that can be used to test the device, and if possible, actually be used for the first few projects – a development kit. To show off the power of the Tegra K1, NVIDIA has provided the NVIDIA Jetson TK1 development kit, and mine has just arrived.
It all starts here. A simple box, with a wide NVIDIA seal. As with most NVIDIA products, the design is extremely well thought out, and even if that isn’t the most important factor, first impressions are important. First you have the Jetson TK1 itself, and hidden underneath are the power supply, cables and rubber feet.
So let’s have a look at the specifications on NVIDIA’s website:
- Tegra K1 SOC
- NVIDIA Kepler GPU with 192 CUDA cores
- NVIDIA 4-Plus-1™ quad-core ARM® Cortex-A15 CPU
- 2 GB memory
- 16 GB eMMC
- Gigabit Ethernet
- USB 3.0
- HDMI 1.4
- Line out/Mic in
- RS232 serial port
- Expansion ports for additional display, GPIOs, and high-bandwidth camera interface
- Power supply and cables
- Micro USB to USB cable
Those are indeed some pretty impressive specifications. Two gigabytes is a considerable amount of memory for most evaluation boards, and is more than enough for a complete operating system, and for multitasking scientific applications. The 16 gigabytes of flash memory is also more than enough for evaluation purposes, and even for prototyping complete applications. With gigabit Ethernet and USB 3.0, the board is just begging to get data, process it and send it out.
There have been complaints about the single USB port, but I don’t agree. If you need more USB, then you should be using a USB hub for development; when the board is deployed, you rarely need more than one USB port anyway. Also, this isn’t a single-board computer, even if it can be used as one. While testing, the Jetson TK1 was connected to a wireless keyboard/mouse combo on a single USB dongle.
The addition of SATA and a mini-PCIe port means that expansion won’t be a problem. The SATA port will let you add just about any size of hard drive for data storage, and the mini-PCIe port can add wireless capabilities; I suspect that it can also be used as a more robust solid-state storage option.
The device comes with a beefy 12V 5A power supply, but the device can use far less; that extra power is used mainly for the mini-PCIe port, the SATA port with included power connector, and the USB port.
First boot is always an exciting process, and on the Jetson TK1, I was surprised. The board booted me into a console, not into a graphical environment as I thought it would. That is just as well; it allowed me to play a bit inside the console to get a feel for the system. This is a stock Ubuntu 14.04 for ARM, not a specific environment that forces you to recompile everything. Want a package? Just tell the package manager to add it; the entire Ubuntu ARM repository is at your disposal. Nice.
There is an install folder, and installation is done by simply running a bash script. The graphical environment is ready, time to fire it up! Unfortunately, there isn’t a lot to see; the board does not come pre-installed with software to show off the board’s capabilities, but they aren’t too far away.
I expected the Jetson TK1 to perform well, but nothing prepared me for what I was about to see. First I tried a “simple” WebGL page; my favourite is the aquarium at http://webglsamples.org/aquarium/aquarium.html. At 50 fish, my desktop graphics card (a GTX 970) under Ubuntu 14.04 maxes out at 60fps, the refresh rate for my Full-HD monitor, begging for more fish. Ten times that, and I’m still at 60fps. At 2000, I get a drop, running at close to 50fps, and with 4000 fish, my card can display anywhere between 30 and 40fps.
With 50 fish, the Jetson is also at 60fps, once again, begging for more. 500 fish does give a performance hit compared to the desktop, at around 40fps. With 4000 fish, the Jetson TK1 does a very respectable 20fps.
The Jetson handled every single WebGL example I tried admirably, never showing a scene that wasn’t fluid enough to display in Full-HD. There are quite a few desktop computers that can’t handle this, and it isn’t something you’d expect from a mobile processor.
As impressive as that is, it is “only” a “basic” test. It does show off 3D performance, but only with generic code, nothing compiled and optimized for the Tegra processor, which makes the results all the more impressive.
I find CUDA to be a very exciting technology, one that isn’t understood well enough. It allows complex mathematical calculations to be performed in parallel, which is useful when simulating many interacting bodies. Known as N-body simulation, each body (or particle) interacts with every other particle. You might have seen a screensaver of particles that interact through each other’s gravity, forming spirals that look like galaxies. While they are pretty, the science behind them is actually used in astrophysics, and complex calculations can determine the creation of a galaxy, and what exactly will happen to it in a few million years.
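The core of such a simulation is simple to state. As a sketch (my own, in Python rather than CUDA, with made-up masses and time step), one naive all-pairs step looks like this; on a GPU, each body’s loop iteration would be handed to its own thread:

```python
import numpy as np

G = 6.674e-11   # gravitational constant
DT = 1.0        # time step in seconds (arbitrary for this sketch)

def nbody_step(pos, vel, mass):
    """Advance every body once, summing the pull of every other body.

    pos, vel: (N, 3) arrays; mass: (N,) array. O(N^2) interactions -
    exactly the kind of repetitive arithmetic CUDA cores excel at.
    """
    n = len(mass)
    acc = np.zeros_like(pos)
    for i in range(n):                 # on a GPU: one thread per body
        diff = pos - pos[i]            # vectors from body i to all others
        dist = np.linalg.norm(diff, axis=1)
        dist[i] = np.inf               # a body does not attract itself
        acc[i] = (G * mass[:, None] * diff / dist[:, None] ** 3).sum(axis=0)
    vel = vel + acc * DT
    return pos + vel * DT, vel

# Example: three bodies with random positions, starting at rest.
rng = np.random.default_rng(0)
pos, vel = rng.normal(size=(3, 3)), np.zeros((3, 3))
pos, vel = nbody_step(pos, vel, np.full(3, 1e20))
```

The demo on the Jetson does this kind of work for tens of thousands of bodies per frame, in parallel rather than in a Python loop.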
Of course, that screensaver doesn’t use highly complex functions; it uses functions that are fast, and that can display quickly. There is no scientific benefit, only eye candy. The Jetson TK1, on the other hand, comes with a true demonstration of N-body simulation, one that is complex and precise. It also has a few interesting options. It allows you to simulate a galaxy with gravitational attraction, composed of 20224 bodies, a number that makes most scientists go pale. It runs at a very respectable 15fps (until now, I’ve never seen this on a computer in anything even close to real-time). It also displays the calculation rate, and I had to check that figure a few times to make sure. My NVIDIA Jetson TK1 was running the simulation at 124 GFLOPS. One hundred and twenty-four thousand million floating-point operations per second. To put that figure into perspective, each node of the NEC SX-6 supercomputer could calculate up to 64 GFLOPS. A node was the size of a large refrigerator, and I don’t dare imagine how much it cost to run, let alone the price to buy one.
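A back-of-the-envelope check (my own estimate, not NVIDIA’s published arithmetic) shows where a number like that can come from: with 20224 bodies, every body interacting with every other one, 15 frames per second, and a commonly assumed 20 floating-point operations per interaction:

```python
bodies = 20224
fps = 15
flops_per_interaction = 20   # a typical assumption for N-body benchmarks

# All-pairs interactions per frame, times the cost of each, times the frame rate.
flops = bodies ** 2 * flops_per_interaction * fps
print(f"{flops / 1e9:.0f} GFLOPS")   # prints: 123 GFLOPS
```

That lands right in the neighbourhood of the 124 GFLOPS the demo reported, which makes the on-screen figure entirely believable.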
To show off a subtle mix of 3D and CUDA, another application is provided, once again using N-body simulation, but one that is far too much fun. 16384 individual balls are simulated in a closed 3D environment. There are several options; one to release a cube containing all the balls, one to simulate a sphere of balls falling onto the rest, and my personal favourite, one where the cursor becomes a huge ball, and you get to whack the others, and watch the result. That must have taken me an hour. So what’s the point? To have fun? Why not, after all, making applications fun makes for a great learning experience, but that isn’t the point. This program highlights the fact that simulations can be run in real-time, making things much easier for the people using them.
I failed. I admit defeat. Let me explain.
Coming from NVIDIA, you expect the Tegra K1 to have on-board video decoding, and of course, it has everything you need. It isn’t hard to set up an XBMC player on the board, and so I did. Time to test the video playback, and see just how far the board can go.
Video decoding has been big business for some time now. Gone are the days of the good old VHS cassette; it is all about digital today! Back then, video decoding was simple: there wasn’t any. A VHS cassette held the entire image as an analogue signal, but at only around 250 lines of resolution. Things have changed quite a bit since then. Full HD uses just over 2 million pixels per image, so it isn’t possible to record every single pixel in every single frame; some sort of video compression must be used. There is always a slight loss, but for most applications, it is imperceptible. It is now all about bandwidth; the more bandwidth, the more data gets sent to the decoder every second, and therefore the higher the quality. Some scenes don’t require a lot of bandwidth; if the protagonist isn’t moving, and is only talking in front of a static background, then minimal data is sent. On the other hand, if you are filming a scene with a lot of random movement, then the bitrate goes up – in extreme cases, higher than what the decoder can accept. This is sometimes the case for scenes of sunlight reflecting on water, where the movement changes constantly and cannot be mathematically predicted.
The Blu-ray specifications define the standard speed, or 1x speed, as 36Mbit/s. This is what you will see for most Blu-Ray discs, and what every player can accept. 3D Blu-Ray requires twice the bandwidth, or 72Mbit/s. Future 4K Blu-Rays will come in three sizes, and the highest data-rate will be 128Mbit/s for 100GB discs encoded for 4K UHD, or 3840×2160, a phenomenal data-rate that won’t be in players until at least Christmas 2015.
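To put those bitrates into perspective, a quick calculation (my own, using the figures above and decimal gigabytes) shows how long a disc lasts at a constant bitrate:

```python
def playback_minutes(disc_gb, mbit_per_s):
    """Minutes of video a disc holds at a constant bitrate."""
    bits = disc_gb * 1e9 * 8              # disc capacity in bits
    return bits / (mbit_per_s * 1e6) / 60

# A 50 GB dual-layer Blu-ray at the standard 36 Mbit/s:
print(round(playback_minutes(50, 36)))    # prints: 185 (minutes)
# A 100 GB 4K disc at the maximum 128 Mbit/s:
print(round(playback_minutes(100, 128)))  # prints: 104 (minutes)
```

So even at its very top rate, a 100 GB disc still holds a feature-length film, which is exactly why the standard stops there.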
So, how fast does the Jetson TK1 decode? Well, to be honest, I don’t know. I tried a 40Mbit/s file, and no surprises, the Jetson TK1 handles it perfectly. It can decode Blu-ray quality videos without a problem. I pushed that up to 60, and once again, it performed perfectly. 80? Yup. So I pushed it up to 120Mbit/s. The Jetson TK1 decoded it perfectly, while looking at me smugly, asking “What, is that all you’ve got?”. Well, yes, that is all that I’ve got; I literally do not have anything faster. I thought that I might have a source problem, so I ran the same test on an XBMC-based Raspberry Pi, and it froze so badly I had to hard reset it to get it back to the title screen.
So the answer is: I don’t know how far the Tegra K1 can go, because it blew every single test I had out of the water. There wasn’t even a hint of lag, even with bit-rates that it isn’t supposed to support yet.
Put simply, I’m in love with the Jetson TK1, and with the entire Tegra series. That being said, this isn’t a magical processor by any means. From the benchmarks I’ve seen, this isn’t the fastest ARM core around; it is overtaken by a few chips, but not by much. Then again, this chip isn’t about the ARM core. The ARM core can run an operating system well enough, and fast enough, with enough resources to get the job done. When you do need more power, that is when the graphics core comes into play.
Graphically, it is the fastest ARM-based processor I’ve ever seen by a long shot, and I’m not surprised that this chip is used in NVIDIA’s tablet and their games console. If you need fast graphics, the Tegra is an excellent choice, with OpenGL support right where you need it.
Electrically, this isn’t the lowest-power chip ever, but that is to be expected: more transistors, more gate switching, and therefore more power. It might, however, be the most energy-efficient processor I’ve seen, and there is a difference. I wasn’t able to do serious power benchmarking, but I’m pretty sure it has the best FLOPS-per-watt ratio I’ve come across.
What it does have is CUDA. CUDA deserves a lot more love, or maybe a little more understanding about what it really can do. Capable of performing 192 parallel calculations, it won’t automatically speed up your programs. You will have to rewrite portions of your program to make the most of CUDA, but the effort required versus the huge performance boost makes it worthwhile. Also, not every program can benefit from CUDA; you won’t compile your kernel any faster, and when working on single values, there is no speed boost, but when running repetitive calculations on large datasets, CUDA is worth every single second spent developing.
This isn’t an NVIDIA Tesla; it “only” has 192 CUDA cores, but that’s 192 cores that can be used in a mobile environment. Students have come up with self-driving model cars, robots that walk and that can sense their environment and avoid obstacles, and some of the geekiest things I’ve seen for quite some time. And for professionals? Tesla (the car maker, not the accelerator) uses Tegra chips extensively in their designs, for infotainment, navigation and as a super-calculator, capable of driving the car by itself. Not bad for a mobile chip sold in an FCBGA package. Only a decade ago, this was achieved by computers the size of a large server rack, and now it fits on my lap, and runs off batteries.
So this begs the question: who would need such a board, and indeed who would need such a chip? If you are looking for a simple ARM computer to run Ubuntu on, then don’t bother; there are cheaper designs on the market, and that isn’t what the Jetson is about. If you are looking for a great graphics solution, either ultra-fast video decoding or lightning-fast 3D, then the Jetson TK1 is an excellent choice. Not only does it happily decode today’s standards, but since I wasn’t able to push it to its limits, I’d wager that this device is future-proof, or at least will be for quite some time.
People who need serious mobile calculation will be the most interested. The Jetson TK1 has two camera ports, and enough computing power to do intensive calculation on what the board can see. The Tegra is used extensively in self-driving cars, and I can see why.