A Full Hardware Guide to Deep Learning

2015-03-09 by Tim Dettmers

Deep Learning is very computationally intensive, so you will need a fast CPU with many cores, right? Or is it maybe wasteful to buy a fast CPU? One of the worst things you can do when building a deep learning system is to waste money on hardware that is unnecessary. Here I will guide you step by step through the hardware you will need for a cheap, high-performance system.

In my work on parallelizing deep learning I built a GPU cluster, for which I needed to make careful hardware selections. Despite careful research and reasoning I made my fair share of mistakes when I selected the hardware parts, mistakes which often only became clear to me when I used the cluster in practice. Here I want to share what I have learned so that you do not step into the same traps as I did.

GPU

This blog post assumes that you will use a GPU for deep learning. If you are building or upgrading your system for deep learning, it is not sensible to leave out the GPU. The GPU is the heart of deep learning applications – the improvement in processing speed is simply too huge to ignore.

I talked at length about GPU choice in my previous blog post, and the choice of your GPU is probably the most critical choice for your deep learning system. Generally, I recommend a GTX 680 from eBay if you lack money; a GTX Titan X (if you have the money; for convolution) or a GTX 980 (very cost effective, but a bit limited for very large convolutional nets) as the best current GPUs; and a GTX Titan from eBay if you need cheap memory. I recommended the GTX 580 before, but due to new updates to the cuDNN library, which increase the speed of convolution dramatically, all GPUs that do not support cuDNN have become obsolete — the GTX 580 is such a GPU. If you do not use convolutional nets at all, however, the GTX 580 is still a solid choice.

Suspect line-up

Can you identify the hardware part which is at fault for bad performance? One of these GPUs? Or maybe it is the fault of the CPU after all?

CPU

To be able to make a wise choice for the CPU we first need to understand the CPU and how it relates to deep learning. What does the CPU do for deep learning? The CPU does little computation when you run your deep nets on a GPU, but your CPU does still work on these things:

  • Writing and reading variables in your code
  • Executing instructions such as function calls
  • Initiating function calls on your GPU
  • Creating mini-batches from data
  • Initiating transfers to the GPU

Needed number of CPU cores

When I train deep neural nets with three different libraries I always see that one CPU thread is at 100% (and sometimes another thread will fluctuate between 0 and 100% for some time). And this immediately tells you that most deep learning libraries – and in fact most software applications in general – just use a single thread. This means that multi-core CPUs are rather useless. If you run multiple GPUs however and use parallelization frameworks like MPI, then you will run multiple programs at once and you will need multiple threads also. You should be fine with one thread per GPU, but two threads per GPU will result in better performance for most deep learning libraries; these libraries run on one core, but sometimes call functions asynchronously for which a second CPU thread will be utilized. Remember that many CPUs can run multiple threads per core (that is true especially for Intel CPUs), so that one core per GPU will often suffice.

CPU and PCI-Express

It’s a trap! Some new Haswell CPUs do not support the full 40 PCIe lanes that older CPUs support – avoid these CPUs if you want to build a system with multiple GPUs. Also make sure that your processor actually supports PCIe 3.0 if you have a motherboard with PCIe 3.0.

CPU cache size

As we shall see later, CPU cache size is rather irrelevant further along the CPU-GPU pipeline, but I included a short analysis section anyway so that we make sure that every possible bottleneck along this pipeline is considered and so that we get a thorough understanding of the overall process.

CPU cache is often ignored when people buy a CPU, but generally it is a very important piece in the overall performance puzzle. The CPU cache is a very small amount of on-chip memory, very close to the CPU, which can be used for high-speed calculations and operations. A CPU often has a hierarchy of caches, which stack from small, fast caches (L1, L2) to slow, large caches (L3, L4). As a programmer, you can think of the cache as a hash table, where every entry is a key-value pair and where you can do very fast lookups on a specific key: If the key is found, one can perform fast read and write operations on the value in the cache; if the key is not found (this is called a cache miss), the CPU will need to wait for the RAM to catch up and will then read the value from there – a very slow process. Repeated cache misses result in significant decreases in performance. Efficient CPU caching procedures and architectures are often very critical to CPU performance.

How the CPU determines its caching procedure is a very complex topic, but generally one can assume that variables, instructions, and RAM addresses that are used repeatedly will stay in the cache, while less frequent items do not.

In deep learning, the same memory is read repeatedly for every mini-batch before it is sent to the GPU (the memory is just overwritten), but whether that memory can be kept in the cache depends on the mini-batch size. For a mini-batch size of 128, we have 0.4 MB and 1.5 MB for MNIST and CIFAR, respectively, which will fit into most CPU caches; for ImageNet, we have more than 85 MB for a mini-batch, which is much too large even for the largest cache (L3 caches are limited to a few MB).
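
These numbers follow directly from the mini-batch dimensions. As a rough check, assuming 32-bit floats, the usual 28x28 MNIST and 32x32x3 CIFAR inputs, and the 244x244x3 ImageNet crops used later in this post:

    128 \times 28 \times 28 \times 4 \,\text{bytes} \approx 0.4\,\text{MB (MNIST)}
    128 \times 32 \times 32 \times 3 \times 4 \,\text{bytes} \approx 1.5\,\text{MB (CIFAR)}
    128 \times 244 \times 244 \times 3 \times 4 \,\text{bytes} \approx 87\,\text{MB (ImageNet)}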

Because data sets in general are too large to fit into the cache, new data need to be read from the RAM for each new mini-batch – so there will be a constant need to access the RAM either way.

RAM memory addresses stay in the cache (the CPU can perform fast lookups in the cache which point to the exact location of the data in RAM), but this is only true if your whole data set fits into your RAM, otherwise the memory addresses will change and there will be no speed up from caching (one might be able to prevent that when one uses pinned memory, but as you shall see later, it does not matter anyway).

Other pieces of deep learning code – like variables and function calls – will benefit from the cache, but these are generally few in number and fit easily into the small and fast L1 cache of almost any CPU.

From this reasoning it is sensible to conclude that CPU cache size should not really matter, and the further analysis in the next sections is consistent with this conclusion.

Needed CPU clock rate (frequency)

When people think about fast CPUs they usually first think about the clock rate. 4 GHz is better than 3.5 GHz, or is it? This is generally true when comparing processors with the same architecture, e.g. “Ivy Bridge”, but it does not compare well between processors of different architectures, and it is not always the best measure of performance.

In the case of deep learning there is very little computation to be done by the CPU: Increase a few variables here, evaluate some Boolean expression there, make some function calls on the GPU or within the program – all these depend on the CPU core clock rate.

While this reasoning seems sensible, there is the fact that the CPU has 100% usage when I run deep learning programs, so what is the issue here? I did some CPU core rate underclocking experiments to find out.

CPU underclocking on MNIST and ImageNet: Performance is measured as time taken on 200 epochs MNIST or a quarter epoch on ImageNet with different CPU core clock rates, where the maximum clock rate is taken as a base line for each CPU. For comparison: Upgrading from a GTX 680 to a GTX Titan is about +15% performance; from GTX Titan to GTX 980 another +20% performance; GPU overclocking yields about +5% performance for any GPU

So why is the CPU usage at 100% when the CPU core clock rate is rather irrelevant? The answer might be CPU cache misses: The CPU is constantly busy with accessing the RAM, but at the same time it has to wait for the RAM to catch up with its slower clock rate, and this might result in a paradoxically busy-with-waiting state. If this is true, then underclocking the CPU core would not result in dramatic decreases in performance – just like the results you see above.

The CPU also performs other operations, like copying data into mini-batches, and preparing data to be copied to the GPU, but these operations depend on the memory clock rate and not the CPU core clock rate. So now we look at the memory.

Needed RAM clock rate

CPU-RAM and other interactions with the RAM are quite complicated. Here I will show a simplified version of the process. Let's dive in and dissect this process from CPU RAM to GPU RAM for a more thorough understanding.

The CPU memory clock and RAM are intertwined. The memory clock of your CPU determines the maximum clock rate of your RAM, and both pieces together determine the overall memory bandwidth of your CPU, but usually the RAM itself determines the overall available bandwidth because it can be slower than the CPU memory rate. You can determine the bandwidth like this:

Bandwidth in GB/s = memory clock rate in MT/s × 64 bits ÷ 8 ÷ 1000 × number of memory channels, where the 64 is for a 64-bit CPU architecture. For my processors and RAM modules the bandwidth is 51.2 GB/s (e.g. quad-channel DDR3-1600: 1600 × 8 bytes × 4 channels = 51.2 GB/s).

However, the bandwidth is only relevant if you copy large amounts of data. Usually the timings – for example 8-8-8 – on your RAM are more relevant for small pieces of data and determine how long your CPU has to wait for your RAM to catch up. But as I outlined above, almost all data from your deep learning program will either easily fit into the CPU cache, or will be much too large to benefit from caching. This implies that timings will be rather unimportant and that bandwidth might be important.

So how does this relate to deep learning programs? I just said that bandwidth might be important, but this is not so when we look at the next step in the process. The memory bandwidth of your RAM determines how fast a mini-batch can be overwritten and allocated for initiating a GPU transfer, but the next step, CPU-RAM-to-GPU-RAM, is the true bottleneck – this step makes use of direct memory access (DMA). As stated above, the memory bandwidth of my RAM modules is 51.2 GB/s, but the DMA bandwidth is only 12 GB/s!

The DMA bandwidth relates to the regular bandwidth, but the details are unnecessary here and I will just refer you to this Wikipedia entry, in which you can look up the DMA bandwidth for RAM modules (peak transfer limit). But let's have a look at how DMA works.

Direct memory access (DMA)

The CPU with its RAM can only communicate with a GPU through DMA. In the first step, a specific DMA transfer buffer is reserved in both CPU RAM and GPU RAM; in the second step, the CPU writes the requested data into the CPU-side DMA buffer; in the third step, the reserved buffer is transferred to your GPU RAM without any help from the CPU. Your PCIe bandwidth is 8 GB/s (PCIe 2.0) or 15.75 GB/s (PCIe 3.0), so you should get RAM with a good peak transfer limit as determined above, right?

Not necessarily. Software plays a big role here. If you do some transfers in a clever way, you will get away with cheaper, slower memory. Here is how.

Asynchronous mini-batch allocation

Once your GPU has finished computation on the current mini-batch, it wants to immediately work on the next mini-batch. You could now, of course, initiate a DMA transfer and then wait for the transfer to complete so that your GPU can continue to crunch numbers. But there is a much more efficient way: Prepare the next mini-batch in advance so that your GPU does not have to wait at all. This can be done easily and asynchronously with no degradation in GPU performance.

CUDA code for asynchronous mini-batch allocation: The first two calls are made when the GPU starts on the current batch; the last two calls are made when the GPU has finished with the current batch. The transfer of the data will be completed long before the stream is synchronized in the second step, so there will be no delay for the GPU to begin with the next batch.
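
The code from that figure is not reproduced here, but a minimal sketch of the same idea with the CUDA runtime API might look as follows (buffer and variable names are illustrative; the host buffer is pinned so that the DMA engine can read it directly):

    #include <cuda_runtime.h>
    #include <stddef.h>

    /* One ImageNet-sized mini-batch: 128 x 244 x 244 x 3 floats (about 0.085 GB). */
    static const size_t kBatchBytes = (size_t)128 * 244 * 244 * 3 * sizeof(float);

    int main(void) {
        float *h_next;            /* CPU-side DMA buffer (pinned host memory)   */
        float *d_next;            /* GPU-side buffer for the next mini-batch    */
        cudaStream_t copy_stream; /* separate stream so copies overlap compute  */

        /* Reserve the DMA transfer buffers in CPU RAM and GPU RAM once, up front. */
        cudaHostAlloc((void **)&h_next, kBatchBytes, cudaHostAllocDefault);
        cudaMalloc((void **)&d_next, kBatchBytes);
        cudaStreamCreate(&copy_stream);

        /* When the GPU starts computing on the current batch:
           fill h_next with the next mini-batch, then enqueue the copy on the
           copy stream so the DMA transfer overlaps with the compute kernels. */
        cudaMemcpyAsync(d_next, h_next, kBatchBytes,
                        cudaMemcpyHostToDevice, copy_stream);

        /* When the GPU has finished the current batch:
           the transfer completed long ago, so this returns almost immediately
           and the GPU can start on the next batch without any delay. */
        cudaStreamSynchronize(copy_stream);

        cudaStreamDestroy(copy_stream);
        cudaFree(d_next);
        cudaFreeHost(h_next);
        return 0;
    }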

An ImageNet 2012 mini-batch of size 128 for Alex Krizhevsky's convolutional net takes 0.35 seconds for a full backprop pass. Can we allocate the next batch in this time?

If we take the batch size to be 128 and the dimensions of the data to be 244x244x3, that is a total of roughly 0.085 GB. With ultra-slow memory we have 6.4 GB/s, or in other terms 75 mini-batches per second! So with asynchronous mini-batch allocation even the slowest RAM will be more than sufficient for deep learning. There is no advantage in buying faster RAM modules if you use asynchronous mini-batch allocation.
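
The arithmetic behind those numbers:

    128 \times 244 \times 244 \times 3 \times 4\,\text{bytes} \approx 0.085\,\text{GB}
    6.4\,\text{GB/s} \,/\, 0.085\,\text{GB} \approx 75\ \text{mini-batches per second}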

This procedure also implies, indirectly, that the CPU cache is irrelevant. It does not really matter how fast your CPU can overwrite (in the fast cache) and prepare (write the cache to RAM) a mini-batch for a DMA transfer, because the whole transfer will be completed long before your GPU requests the next mini-batch – so a large cache really does not matter much.

So the bottom line is really that the RAM clock rate is irrelevant. Buy what is cheap – end of story.

But how much should you buy?

RAM size

You should have at least the same RAM size as your GPU has. You could work with less RAM, but you might need to transfer data step by step. From my experience however, it is much more comfortable to work with more RAM.

Psychology tells us that concentration is a resource that is depleted over time. RAM is one of the few hardware pieces that allows you to conserve that concentration resource for more difficult programming problems. Rather than spending lots of time on circumnavigating RAM bottlenecks, you can invest your concentration in more pressing matters if you have more RAM. Especially in Kaggle competitions I found additional RAM very useful for feature engineering. So if you have the money and do a lot of pre-processing, then additional RAM might be a good choice.

Hard drive/SSD

A hard drive can be a significant bottleneck for deep learning in some cases. If your data set is large you will typically have some of it on your SSD/hard drive, some of it in your RAM, and two mini-batches in your GPU RAM. To feed the GPU constantly, we need to provide new mini-batches at the same rate as the GPU can go through them.

For this to work we can use the same idea as asynchronous mini-batch allocation: We need to read files with multiple mini-batches asynchronously – this is really important! If you do not do this asynchronously you will cripple your performance by quite a bit (5-10%) and render your carefully crafted hardware advantages useless – good deep learning software will run faster on a GTX 680 than bad deep learning software on a GTX 980.

With this in mind, we have in the case of Alex's ImageNet convolutional net 0.085 GB every 0.3 seconds, or about 290 MB/s, if we save the data as 32-bit floating point data. If we instead save the data as JPEG, we can compress it 5-15 fold, bringing the required read bandwidth down to about 30 MB/s. If we look at hard drive speeds we typically see speeds of 100-150 MB/s, so this will be sufficient for data compressed as JPEG. Similarly, one can use MP3 or other compression techniques for sound files, but for data sets that consist of raw 32-bit floating point data it is not possible to compress the data so well: We can compress 32-bit floating point data by only 10-15%. So if you have large 32-bit data sets, you will definitely need an SSD, as hard drives with a speed of 100-150 MB/s will be too slow to keep up with your GPU; otherwise a hard drive will be fine.
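
The arithmetic, assuming a 10-fold JPEG compression in the middle of the 5-15x range given above:

    0.085\,\text{GB} \times 1024 \,/\, 0.3\,\text{s} \approx 290\,\text{MB/s} \quad \text{(raw 32-bit floats)}
    290\,\text{MB/s} \,/\, 10 \approx 30\,\text{MB/s} \quad \text{(JPEG-compressed)}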

Many people buy an SSD for comfort: Programs start and respond more quickly, and pre-processing with large files is quite a bit faster, but for deep learning an SSD is only required if your input dimensions are high and you cannot compress your data sufficiently.

If you buy an SSD you should get one which is able to hold data sets of the sizes you typically work with, plus a few tens of GB of extra space. It is also a good idea to get a hard drive to store your unused data sets on.

Power supply unit (PSU)

Generally, you want a PSU that is sufficient to accommodate all your future GPUs. GPUs typically get more energy efficient over time, so while other components will need to be replaced, a PSU should last a long while; a good PSU is therefore a good investment.

You can calculate the required watts by adding up the wattage of your CPU and GPUs, plus an additional 100-300 watts for other components and as a buffer for power spikes.
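
For example, for a hypothetical 4 GPU build (the 250 W per GPU and 140 W for the CPU are illustrative figures only; check the TDP of your actual parts):

    4 \times 250\,\text{W} + 140\,\text{W} + 200\,\text{W} \approx 1340\,\text{W}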

One important thing to be aware of is whether the PCIe connectors of your PSU are able to supply an 8-pin and a 6-pin connector with one cable. I bought a PSU which had 6x PCIe ports, but which was only able to power either an 8-pin or a 6-pin connector per cable, so I could not run 4 GPUs with that PSU.

Another important thing is to buy a PSU with a high power efficiency rating – especially if you run many GPUs and will run them for a long time.

Running a 4 GPU system at full power (1000-1500 watts) to train a convolutional net for two weeks will amount to 300-500 kWh, which in Germany – with rather high power costs of 20 cents per kWh – will amount to 60-100€ ($66-111). If this price is for one hundred percent efficiency, then training such a net with an 80% efficient power supply would increase the costs by an additional 18-26€ – ouch! This is much less for a single GPU, but the point still holds – spending a bit more money on an efficient power supply makes good sense.
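
A worked example for the upper end of that range (1500 watts, two weeks of continuous training):

    1.5\,\text{kW} \times 24\,\text{h} \times 14\,\text{days} \approx 504\,\text{kWh}
    504\,\text{kWh} \times 0.20\,\text{€/kWh} \approx 100\,\text{€}
    504\,\text{kWh} \,/\, 0.8 = 630\,\text{kWh} \;\Rightarrow\; \text{about } 25\,\text{€ extra at } 80\% \text{ efficiency}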

Cooling

Cooling is important and it can be a significant bottleneck which reduces performance more than poor hardware choices do. You should be fine with a standard heat sink for your CPU, but for your GPU you will need to make special considerations.

Modern GPUs will increase their speed – and thus power consumption – up to their maximum when they run an algorithm, but as soon as the GPU hits a temperature barrier – often 80 °C – the GPU will decrease the speed so that the temperature threshold is not breached. This enables best performance while keeping your GPU safe from overheating.

However, typical pre-programmed schedules for fan speeds are badly designed for deep learning programs, so that this temperature threshold is reached within seconds after starting a deep learning program. The result is decreased performance (a few percent), which can be significant for multiple GPUs (10-25%), where each GPU heats up the GPUs next to it.

Since NVIDIA GPUs are first and foremost gaming GPUs, they are optimized for Windows. You can change the fan schedule with a few clicks in Windows, but not so in Linux, and as most deep learning libraries are written for Linux this is a problem.

The easiest and most cost-efficient workaround is to flash your GPU with a new BIOS which includes a new, more reasonable fan schedule that keeps your GPU cool and the noise levels at an acceptable threshold (if you use a server, you could crank the fan speed to its maximum, which is otherwise not really bearable noise-wise). You can also overclock your GPU memory by a few MHz (30-50) and this is very safe to do. The software for flashing the BIOS is a program designed for Windows, but you can use wine to call that program from your Linux/Unix OS.

The other option is to set a configuration for your Xorg server (Ubuntu) where you set the option “coolbits”. This works very well for a single GPU, but if you have multiple GPUs where some of them are headless, i.e. they have no monitor attached to them, you have to emulate a monitor, which is hard and hacky. I tried it for a long time and had frustrating hours with a live boot CD to recover my graphics settings – I could never get it running properly on headless GPUs.

Another, more costly and craftier option is to use water cooling. For a single GPU, water cooling will nearly halve your temperatures even under maximum load, so that the temperature threshold is never reached. Even multiple GPUs stay cool, which is rather impossible when you cool with air. Another advantage of water cooling is that it operates much more quietly, which is a big plus if you run multiple GPUs in an area where other people work. Water cooling will cost you about $100 for each GPU plus some additional upfront costs (something like $50). Water cooling will also require some additional effort to assemble your computer, but there are many detailed guides on that and it should only require a few more hours of time in total. Maintenance should not be that complicated or effortful.

From my experience these are the most relevant points. I bought large towers for my deep learning cluster because they have additional fans for the GPU area, but I found this to be largely irrelevant: about a 2-5 °C decrease, not worth the investment and the bulkiness of the cases. The most important part is really the cooling solution directly on your GPU – flash your BIOS, use water cooling, or live with a decrease in performance; these are all reasonable choices in certain situations. Just think about what you want in your situation and you will be fine.

Motherboard and computer case

Your motherboard should have enough PCIe ports to support the number of GPUs you want to run (usually limited to four GPUs, even if you have more PCIe slots); remember that most GPUs have a width of two PCIe slots, so you will need 7 slots to run 4 GPUs, for example. PCIe 2.0 is okay for a single GPU, but PCIe 3.0 is quite cost-efficient even for a single GPU; for multiple GPUs always buy a PCIe 3.0 board, which will be a boon when you do multi-GPU computing, as the PCIe connection will be the bottleneck here.

The motherboard choice is straightforward: Just pick a motherboard that supports the hardware components that you want.

When you select a case, you should make sure that it supports full-length GPUs that sit on top of your motherboard. Most cases support full-length GPUs, but you should be suspicious if you buy a small case. Check its dimensions and specifications; you can also try a Google image search of that model and see if you find pictures with GPUs in them.

Monitors

I first thought it would be silly to write about monitors also, but they make such a huge difference and are so important that I just have to write about them.

The money I spent on my three 27-inch monitors is probably the best money I have ever spent. Productivity goes up by a lot when using multiple monitors. I feel desperately crippled if I have to work with a single monitor. Do not short-change yourself on this matter. What good is a fast deep learning system if you are not able to operate it in an efficient manner?

Typical monitor layout when I do deep learning: Left: Papers, Google searches, gmail, stackoverflow; middle: Code; right: Output windows, R, folders, systems monitors, GPU monitors, to-do list, and other small applications.

Some words on building a PC

Many people are scared to build computers. The hardware components are expensive and you do not want to do something wrong. But it is really simple, as components that do not belong together do not fit together. The motherboard manual is often very specific about how to assemble everything, and there are tons of guides and step-by-step videos which walk you through the process if you have no experience.

The great thing about building a computer is that once you have done it, you know everything there is to know about building a computer, because all computers are built in the very same way – so building a computer will become a life skill that you will be able to apply again and again. So there is no reason to hold back!

Conclusion / TL;DR

GPU: GTX 680 or GTX 960 (no money); GTX 980 (best performance); GTX Titan (if you need memory); GTX 970 (no convolutional nets)

CPU: Two threads per GPU; full 40 PCIe lanes and correct PCIe spec (same as your motherboard); > 2GHz; cache does not matter;

RAM: Use asynchronous mini-batch allocation; clock rate and timings do not matter; buy at least as much CPU RAM as you have GPU RAM;

Hard drive/SSD: Use asynchronous batch-file reads and compress your data if you have image or sound data; a hard drive will be fine unless you work with 32 bit floating point data sets with large input dimensions

PSU: Add up watts of GPUs + CPU + (100-300) for required power; get high efficiency rating if you use large conv nets; make sure it has enough PCIe connectors (6+8pins) and watts for your (future) GPUs

Cooling: Set coolbits flag in your config if you run a single GPU; otherwise flashing BIOS for increased fan speeds is easiest and cheapest; use water cooling for multiple GPUs and/or when you need to keep down the noise (you work with other people in the same room)

Motherboard: Get PCIe 3.0 and as many slots as you need for your (future) GPUs (one GPU takes two slots; max 4 GPUs per system)

Monitors: If you want to upgrade your system to be more productive, it might make more sense to buy an additional monitor rather than upgrading your GPU

 

Update 2015-04-22: Removed recommendation for GTX 580



Comments

  • cicero19 says
  • 2015-03-09 at 20:36
  • Hi Tim,
  • This is a great overview. Wondering if you could recommend any cost-effective CPUs with 40 PCIe lanes.
  • Thanks!
  • Tim Dettmers says
  • 2015-03-10 at 08:41
  • There are many CPUs in all different price ranges which are all reasonable choices, and most CPUs support 40 PCIe lanes. The best practice is probably to look at a site like http://pcpartpicker.com/parts/cpu/ and select a CPU with a good rating and a good price; then check if it supports the 40 lanes and you will be good to go.
  • lU says
  • 2015-03-09 at 22:59
  • Covers everything i wanted to know and even more, thanks!
  • It also confirms my choice for a pentium g3258 for a single GPU config. Insanely cheap, and even has ecc memory support, something that some folks might want to have..
  • Rusty Scupper says
  • 2015-03-10 at 02:01
  • what,the.heck… Could you have skipped the blather and gotten to the point? There are only a few specific combinations that support what you were trying to explain so maybe something like:
  • – GTX 580/980
  • – i5 / i7 CPU
  • – Lots of ram (duh)
  • – Fast hard drive
  • zeng says
  • 2016-08-29 at 05:17
  • 授人以鱼不如授人以渔 (“it is better to teach a man to fish than to give him a fish”) – the same proverb exists in Chinese.
  • Very helpful, thanks for sharing.
  • Hannes says
  • 2015-03-11 at 03:45
  • I find the recommendation of the GTX 580 for *any* kind of deep learning or budget a little dubious since it doesn’t support cuDNN. What good is a GPU that doesn’t support what’s arguably the most important library for deep learning at the moment?
  • Tim Dettmers says
  • 2015-03-11 at 09:14
  • This is a really good and important point. Let me explain my reasoning why I think a GTX 580 is still good.
  • The problem with no cuDNN support is really that you will require much more time to set everything up and often cutting-edge features that are implemented in libraries like torch7 will not be available. But it is not impossible to do deep learning on a GTX 580, and good, usable deep learning software exists. One will probably need to learn CUDA programming to add new features through one's own CUDA kernels, but this will just require time and not money. For some people time and effort is relatively cheap, while money is rather expensive. If you think about students in developing countries this is very much true; if you earn $5500 a year (average GDP per capita PPP of India; for the US this is $53k – so think about your GPU choice if you had 10 times less money) then you will be happy that there is a deep learning option that costs less than $120. Of course I could recommend cards like the GTX 750, which are also in that price range and which work with cuDNN, but I think a GTX 580 (much faster and more memory) is just better than a GTX 750 (cuDNN support) or other alternatives.
  • EDIT: I think it might be good to add another option, which offers support for cuDNN but which is rather cheap, like the GTX 960 4GB (only a bit slower than the GTX 580) which will be available shortly for about $250-300. But as you see, an additional $130-180 can be very painful if you are a student in a developing country.
  • DarkIdeals says
  • 2016-09-08 at 08:33
  • A great 2016 update if you happen to still frequent this blog (don’t see any recent posts) is the new GTX 1060 Pascal graphic card. Specifically the 3GB model. Now 3GB is definitely cutting a tad close on memory, however it’s a VASTLY superior choice to both a 580 AND a 960 4gb. The 1060 6GB model is equivalent to a GTX 980 in overall performance, and the 3GB 1060 model is only ever-so-slightly weaker putting it at the level of a hugely overclocked GTX 970 (i’m talking like ~1,650mhz 970 levels. Which is maybe ~5% below a 980)
  • And the 3GB 1060 can be had for a measly $199 BRAND NEW! It's definitely something to consider at least. And if you still desperately need that extra VRAM, then the 6GB version of the 1060 (which, as I mentioned, is literally about tied with an average GTX 980!) can be had for as little as $249 right now!
  • Tim Dettmers says
  • 2016-09-10 at 03:54
  • I updated my GPU recommendation post with the GTX 1060, but I did not mention the 3GB version, which did not exist at that time. Thanks for letting me know!
  • Tim Dettmers says
  • 2015-03-12 at 06:28
  • A K40 will be similar to a GTX Titan in terms of performance. The additional memory will be great if you train large conv nets and this is the main advantage of a K40. If you can choose the upcoming GTX Titan X in the academic grant program, this might be the better choice as it is much faster and will have the same amount of memory.
  • dh says
  • 2015-03-20 at 21:02
  • why is k40 much more expensive when gtx x is cheaper but has more cores and higher bandwidth?
  • Tim Dettmers says
  • 2015-03-20 at 21:07
  • The K40 is a compute card which is used for scientific applications (often systems of partial differential equations) which require high precision. Tesla cards have additional double precision and memory correction modules which make them excel at high precision tasks; these extra features, which are not needed in deep learning, make them so expensive.
  • zeecrux says
  • 2015-07-03 at 10:00
  • ImageNet on K40:
  • Training is 19.2 secs / 20 iterations (5,120 images) – with cuDNN
  • and GTX770:
  • cuDNN Training: 24.3 secs / 20 iterations (5,120 images)
  • (source: http://caffe.berkeleyvision.org/performance_hardware.html)
  • I trained ImageNet model on a GTX 960 and have this result:
  • Training is around 26 secs / 20 iterations (5,120 images) – with cuDNN
  • So GTX 960 is close to GTX 770
  • So for 450000 iterations, it takes 120 hours (5 days) on K40, and 162.5 hours (6.77 days) on GTX 960.
  • Now K40 costs > 3K USD, and GTX 960 costs < 300 USD
  • Tim Dettmers says
  • 2015-03-13 at 06:50
  • Thanks for your comment. NVIDIA SLI is an interface which allows rendering computer graphics frames on each GPU and exchanging them via SLI. The use of SLI is limited to this application, so doing computations and parallelizing them via SLI is not possible (one needs to use the PCIe interface for this). So CUDA cannot use SLI.
  • Tim Dettmers says
  • 2015-03-13 at 18:04
  • Glad that you liked the article. I am using Eclipse (NVIDIA Nsight) for C++/C/CUDA in that pic; I also use Eclipse for Python (PyDev) and Lua (Koneki). While I am very satisfied with Eclipse for Python and CUDA, I am less satisfied with Eclipse for Lua (that is torch7) and I probably will switch to Vim for that.
  • Felix Lau says
  • 2015-03-14 at 10:51
  • Thanks for this great post!
  • What's your thought on using g2.xlarge instead of building the hardware? I believe g2.xlarge is a lot slower than a GTX 980. However it is possible to spawn many instances on AWS at the same time, which might be useful for tuning hyperparameters.
  • Tim Dettmers says
  • 2015-03-16 at 18:52
  • Indeed the g2.xlarge is much slower than the GTX 980, but also much cheaper. It is a cheap option if you want to train multiple independent neural nets, but it can be very messy. I only have experience with regular CPU instances, but with those it can take considerable time to manage one's instances, especially if you are using AWS for large data sets together with spot instances — you will definitely be more productive with a local system. But in terms of affordability GPU instances are just the best.
  • I just want to make you aware of other downsides with GPU instances, but the overall conclusion stays the same (less productivity, but very cheap): You cannot use multiple GPUs on AWS instances because the interconnect is just too slow and will be a major bottleneck (4 GPUs will run slower than 2). Also the PCIe interconnect performance is crippled by the virtualization. This can be partly improved by a hacky patch, but overall the performance will still be bad (it might well be that 2 GPUs are worse than 1 GPU).
  • Also like the GTX 580, the GPU instances do not support newer software, and this can be quite bad if you want to run modern variants of convolutional nets.
  • Mark says
  • 2015-03-14 at 18:26
  • What motherboards by company and model number do you recommend (ASUS, MSI, etc) for a home PC that will be used for multimedia as well (not concerned with gaming)? I am thinking of using a single GTX 980 but may think about adding more GPUs later (not a crucial concern). Also, what i7 CPU models do I need? Thanks for the help and the suggestion of the 960 alternative to the 580. I am learning Torch7 and can afford the 980.
  • Tim Dettmers says
  • 2015-03-16 at 18:57
  • I only have experience with motherboards that I use, and one of them has a minor hardware defect and thus I do not think my experience is representative for the overall mainboard product, and this is similar for other hardware pieces. I think with the directions I gave in this guide you can find your pieces on your own through lists that feature user rating like http://pcpartpicker.com/parts/
  • Often it is quite practical to sort by rating and buy the first highly rated hardware piece which falls in your budget.
  • Stijn says
  • 2015-03-15 at 07:44
  • What is the largest dataset you can analyze, you can choose the specs you want, and how much time would it take?
  • Tim Dettmers says
  • 2015-03-16 at 18:59
  • The sky is the limit here. Google ran conv nets that took months to complete and which were run on thousands of computers. For practical data sets, ImageNet is one of the larger data sets and you can expect that new data sets will grow exponentially from there. These data sets will grow as your GPUs get faster, so you can always expect that the state of the art on a large popular data set will take about 2 weeks to train.
  • benoit says
  • 2015-03-18 at 15:25
  • Motherboard: Get PCIe 3.0 and as many slots as you need for your (future) GPUs (one GPU takes two slots; max 4 GPUs per system)
  • just to be sure I get it.
  • all GPUs are better on a PCIe 3.0 slot; as each GPU seems to take 2 slots (due to size), for 3 GPUs you'd need a motherboard with 6 PCIe 3.0 slots?
  • Tim Dettmers says
  • 2015-03-18 at 15:31
  • That’s right, modern GPUs will run faster on a PCIe 3.0 slot.
  • To install a card you only need a single PCIe 3.0 slot, but because you have a width of two PCIe slots each card will render the PCIe slot next to it unusable. For 3 GPUs you will need 5 PCIe slots, because the first two cover 4 slots and you will need a single fifth slot for the last GPU.
  • So a motherboard with 5x PCIe 3.0 x16 is fine for 3 GPUs.
  • Tim Dettmers says
  • 2015-03-18 at 16:04
  • I also read a bit about risers when I was building my GPU cluster, and I often read that there was little to no degradation in performance. However, I do not know what PCIe lane configuration (e.g. 16x/8x/8x/8x or 16x/16x/8x, which are standard for 4 and 3 GPUs, respectively) the motherboard will run under such a configuration, and this might be a problem (the motherboard might not support it well). For cryptocurrency mining this is usually not a problem, because you do not have to transfer as much data over the PCIe interface if you compare that to deep learning — so probably no one has ever tested this under deep learning conditions.
  • So I am not really sure how it will work, but it might be worth a try to test this on one of your old mining motherboards and then buy a motherboard accordingly. If you decide to do so, then please let me know. I would be really interested in what is going on in that case and how well it works. Thanks!
  • salem ameen says
  • 2015-03-23 at 21:28
  • Hi Tim,
  • I bought an MSI G80 laptop to learn and work on deep learning, which connects 2 GPUs using SLI. Could you please tell me if I could run deep learning on this laptop, even on one GPU?
  • Regards,
  • Tim Dettmers says
  • 2015-03-24 at 06:12
  • Yes, you will be able to use a single GPU for deep learning; SLI has nothing to do with CUDA. Even for dual-GPU cards (like the GTX 590), on a hardware level you can simply access both GPUs separately. This is also true for software libraries like theano and torch.
  • salemameen says
  • 2015-03-24 at 08:01
  • Thanks Tim,
  • Because I don't have a background in coding, I want to use existing libraries. By the way, I bought this laptop not for gaming but for deep learning; I thought it would be more powerful with 2 GPUs, but even if only one works fine that is ok for me. Regards,
  • Tim Dettmers says
  • 2015-03-24 at 17:18
  • You're welcome! If you use Torch7 you will be able to use both GPUs quite easily. If you dread working with Lua (it is quite easy actually, most code will be in Torch7, not in Lua), I am also working on my own deep learning library which will be optimized for multiple GPUs, but it will take a few more weeks until it reaches a state which is usable for the public.
  • Mark says
  • 2015-03-24 at 17:54
  • Looking a two possible x99 boards, ASUS x99-Deluxe (~$410 US) and ASUS Rampage V Extreme (~$450 US). Unless you know something, I do not see that the extra $40 will make any difference for ML but maybe it does for other stuff like multi-media or gaming.
  • Will start with 16G or 32G DDR4 (haven't decided yet, ~$500-$700 US).
  • I plan to use the 6-core i7-5930k (~$570 US). By your recommendations of 2 cores per GPU that means max 3 GPU’s.
  • GTX 980’s are ~$500 US and GTX Titans ~$1000 US. Besides loss of PCI slots, extra liquid cooling, what speed difference does one expect in a system with two GTX 980’s versus an identical system with one GTX Titan?
  • Tim Dettmers says
  • 2015-03-31 at 06:42
  • I do not think the boards make a great difference, they are rather about the chipset (x99) than anything else.
  • I think 6 cores should also be fine for 4 GPUs. On average, the second core is only used sparsely, so that 3 threads can often feed 2 GPUs just fine.
  • One GTX Titan X will be 150% as fast as a single GTX 980, so two GTX 980 are faster, but because one GPU is much better and easier to use than two, I would go for the GTX Titan X if you can afford it.
  • Mark says
  • 2015-03-31 at 13:20
  • “One GTX Titan X will be 150% as fast as a single GTX 980, so two GTX 980 are faster, but because one GPU is much better and easier to use than two, I would go for the GTX Titan X if you can afford it.”
  • Thanks for advice. Could you elaborate a bit more on the ease of use between one gpu versus two?
  • Also, I understand the Titan will be replaced this year with a faster GTX 980 Ti. They will be the same price.
  • Tim Dettmers says
  • 2015-03-31 at 13:46
  • If you use torch7 then it will be quite straightforward to use 2 GPUs on one problem (2 GPUs yield about 160% speed when compared to a single GPU); other libraries do not support multiple GPUs well (theano/pylearn2, caffe), and others are quite complicated to use (cuda-convnet2). So 160% is not much faster than a GTX Titan X, and if you also want to use different libraries, a GTX Titan X would be faster overall (and have more memory too!).
  • I am just working on a library that combines the ease of use of torch7 with very efficient parallelism (+190% speedup for 2 GPUs), but it will take a month or two until I have implemented all the needed features.
  • Lucas Shen (@icrtiou) says
  • 2015-03-30 at 18:13
  • Hi Tim,
  • I'm interested in the GPU BIOS. Can you share which BIOS with a new, more reasonable fan schedule you are using right now? I have 2 Titan X cards waiting to be flashed.
  • Tim Dettmers says
  • 2015-03-30 at 18:43
  • I do not know if a GTX 970/GTX 980 BIOS is compatible with a GTX Titan X BIOS. Doing a quick Google search, I cannot find information about a GTX Titan X BIOS, which might be because the card is relatively new.
  • I think you will find the best information in folding@home and other crowd-computing forums (also cryptocurrency mining forums) to get this working.
  • Lucas Shen (@icrtiou) says
  • 2015-03-30 at 19:24
  • Thanks for the pointers. fah is very interesting XD, though I haven't found a Titan X BIOS yet. Guess I have to live with it for a while.
  • I saw you have plans to release a deep learning library in the future. What framework will you be working on? Torch7, Theano, Caffe?
  • Peyman says
  • 2015-03-31 at 06:14
  • Great guide Tim, thanks.
  • I am wondering if you get the display output from the same GPUs which you do the computation on?
  • I’m gonna buy a 40 lane i7 cpu, which is a LGA 2011 socket, along with a GTX 980. It seems that none of the CPUs with this socket have an internal GPU to drive display. And the other CPUs, LGA 1150 and LGA 1155, do not support more than 28 lanes.
  • So , the question is do I need a separate GPU to drive displays, or I can do the compute and run the displays on the same GPU?
  • Tim Dettmers says
  • 2015-03-31 at 06:34
  • You can use the same GPU for computation and for display; there will be no problem. The only disadvantage is that you have a bit less memory. I use 3x 27 inch monitors at 1920×1080 and this config uses about 300-400 MB of memory, which I hardly notice (well, I have 6GB of GPU memory). If you are worried about that memory you can get a cheap NVIDIA GT210 (which can drive 2 monitors) for $30 and run your display on that, so that your GTX 980 is completely free for CUDA applications.
  • Mark says
  • 2015-04-08 at 20:41
  • Got a bit of a compromise i am thinking about. To save on cash in picking a CPU. The i7 5820K and i7 5930K are the same except for pci lanes (28 versus 40). According to this video:
  • https://youtu.be/rctaLgK5stA
  • It comes down to using say a 4th 980 or Titan otherwise if it’s three or less then there is no real performance difference. This means a saving on the CPU of about $200.
  • What’s your thoughts since you warned about the i7 5820 in your article?
  • Tim Dettmers says
  • 2015-04-09 at 19:56
  • Yes, the i7 5820K only has 28 PCIe lanes and if you buy more than one GPU I would definitely choose a different CPU. The penalty will be observable when you use multiple GPUs, especially if you use 4x GTX 980 (personally, I would choose a cheap CPU < $250 with 40 lanes and instead buy 4x GTX Titan X — that will be sufficient). One note though: remember that in 2016 Q3/Q4 there will be Pascal GPUs, which are about 10 times better than a GTX Titan X (which is 50% better than a GTX 980), so it might be reasonable to go with a cheaper system and go all out once Pascal GPUs are released.
  • Mark says
  • 2015-04-09 at 20:29
  • Well if i buy now in terms of the CPU and motherboard then I would like to upgrade this system in a couple years to Pascal. To keep this base system current over a few years then would you still recommend a x99 motherboard? If so then I am stuck with only two choices 5930 or 5960.
  • AMD has cpu’s and associated motherboards but I am not familiar with anything going that direction. Do they have something in mind here that is cheaper, about the same performance and can handle up to 4 980/titan/pascal GPU’s?
  • BTW, thought i read somewhere that no current motherboard will handle Pascal, is that correct?
  • Tim Dettmers says
  • 2015-04-09 at 20:40
  • An x99 motherboard might be a bit overkill. You will not need most of its features, like DDR4 RAM. As you said, the Pascal GPUs will use their own interconnect which is much faster than PCIe — this would be another reason to spend less money on the current system. A system based on either the LGA1150 or the LGA2011 would be a good choice in terms of performance/cost.
  • I do not have experience with AMD either, but from the calculations in my blog post I am quite certain that it would also be a reasonable choice. I think in the end it just comes down to how much money you have to spare.
  • Mark says
  • 2015-04-09 at 21:52
  • Great thanks! Still one thing remain unclear to a newbie builder like me. Is an x99 chip set wed to only motherboards which will not work with Volta/Pascal? If not then I can just swap out the motherboard but keep the x99 compatible CPU, memory, etc.
  • Also, since you are writing about convolutional nets, these are front-ends the feed neural nets. However, there is a new paper on using an SVM approach that needs less memory, is faster and just as accurate as any state-of-the-art covnet/neural-net combo. It keeps the convolution and pooling layers but replaces the neural net with a new fast-food (LOL) version of SVM. They claim it works “better”
  • “Deep Fried Convnets” by Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, Ziyu Wang.
  • The SVM versus neural-net battle continues.
  • Shinji says
  • 2015-04-10 at 07:31
  • Hi Tim, this is a great post!
  • I’m interested in the actual PCIe bandwidth in the deep learning process. Are PCIe 16 lanes needed for deep learning? Of course x16 PCIe gen3 is ideal for the best performance, but I’m wondering if x8 or x4 PCIe gen3 is also enough performance.
  • Which do you think better solution if the system has 64 PCIe lanes?
  • * 4 GPGPUs connected with 16 PCIe lanes each
  • * 16 GPGPUs connected with 4 PCIe lanes each
  • Which is the more important factor, the number of GPGPUs (calculation power) or PCIe bandwidth?
  • Tim Dettmers says
  • 2015-04-10 at 08:50
  • Each PCIe lane for PCIe 3.0 has a theoretical bandwidth of about 1 GB/s, so you can run GPUs also with 8 lanes or 4 lanes (8 lanes is standard for at least one GPU if you have more than 2 GPUs), but it will be slower. How much slower will depend on the application or network architecture and which kind of parallelism is used.
  • 64 PCIe lanes are only supported by dual-CPU motherboards, and these boards often have a special PCIe switching architecture which connects the two separate PCIe systems (one for each CPU) with each other; I think you can only run up to 8 GPUs with such a system (the BIOS often cannot handle more GPUs even if you have more PCIe slots). But if you take this as a theoretical example it is best to just do some test calculations:
  • 16 GPUs means 15 data transfers to synchronize information; 4 PCIe lanes are about 4 GB/s, so 4 GB/s / 15 transfers = 0.2666 GB/s for a full synchronization. If you now have a weight matrix with, say, 800x1200 floating point numbers, you have 800x1200x4 bytes / 1024^3 = 0.0036 GB. This means you could synchronize 0.2666/0.0036 = 74 gradients per second. A good implementation of MNIST with batch size 128 will run at about 350 batches per second. So the result is that 16 GPUs with 4 PCIe lanes will be 5 times slower for MNIST. These numbers are better for convolutional nets, but not much better. Same for 4 GPUs/16 lanes:
  • 16/3 = 5.33; 5.33/0.0036 = 647; so in this case there would be a speedup of about 2 times; this is better for convolutional nets (you can expect a speedup of 3.0-3.9 depending on the implementation). You can do similar calculations for model parallelism, in which case the 16 GPU setup would fare a bit better (but it is probably still slower than 1 GPU).
  • So the bottom line is that 16 GPUs with 4 PCIe lanes are quite useless for any sort of parallelism — PCIe transfer rates are very important for multiple GPUS.
  • Shinji says
  • 2015-04-10 at 10:07
  • Thank you for the explanation.
  • Regarding your description, it depends on the application, but the data transfer time among GPUs is dominant in a multi-GPU environment.
  • However, I have another question.
  • In your assumption, the GPU processing time is always shorter than the data transfer time. In the 16 GPU case, GPU processing must take less than 14 msec to process one batch. In the 4 GPU case, it must take less than about 2 msec.
  • If the GPU processing time is long enough compared to the data transfer time, the data transfer time for synchronization is negligible. In that case, it is important to have many GPUs rather than high PCIe bandwidth.
  • Is my assumption unlikely in the usual case?
  • Tim Dettmers says
  • 2015-04-10 at 11:02
  • This is exactly the case for convolutional nets, where you have high computation with small gradients (weight sharing). However, even for convolutional nets there are limits to this; beyond eight GPUs it can quickly become difficult to gain near-linear speedups, which is mostly due to the slow interconnects between computers. An 8 GPU system will be reasonably fast with speedups of about 7-8 times for convolutional nets, but for more than 8 GPUs you have to use normal interconnects like InfiniBand. InfiniBand is similar to PCIe but its speed is fixed at about 8-25 GB/s (8 GB/s is the affordable standard; 16 GB/s is expensive; 25 GB/s is very, very expensive): So for 6 GPUs + an 8 GB/s standard connection this yields a bandwidth of 1.6 GB/s, which is much worse than the 4 GPU 16 lanes example; for 12 GPUs this is 0.72 GB/s; 24 GPUs 0.35 GB/s; 48 GPUs 0.17 GB/s. So pretty quickly it will be pretty slow even for convolutional nets.
  • Tim Dettmers says
  • 2015-05-06 at 05:57
  • I overlooked your comment, but it is actually a very good question. It turns out that you hit the mark exactly: the less communication is needed, the better more GPUs are compared to more bandwidth. However, in deep learning there are only a few cases where it makes sense to trade bandwidth for more GPUs. Very deep recurrent neural networks (time dimension) would be an example, and to some degree (very) deep neural networks (20+ layers) are of this type. However, even for 20+ layers you still want to maximize your bandwidth to maximize your overall performance.
  • For dense neural networks, anything above 4 GPUs is rather impractical. You can make it work to run faster, but this requires much effort and several compromises in model accuracy.
  • Bjarke Felbo says
  • 2015-04-25 at 21:31
  • Thanks for a great guide! I'm wondering if you could give me a rough estimate of the performance boost I would get by upgrading my system? Would be awesome to have that before I spend my hard-earned money! I suppose it's mainly based on my current GPU, but here's a bit of info about the rest of the system as well.
  • Current setup:
  • ATI Radeon™ HD 5770 1gb
  • One of the last CPU’s from the 775-socket series.
  • 4gb ram
  • SSD
  • Upgraded setup:
  • GTX 960 4gb
  • Modern dual-thread CPU with 2+ GHz
  • 8gb ram
  • SSD
  • Two more questions:
  • 1) I've sometimes experienced issues between different motherboard brands and certain GPUs. Do you have a recommendation for a specific motherboard brand (or specific product) that would work well with a GTX 960?
  • 2) Any idea of what the performance reduction would be by doing deep learning in caffe using a Virtualbox environment of Ubuntu instead of doing a plain Ubuntu installation?
  • Tim Dettmers says
  • 2015-04-26 at 08:12
  • It is difficult to estimate the performance boost if your previous GPU is an ATI GPU; but for the other hardware pieces you should see about a 5-10% increase in performance.
  • 1. I never had any problems with my motherboards, so I cannot give you any advice here on that topic.
  • 2. I also had this idea once, but it is usually impossible to do this: CUDA and virtualized GPUs do not go together, you will need specialized GPUs (GRID GPUs, which are used on AWS); even if they did go together there would be a stark performance decrease.
  • It is a great change to go from Windows to Ubuntu, but it is really worth doing if you are serious about deep learning. A few months in Ubuntu and you will never want to go back!
  • Bjarke Felbo says
  • 2015-04-26 at 20:19
  • Thanks for the quick response! I’ll try Ubuntu then (perhaps some dual-booting). Would it make sense to add water-cooling to a single GTX 960 or would that be overkill?
  • Dimiter says
  • 2015-04-28 at 08:40
  • Tim,
  • Thanks for a great write-up. Not sure what I’d have done without it.
  • A bit of a n00b question here,
  • Do you think it matters in practice if one has PCIe 2.0 or 3.0?
  • Thanks
  • Tim Dettmers says
  • 2015-04-28 at 09:42
  • If it is possible that you will have a second GPU at anytime in the future definitely get a PCIe 3.0 CPU and motherboard. If you use additional GPUs for parallelism, then in the case of PCIe 2.0 you will suffer a performance loss of about 15% for a second GPU, and much larger losses (+40%) for your third and fourth GPU. If you are sure that you will stay with one GPU in the future, then PCIe 2.0 will only give you a small or no performance decrease (0-5%) and you should be fine.
  • Mark says
  • 2015-04-28 at 16:09
  • This may not make much difference if you care about a new system now or about having a more current system in the future. However, if you want to keep it around for years and use it for other things besides ML then wait a few months.
  • Intel's Skylake CPU will be released in a few months along with its new chipset, new socket, new motherboards etc. All PCIe 3.0, DDR4, etc. It's considered a big change compared to prior CPUs. Skylake prices are supposed to be similar to current offerings, but retailers say they expect the price of DDR4 to drop. Don't really understand why, but gamers are also waiting for the release … maybe just because “new and improved”, since it doesn't seem to translate into a big plus for the gaming experience.
  • Yu Wang says
  • 2015-04-29 at 22:38
  • Hi Tim,
  • Thanks for the insightful posts. I'm a grad student working in the image processing area. I just started to explore some deep learning techniques with my own data. My dataset contains 10 thousand 800*600 images with 50+ classes. I'm wondering whether a GTX 970 will be sufficient to try different networks and algorithms, including CNNs.
  • Tim Dettmers says
  • 2015-05-01 at 04:45
  • Although your data set is very small and you will only be able to train a small convolutional net before you overfit, the size of the images is huge. Unfortunately, the size of the images is the most significant memory factor in convolutional nets. I think a GTX 970 will not be sufficient for this.
  • However, keep in mind that you can always shrink the images to keep them manageable. For a GTX 970 you will need to shrink them to about 250*190 or so.
  • sacherus says
  • 2015-05-05 at 22:44
  • Hi Tim,
  • thank you for your great article. I think it covers everything that you need to know to start your journey with DL.
  • I’m also grad student (but instead of image processing, I’m in speech processing) and want to buy some machine (I thinking also about Kaggle, but for beginning I could take 20-40 place  ). I want to buy (East Europe) used workstation (without graphics) + used graphics. Probably I will end up with 2 cards in my computer… Maybe 3….
  • Questions:
  • 1) You wrote that you need a motherboard with 7 PCIe 3.0 slots for 3 GPUs. Isn't it possible to have a
  • 16 x | 1x | 16x | 1x (etc) setup? Like in http://www.msi.com/product/mb/Z87-G45-GAMING.html#hero-overview?
  • 2) So there do not exist setups that support 16x/16x (or are to expensive)?
  • 3) I see that computation compatibility also matters. I can buy geforce 780 ti in similar price to gtx 970. 780 ti has better bandwith + more GFLOPS (you never mentioned about FLOPS), but 970 has newer CC + more memory.
  • 4) Maybe I should let go and buy what… 960 or 680 (just start)… However, 970 is not much expensive than those 2. Or just buy whole used PC.?
  • Tim, what do you think?
  • Reply
  • Tim Dettmers says
  • 2015-05-06 at 05:50
  • 1. You are right, a 16x | 1x | 16x | 1x setup will work just as well; I had not thought about it that way, and I will update my blog with that soon — thanks!
  • 2. I hope I understand you right: you have a total of 40 PCIe lanes supported by your CPU (not the physical slots, but rather the communication wires that are laid from the PCIe slots to the CPU) and your GPUs will use up to 16x of that (standard mainboards); so 16x/16x is standard if you use 2 GPUs, for 3 GPUs this is 16x/8x/16x, and for 4 GPUs 16x/8x/8x/8x. If you mean physical slots, then a 16x | Yx | 16x setup will do, where Y is any size; because most GPUs are two PCIe slots wide, you most often cannot run 2 GPUs in adjacent 16x | 16x mainboard slots, though sometimes this will work if you use water cooling (it reduces the width to one slot).
  • 3. GFLOPS do not matter in deep learning (it is virtually the same for all algorithms); your algorithms will always be limited by bandwidth. The 780 Ti has higher bandwidth but an inferior architecture, and the GTX 970 would be faster. However, the GTX 780 Ti has no memory glitches (unlike the GTX 970), and so I would go with the GTX 780 Ti.
  • 4. The GTX 680 might be a bit more interesting than the GTX 780 Ti if you really want to train a lot of convolutional nets; otherwise a GTX 780 Ti is best; if you only use dense networks you might want to go with the GTX 970.
  • Reply
  • Florijan Stamenković says
  • 2015-05-11 at 13:49
  • Tim,
  • Thanks for the excellent guide! It has helped us a lot. However, a few questions remain…
  • We plan to build a deep-learning machine (in a server rack) based on 4 Titan cards. We need to select other hardware. Ideally we would put all four cards on a single board with 4x PCIe 3.0 x16. The questions are:
  • 1. If I understand correctly, GPU intercommunication is the bottleneck. Should we go for dual 40-lane CPUs (Xeons only, right?), or take a single i7 and connect the cards with SLI?
  • 2. Will any 4x PCIe 3.0 x16 motherboard do? Is socket 2011 preferable?
  • We plan to use these nets for both convolutional and dense learning. Our budget (everything except the Titans) is around $3000, preferably less, or a bit more if justified. Please advise!
  • Reply
  • Florijan Stamenković says
  • 2015-05-11 at 14:20
  • I just read the above post as well and got some needed information, sorry for spamming. From what I understand, SLI is not beneficial.
  • Should we then go for two weaker Xeons (2620), each with 40 PCIe lanes? Will this be cost-optimal?
  • Thanks,
  • F
  • Reply
  • Tim Dettmers says
  • 2015-05-11 at 14:38
  • 2 CPUs will typically yield no speedup because usually the PCIe networks of each CPU (2 GPUs for each CPU) are disconnected which means that the GPU pairs will communicate through CPU memory (max speed about 4 GB/s, because a GPU pair will share the same connection to the CPU on a PCIe-switch). While it is reasonable for 8 GPUs, I would not recommend 2 CPUs for a 4 GPU setup.
  • There are motherboards that work differently, but these are special solutions which often only come in a package of a whole 8 GPU server rack ($35k-$40k).
  • If you use a single CPU, then any motherboard with enough slots and which supports 4 GPUs will do; choose the CPU so that it supports 40 PCIe lanes and you will be ready to go. Socket 2011 has no advantage over other sockets which fulfill these requirements.
  • Regarding SLI: SLI can be used for gaming, but not for CUDA (it would be too slow anyways); so communication is really all done by PCI Express.
  • Hope this helps!
  • Reply
  • Tim Dettmers says
  • 2015-05-12 at 11:58
  • It is quite difficult to say which one is better, because I do not know the PCIe switch layout of the dual CPU motherboard. The most common PCIe switch layout is explained in this article, and if the dual CPU motherboard that you linked behaves in a similar way, then for deep learning 2 CPUs will definitely be slower than 1 CPU if you want to use parallel algorithms across all 4 GPUs; in that case the 1 CPU board will be better. However, this might be quite different for computing purposes other than deep learning, and a 2 CPU board might be better for those tasks.
  • Reply
  • Thomas says
  • 2015-05-12 at 23:46
  • Hi Tim,
  • Thank you for all your advice on how to build a machine for DL!
  • You don’t talk about the possibility of using an embedded GPU on the motherboard (or a “small” second GPU) so as to dedicate the “big” GPU to computation. Could that affect the performance in any way?
  • Also, we want to build a computer to reproduce and improve (by making a more complex model) the work of DeepMind on their generalist AI.
  • We were thinking about getting one Titan X and 32GB of RAM.
  • Would you have any specific recommendation concerning the motherboard and CPU?
  • Thank you very much
  • Reply
  • Tim Dettmers says
  • 2015-05-13 at 14:45
  • There are some GPUs which are integrated (embedded) in regular CPUs and you can run your monitors on these processors. The effect of this is some saved memory (about a hundred MB for each monitor) but very little computational resources (less than 1 % for 3 monitors). So if you are really short on memory (say you have a GPU with 2 or 3GB and 3 monitors) then this might make good sense. Otherwise, it is not very important and buying a CPU with integrated graphics should not be a deciding factor when you buy a CPU.
  • As I said in the article, you have a wide variety of options for the CPU and motherboard, especially if you will stick with one GPU. In this case you can really go for very cheap components and it will not hurt your performance much. So I would go for the cheapest CPU and motherboard with a reasonably good rating on pcpartpicker.com if I were you.
  • Reply
  • Richard says
  • 2015-05-16 at 16:53
  • Hi Tim,
  • First can I say thanks very much for writing this article – it has been very informative.
  • I’m a first year PhD student. My research is concerned with video classification and I’m looking into using convolutional nets for this purpose.
  • My current system has a GT 620 which takes about 4 hours to run a LeNet-5-based network built using Theano on MNIST. So I’m looking to upgrade and I have about £1000 to do it with.
  • I’ve allocated about £500 for the GPU but I’m struggling to decide what to get. I’ve discounted the GTX 970 due to the memory problems. I was thinking either a GTX 780 (6GB ASUS version), a GTX 980, or two GTX 960s. What is your opinion on this? I know I can’t use multiple GPUs with Theano, but I could run two different nets at the same time on the 960s; however, would it be quicker just to run each net consecutively on the 980 since it’s faster? Also there’s the 780, which, although slower than the 980, has more RAM, which would be beneficial for convolutional nets. I looked into buying second hand as you suggested, however I’m buying through my university so that isn’t an option.
  • Thanks for your help and for the great article once again.
  • Cheers,
  • Richard
  • Reply
  • Tim Dettmers says
  • 2015-05-17 at 15:39
  • That is really a tricky issue, Richard. If you use convolution on the spatial dimensions of an image as well as the time dimension, you will have 5-dimensional tensors (batch size, rows, columns, maps, time) and such tensors will use a lot of memory. So you really want a card with plenty of memory. If you use the Nervana Systems 16-bit kernels you would be able to reduce memory consumption by half; these kernels are also nearly twice as fast (for dense connections they are more than twice as fast). To use the Nervana Systems kernels, you will need a Maxwell GPU (GTX Titan X, GTX 960, GTX 970, GTX 980). So if you use this library a GTX 980 will have “virtually” 8GB of memory, while the GTX 780 has 6GB. The GTX 980 is also much faster than the GTX 780, which further adds to the GTX 980 option. However, the Nervana Systems kernels still lack some support for natural language processing, and overall you will have a far more mature software stack if you use Torch and a GTX 780. If you think about adding your own CUDA kernels, the Nervana Systems + GTX 980 option may not be so suitable, because you will probably need to handle the custom compiler and program 16-bit floating point kernels (I have not looked at this, but I believe there will be things which make it more complicated than regular CUDA programming).
  • I think both the GTX 780 and the GTX 980 are good options. The final choice is up to you!
  • Hope this helps!
  • Cheers,
  • Tim
  • Reply
  • Richard says
  • 2015-05-20 at 11:50
  • Thanks for the detailed response Tim,
  • I think I’ll go with the 780 for now due to the extra physical memory. Quick follow-up question: if I have the money for an additional card in the future, would I need to buy the same model? Could I, for example, have both a GTX 780 and a GTX 980 running in the same machine so that I can have two different models running on each card simultaneously? Would there be any issues with drivers etc.? Going to order the parts for my new system tomorrow; will post some benchmarks soon.
  • Cheers,
  • Richard
  • Reply
  • Tim Dettmers says
  • 2015-05-20 at 14:01
  • GPUs can only communicate directly if they are based on the same chip (but brands may differ). So for parallelism you would need to get another GTX 780; otherwise a GTX 980 is fine for everything else. Also remember that new Pascal GPUs will arrive around Q3 2016 and those will be significantly faster than any Maxwell GPU (3D memory) — so waiting might be an option as well.
  • Reply
  • Mark says
  • 2015-05-26 at 12:21
  • FYI on the Pascal chip from NVIDIA: the speedup over Titan is “up to 5x.” Of this, a 2x speedup will come from the option of switching to 16-bit floating point in Pascal.
  • The rest of the “up to 10x speedup” comes from the 2x speedup you get from NVLink. Here the comparison is two Pascals versus two Titans. I don’t know what the speedup would be if the Pascals used the same PCIe interconnect as the Titans, or if they could even use the PCIe interconnect. Hopefully so; then a new motherboard would not be necessary.
  • Reply
  • Mark says
  • 2015-05-26 at 12:26
  • That second 10x speedup claim with NVLink is a bit strange because it is not clear how it is being made.
  • Reply
  • Sinan says
  • 2015-05-29 at 04:11
  • That sounds interesting. Would you mind sharing more details about your G3258-based system?
  • Reply
  • Tim Dettmers says
  • 2015-05-29 at 05:17
  • I do not have a Haswell G3258 and I would not recommend one, as it only runs 16 PCIe 3.0 lanes instead of the typical 40. So if you are looking for a CPU I would not pick Haswell — too new and thus too expensive, and many Haswells do not have full 40 PCIe lanes.
  • Reply
  • Sinan says
  • 2015-05-29 at 05:31
  • Sorry Tim, my comment was meant to be in response to the comment #128 by user “lU” from March 9, 2015 at 10:59 PM. I wonder why it didn’t appear under that one despite having double-checked before posting. I guess it’s the fault of my mobile browser.
  • First of all, thank you for a series of very informative posts, they are all much appreciated.
  • I was planning to go for a single GPU system (GTX 980 or the upcoming 980 Ti) to get started with deep learning, and I had the impression that at $72, this is the most affordable CPU out there.
  • Reply
  • Tim Dettmers says
  • 2015-05-29 at 05:43
  • You’re welcome! I was looking for other options, but to my surprise there were not any in that price range. If you are using only a single GPU and you are looking for the cheapest option, this is indeed the best choice.
  • Reply
  • Frank Kaufmann (@FrankKaufmann76) says
  • 2015-06-09 at 17:52
  • What are your thoughts on the GTX 980 Ti vs. the Titan X? I guess with “980” in your article you referred to the 4 GB models. The 980 Ti has the same Memory Bandwidth as the Titan X, 2GB more memory than a 980 (which should make it better for big convnets), only a few CUDA cores less. And the price difference is 549 USD for a 980 Ti vs 999 USD for the Titan X.
  • Reply
  • Tim Dettmers says
  • 2015-06-15 at 11:52
  • The GTX 980 Ti is a great card and might be the most cost effective card for convolutional nets right now. The 6GB RAM on the card should be good enough for most convolutional architectures. If you will be working on video classification or want to use memory-expensive data sets I would still recommend a Titan X over a 980 Ti.
  • Reply
  • Kai says
  • 2015-06-15 at 11:21
  • Hey Tim! Thanks for these posts, they’re highly, highly appreciated! I’m just starting to get my feet wet in deep learning – is there any way to hook up my Laptop to a GPU (maybe even an external one?) without having to build a PC from scratch so I could start GPGPU programming on small datasets with less of an investment? Does the answer depend on my motherboard?
  • Reply
  • Tim Dettmers says
  • 2015-06-15 at 11:49
  • In that case it will be best to use AWS GPU spot instances, which are cheap and fast. External GPUs are available, but they are not an option because the data transfer, CPU -> USB-like-interface -> GPU, is too slow for deep learning. Once you have gained some experience with AWS I would then buy a dedicated deep learning PC.
  • Reply
  • Tim Dettmers says
  • 2015-06-18 at 12:30
  • What you write is all true, but you have to look at it in two different ways: (1) CPU -> GPU, and (2) GPU -> GPU.
  • For CPU -> GPU you will need pinned memory to do asynchronous copies; however, for GPU -> GPU the copy will be automatically asynchronous in most use cases — no pinned GPU memory needed (cudaMemcpy and cudaMemcpyAsync are almost always the same for GPU -> GPU transfers).
  • It turns out that I use pinned memory in my clusterNet project, but it is a bit hidden in the source code: I use it only for batch buffers in my BatchAllocator class, which has an embarrassingly poor design. There I transfer normal CPU memory to a pinned buffer (while the GPU is busy) and then, in another step, transfer it asynchronously to the GPU, so that the batch is ready when the GPU needs it.
  • You can also allocate the whole data set as pinned memory, but this might cause some problems, because once pinned, the OS cannot “optimize” the locked-in memory anymore, which may lead to performance problems if the pinned chunk is too large.
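  • For readers who want to see the staging pattern in code, here is a minimal CUDA sketch of the idea described above (this is not the actual clusterNet/BatchAllocator code; the batch sizes, the dummy kernel, and all names are made up for illustration): the CPU copies the next batch from pageable memory into a small pinned buffer, uploads it asynchronously on a copy stream, and the GPU meanwhile computes on the batch that is already resident.
```cpp
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// Dummy stand-in for a real training kernel.
__global__ void train_step(float* batch, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) batch[i] *= 0.999f;                         // placeholder for real work
}

int main() {
    const int num_batches = 10, batch_elems = 128 * 784;    // made-up sizes
    const size_t batch_bytes = batch_elems * sizeof(float);
    std::vector<float> dataset(num_batches * batch_elems, 1.0f); // pageable host memory

    float* pinned;                                           // page-locked staging buffer
    float* d_batch[2];                                       // double buffer on the GPU
    cudaMallocHost(&pinned, batch_bytes);
    cudaMalloc(&d_batch[0], batch_bytes);
    cudaMalloc(&d_batch[1], batch_bytes);

    cudaStream_t copy_s, compute_s;
    cudaStreamCreate(&copy_s);
    cudaStreamCreate(&compute_s);

    // Upload batch 0 synchronously to get started.
    cudaMemcpy(d_batch[0], dataset.data(), batch_bytes, cudaMemcpyHostToDevice);

    for (int b = 0; b < num_batches; ++b) {
        // Compute on the batch that is already on the GPU.
        train_step<<<(batch_elems + 255) / 256, 256, 0, compute_s>>>(d_batch[b % 2], batch_elems);

        if (b + 1 < num_batches) {
            // Stage the next batch: pageable -> pinned (CPU memcpy, overlaps the kernel above) ...
            std::memcpy(pinned, dataset.data() + (size_t)(b + 1) * batch_elems, batch_bytes);
            // ... then pinned -> GPU asynchronously on the copy stream (also overlaps the kernel).
            cudaMemcpyAsync(d_batch[(b + 1) % 2], pinned, batch_bytes,
                            cudaMemcpyHostToDevice, copy_s);
            cudaStreamSynchronize(copy_s);       // next batch is now resident on the GPU
        }
        cudaStreamSynchronize(compute_s);        // finish batch b before its buffer is reused
    }

    cudaStreamDestroy(copy_s);  cudaStreamDestroy(compute_s);
    cudaFree(d_batch[0]);       cudaFree(d_batch[1]);
    cudaFreeHost(pinned);
    return 0;
}
```
  • The double buffer is what lets the upload of batch b+1 proceed while the kernel for batch b is still running, which is exactly the “batch is ready when the GPU needs it” behaviour described above.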
  • Reply
  • Sergii says
  • 2015-06-18 at 13:36
  • Thank you for the reply.
  • Do you know what is the reason for the inability to have overlapping pageable host memory transfer and kernel execution?
  • Reply
  • Tim Dettmers says
  • 2015-06-18 at 13:42
  • It all has to do with having a valid pointer to the data. If your memory is not pinned, then the OS can move the memory around freely to make some optimizations, so you are not guaranteed a stable pointer to the CPU memory, and thus such transfers are not allowed by the NVIDIA software because they would easily run into undefined behaviour. With pinned memory, the memory is no longer able to move, so a pointer to it stays the same at all times and a reliable transfer can be ensured.
  • This is different in GPUs, because GPU pointers are designed to be reliable at all times as long as they stay on some GPU memory, so these problems do not exist for GPU -> GPU transfers.
  • Reply
  • Sergii says
  • 2015-06-18 at 14:11
  • Thanks for the wonderful explanation. But I still have a question. Your previous reply explains why data transfer with pageable memory can’t be asynchronous with respect to a host thread, but I still do not understand why a device can’t execute a kernel while copying data from the host. What is the reason for that?
  • Reply
  • Tim Dettmers says
  • 2015-06-18 at 14:57
  • Kernels can execute concurrently with transfers; the kernel just needs to work on data in a different stream. In general, each GPU can have one host-to-GPU and one GPU-to-GPU transfer active, and execute a kernel concurrently on unrelated data in another stream (by default all operations use the default stream and are thus not concurrent).
  • But you are right that you cannot execute a kernel and a data transfer in the same stream. I assume there are issues with the hardware not being able to resume a kernel once it reaches the end of the chunk of data that is being transferred at that very moment (the kernel would need to wait, then compute, then wait, then compute, then wait… — this would not deliver good performance!). It is probably because of this that you cannot run a kernel on partially transferred data.
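  • A small, hypothetical CUDA sketch of the stream behaviour described above (sizes and names are made up): operations issued into the same stream run in order, so a kernel in that stream only starts after the preceding copy has finished, while a copy in one stream and a kernel on unrelated data in another stream are free to overlap (the copy source must be pinned for the overlap to actually happen).
```cpp
#include <cuda_runtime.h>

__global__ void square(float* x, int n) {                 // toy kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= x[i];
}

int main() {
    const int n = 1 << 22;                                 // made-up size (~16 MB of floats)
    float *h_a, *d_a, *d_b;
    cudaMallocHost(&h_a, n * sizeof(float));               // pinned, so the copy can be asynchronous
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMemset(d_b, 0, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Same stream: the kernel waits for the copy into d_a to finish (in-order execution).
    cudaMemcpyAsync(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice, s1);
    square<<<(n + 255) / 256, 256, 0, s1>>>(d_a, n);

    // Different streams, unrelated data: the copy into d_a (s1) and the kernel on d_b (s2) can overlap.
    cudaMemcpyAsync(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice, s1);
    square<<<(n + 255) / 256, 256, 0, s2>>>(d_b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(d_a); cudaFree(d_b); cudaFreeHost(h_a);
    return 0;
}
```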
  • Reply
  • Sergii says
  • 2015-06-19 at 12:11
  • Sorry that my question was confusing.
  • I wrote simple code which runs cuBLAS axpy kernels and memcpy. As you can see from the profiler ![](http://imgur.com/dmEOZTY,q2HhqlX#1), in the case of pinned memory the kernels that were launched after cudaMemcpyAsync run in parallel (with respect to the transfer).
  • However, in the case of pageable memory, ![](http://imgur.com/dmEOZTY,q2HhqlX) cudaMemcpyAsync blocks the host, and I can’t launch the next kernel.
  • In the chapter `Direct memory access (DMA)` you say “…on the third step the reserved buffer is transferred to your GPU RAM without any help of the CPU…”, so why does cudaMemcpyAsync block the host until the end of the copy process? What is the reason for that?
  • Reply
  • Tim Dettmers says
  • 2015-06-19 at 13:07
  • The most low-level reason I can think of is, as I said above, that pageable memory is inherently unreliable and may be swapped or moved around at will. If you start a transfer and want to make sure that everything works, it is best to wait until the data is fully received. I do not know the low-level details of how the OS and its drivers and routines (like DMA) interact with the GPU. If you want to know these details, I think it would be best to consult people from NVIDIA directly; I am sure they can give you a technically accurate answer. You might also want to try the developer forums.
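  • One way to see the host-blocking behaviour for yourself (a rough sketch of my own, not from the post or from NVIDIA documentation; buffer sizes are arbitrary): time how long the cudaMemcpyAsync call itself keeps the host busy for a pageable buffer versus a pinned one. On most systems the pinned version returns almost immediately, while the pageable version blocks the host for roughly the duration of the copy, which matches the profiler pictures above.
```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>

// Measure how long the host is blocked by cudaMemcpyAsync for a given host buffer.
static double host_blocking_ms(float* host_buf, float* dev_buf, size_t bytes) {
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpyAsync(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice, 0);
    auto t1 = std::chrono::high_resolution_clock::now();   // time until the call returns
    cudaDeviceSynchronize();                               // let the copy actually finish
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t bytes = 256u * 1024 * 1024;               // 256 MB, made-up size
    float* pageable = (float*)malloc(bytes);               // ordinary, pageable memory
    float *pinned, *dev;
    cudaMallocHost(&pinned, bytes);                        // page-locked memory
    cudaMalloc(&dev, bytes);

    printf("pageable: host blocked for %.2f ms\n", host_blocking_ms(pageable, dev, bytes));
    printf("pinned:   host blocked for %.2f ms\n", host_blocking_ms(pinned, dev, bytes));

    free(pageable); cudaFreeHost(pinned); cudaFree(dev);
    return 0;
}
```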
  • Reply
  • Zizhao says
  • 2015-06-25 at 13:15
  • Do you think that if you have too many monitors, they will occupy too many resources on your GPU? If yes, how can one solve this issue?
  • Reply
  • Tim Dettmers says
  • 2015-06-26 at 07:48
  • I have three monitors with 1920×1080 resolution and the monitors use about 400 MB of GPU memory. I never had any issues with this, but I also had 6GB cards and I did not train models that maxed out my GPU RAM. If you have a GPU with less memory (GTX 980 or GTX 970) then there might be some problems for convolutional nets. The best way to circumvent this problem is to buy a really cheap GPU for the monitors (a GT 210 costs about $30 and can power two (three?) monitors), so that your main deep learning GPU is not attached to any monitor.
  • Reply
  • Sameh Sarhan says
  • 2015-07-06 at 20:04
  • Tim, you have a wonderful blog and I am very impressed with the knowledge as well as the effort that you are putting into it.
  • I run a Silicon Valley startup that works in the space of wearable bio-sensing. We developed very unique non-invasive sensors that can measure vitals and psychological and physiological effects. Most of our signals are multivariate time series; we typically process (1×3000) per sensor per reading, and we can typically use up to 5 sensors.
  • We are currently expanding our ML algorithms to add CNNs capabilities, I wonder what do you recommend in terms of GPU.
  • Also, I would highly appreciate it if you could email me to further discuss a potentially mutually beneficial collaboration.
  • Regards,
  • Sameh
  • Reply
  • Tim Dettmers says
  • 2015-07-08 at 07:24
  • Hi Sameh! If you have multivariate time series, a common CNN approach is to use a sliding window over your data of X time steps. Your convolutional net would then use temporal instead of spatio-temporal convolution, which uses much less memory. As such, 6GB of memory should probably be sufficient for such data and I would recommend a GTX 980 Ti or a GTX Titan. If you need to run your algorithms on very large sliding windows (an important signal happened 120 time steps ago to which the algorithm should be sensitive), a recurrent neural network would be best, for which 6GB of memory would also be sufficient. If you want to use CNNs with such large windows it might be better to get a GTX Titan X with 12GB of memory.
  • Regards,
  • Tim
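  • To put a rough number on the memory argument above, here is a tiny back-of-the-envelope sketch (the batch size, number of feature maps, and window lengths are made-up examples, not recommendations): it estimates the 32-bit activation memory of one temporal-convolution layer for different sliding-window lengths, which stays small even for a full 1×3000 reading.
```cpp
#include <cstdio>

// Rough activation-memory estimate for one temporal-convolution layer over a sliding window.
// All numbers are made-up examples (e.g., 128-sample batches, 64 feature maps).
int main() {
    const long batch = 128, maps = 64;
    const long windows[] = {120, 3000, 30000};   // small window, one full reading, very large window

    for (long w : windows) {
        // Activations are (batch, maps, time) in 32-bit floats.
        double gb = double(batch) * maps * w * 4 / 1e9;
        printf("window %6ld steps -> ~%.3f GB of activations per layer\n", w, gb);
    }
    return 0;
}
```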
  • Reply
  • Haider says
  • 2015-07-07 at 01:38
  • Tim,
  • I am new to deep NNs. I discovered their tremendous progress after seeing the excellent 2015 GTC NVIDIA talk. Deep NNs will be very useful for my PhD, which is about electrical brain signal classification (Brain Computer Interface).
  • What a joy it was to find your blog! I just wish you wrote more.
  • All your post are full of interesting ideas. I have checked the comments of the posts which are not less interesting than the posts themselves and full of important hints too.
  • I read a lot, but did not find most of your interesting hints on hardware elsewhere. Your posts were just brilliant. I believe your posts filled a gap in the web, especially on the performance and the hardware side of deep NN.
  • I think on the hardware side, after reading your posts I have enough knowledge to build a good system.
  • On the software side, I found a lot of resources. However, I am still a bit confused. Perhaps because they weren’t your posts. Why do you only write on hardware? You can write very well, and we would love to hear about your experience with software too.
  • From where should I begin?
  • I’m very fond of Matlab and haven’t programmed much in other languages. And I don’t know anything about Python, which seems very important to learn for machine learning. I don’t mind learning Python if you advise me to do so. But if it is not necessary, then maybe I can spare my time for other deep NN stuff, which is overwhelming already. My excitement has crippled me: I have opened ~600 tabs and want to read them all.
  • If you were in my shoes, what platform would you begin learning with? Caffe, Torch, or Theano? Why?
  • And please tell me about your personal preference too. I learned from your posts that you write your own programs. But if you were picking one of these for yourself, which would it be? And if you were like me, with no Python experience, which would you pick in that case?
  • I am very interested to hear your opinion. I am not in a hurry. When you feel like writing, please answer me with some details.
  • I thank you sincerely for all the posts and comment replies in your blog and am eager to see more posts from you, Tim!
  • Thank you!
  • Reply
  • Tim Dettmers says
  • 2015-07-08 at 07:14
  • Thank you for all this praise — this is encouraging! I wrote about hardware mainly because I myself focused on the acceleration of deep learning and understanding the hardware was key in this area to achieve good results. Because I could not find the knowledge that I acquired elsewhere on a single website, I decided to write a few blog posts about this. I plan to write more about other deep learning topics in the future.
  • In my next posts I will compare deep learning to the human brain: I think this topic is very important because the relationship between deep learning and the brain is in general poorly understood.
  • I also wanted to make a blog post about software, but I did not have the time yet to do that — I will do so probably in this month or the next.
  • Regarding your questions, I would really recommend Torch7, as it is the deep learning library which has the most features and which is steadily extended by Facebook and DeepMind with new deep learning models from their research labs. However, as you posted above, it is better for you to work on Windows, and Torch7 does not work well on Windows. Theano is the best option here I guess, but Minerva also seems to be okay.
  • Caffe is a good library when you do not want to fiddle around too much within a certain programming language and just want to train deep learning models; the downside is that it is difficult to make changes to the code and the training procedure/algorithm, and few models are supported.
  • In the case of brain signals per se, I think Python offers a lot of packages which might be helpful for your research.
  • However, if you just want to get started quickly with the language you know, Matlab, then you can also use the neural network bindings from the Oxford research group, with which you can use your GPU to train neural networks within Matlab.
  • Hope this helps, good luck!
  • Reply
  • Tran Lam An says
  • 2015-07-16 at 04:14
  • Hi Tim,
  • Thanks for your support of the Deep Learning group.
  • I have a workstation DELL T7610 http://www.dell.com/sg/business/p/precision-t7610-workstation/pd.
  • I want to plug in 2 Titan X cards, from NVIDIA and ASUS. Everything seems okay; I just wonder about the PSU, cooling, and dimensions of the GPUs.
  • I will check the cooling and dimensions later. My main concern is about power.
  • I looked at the documents http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-x/specifications and https://www.asus.com/Graphics-Cards/GTXTITANX12GD5/specifications/.
  • Both of them require up to 300W of power.
  • However, in the specs of the workstation, they say something about graphics cards:
  • Support for up to three PCI Express® x16 Gen 2 or Gen 3 cards up to 675W (total for graphics (some restrictions apply))
  • GPU: One or two NVIDIA Tesla® K20C GPGPU – Supports Nvidia Maximus™ technology.
  • So the total power seems okay, right?
  • More evidence:
  • The power of the workstation would be:
  • Power Supply: 1300W (externally accessible, toolless, 80 Plus® Gold Certified, 90% efficient)
  • CPU (230W) + 2 GPUs (300W*2) + 300W = 1130W.
  • It seems okay for the power, right?
  • Hope to have your opinions.
  • Thank you for your sharing.
  • Reply
  • Tim Dettmers says
  • 2015-07-16 at 05:55
  • Everything looks fine. I ran 3 GTX Titan with a 1400 watt PSU and 4 GTX Titan with 1600 watt, so you should definitely be fine with 1300 watt and 2 GPUs. A GTX Titan also uses more power than a GTX Titan X. Your calculation looks good and there might even be space for a third GPU.
  • P.S. Comments are held to await approval if someone new posts on this website. This is to prevent spam.
  • Reply
  • Jon says
  • 2015-07-16 at 17:09
  • Will ECC RAM make convolutional NNs or deep learning more efficient or better? In other words, if the same money can buy me one PC with ECC RAM vs. TWO PCs without ECC RAM, which should I pick for deep learning?
  • Reply
  • Tim Dettmers says
  • 2015-07-16 at 17:47
  • I think ECC memory only applies to 64-bit operations and thus would not be relevant to deep learning, but I might be wrong.
  • ECC corrects bits that are flipped the wrong way due to physical inconsistencies at the hardware level of the system. Deep learning has been shown to be quite robust to inaccuracies; for example, you can train a neural network with 8 bits (if you do it carefully and in the right way), and training a neural network with 16 bits works flawlessly. Note that training on 8 bits, for example, will decrease the accuracy for all data, while ECC is relevant only for a small part of the data. However, a flipped bit might be quite severe, while a conversion from 32 to 8 bits might still be quite close to the real value. But overall I think an error in a single bit should not be so detrimental to performance, because the other values might counterbalance this error, or in the end the softmax will buffer it (an extremely large error value sent to half the connections might spread to the whole network, but in the end, for that sample, the softmax probability will be just 1/classes for each class).
  • Remember that there are always a lot of samples in a batch, and that the error gradients in this batch are averaged. Thus even large errors will dissipate quickly, not harming performance.
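  • A toy illustration of that averaging argument (my own sketch, not from the post; the batch size and gradient values are arbitrary): flip one mantissa bit in a single gradient value out of a batch of 128 and look at how much the batch-averaged gradient moves. An exponent-bit flip can of course be far more severe, which is the “a flipped bit might be quite severe” case mentioned above.
```cpp
#include <cstdio>
#include <cstring>
#include <cstdint>

// Flip one bit of a 32-bit float.
static float flip_bit(float v, int bit) {
    uint32_t u;
    std::memcpy(&u, &v, sizeof(u));
    u ^= (1u << bit);
    std::memcpy(&v, &u, sizeof(v));
    return v;
}

int main() {
    const int batch = 128;
    float grads[batch];
    for (int i = 0; i < batch; ++i) grads[i] = 0.01f;   // identical toy gradients

    const double clean_mean = 0.01;
    grads[0] = flip_bit(grads[0], 22);                  // flip the top mantissa bit of one value

    double mean = 0.0;
    for (int i = 0; i < batch; ++i) mean += grads[i];
    mean /= batch;

    printf("corrupted value: %g\n", grads[0]);
    printf("batch mean:      %g (clean mean %g)\n", mean, clean_mean);
    printf("relative change: %.3f%%\n", 100.0 * (mean - clean_mean) / clean_mean);
    return 0;
}
```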
  • Reply
  • Charles Foell III says
  • 2015-07-24 at 16:06
  • Hi Tim,
  • 1) Great post.
  • 2) Do you know how motherboards with dedicated PCI-E lane controllers shuffle data between GPUs with deep learning software? For example, the PLX PEX 8747 purports control of 48 PCI-E lanes beyond the 40 lanes a top-shelf CPU controls, e.g. allowing five x16 connections, but it’s not clear to me if deep learning software makes use of such dedicated PCI-E lane controllers.
  • I ask since going beyond three x16 connections with CPU-controlled PCIe lanes alone requires dual CPUs, but such boards along with suitable CPUs can in sum be thousands of dollars more expensive than a single-CPU motherboard that has a PLX PEX 8747 chip. If the latter has as good performance for deep learning software, might as well save the money!
  • Thanks!
  • -Charles
  • Reply
  • Tim Dettmers says
  • 2015-07-24 at 16:21
  • That is very difficult to say. I think the PLX PEX 8747 chip will be handled by the operating system after you install some driver, so that deep learning software would use it automatically in the background. However, it is unclear to me if you really can operate three GPUs at 16x/16x/16x when you use this chip, or if it will support peer-to-peer GPU memory transfers. I think you will need to get in touch with the manufacturer for that.
  • Reply
  • Charles Foell III says
  • 2015-07-25 at 00:12
  • Hi Tim, makes sense. Thanks for the reply.
  • I’ll need to dig more. I’ve seen various GPU-to-GPU benchmarks for server-grade motherboards (e.g. in HPC systems), including a raw ~ 7 GB/s using a PLX PEX chip (lower than host-to-GPU), but I’ve had difficulty finding benchmarks for single-CPU boards, let alone for more than three x16 GPU connections.
  • If you come across a success story of a consumer-grade single-CPU system with exceptional transfer speed (better than 40 PCI-E 3.0 lanes worth in sum) between GPUs when running common deep learning software/libraries, or even a system with such benchmarks for raw CUDA functions, please update.
  • In the meantime, I look forward to your other posts!
  • Best,
  • Charles
  • Reply
  • Xardoz says
  • 2015-07-30 at 09:16
  • Very Useful information indeed, Tim.
  • I have a newbie question: If the motherboard has integrated graphics facility, and if the GPU is to be dedicated to just deep learning, should the display monitor be connected directly to the motherboard rather than the GPU?
  • I have just bought a machine with GeForce Titan X card and they just sent me a e-mail saying:
  • “You have ordered a graphics card with your computer and your motherboard comes supplied with integrated graphics. When connecting your monitor it is important that you connect your monitor cable to the output on the graphics card and NOT the output on the motherboard, because by doing so your monitor will not display anything on the screen.”
  • Intuitively, it seems that off-loading the display duties to the motherboard will free the GPU to do more important things. Is this correct? If so, do you think that this can be done simply? I would ask the supplier, but they sounded lost when I started talking about deep learning on graphics cards.
  • Regards
  • Xardoz
  • Reply
  • Tim Dettmers says
  • 2015-07-30 at 10:41
  • Hi Xardoz! You will be fine when you connect your monitor to your GPU, especially if you’re using a GTX Titan X. The only significant downside of this is some additional memory consumption, which can be a couple of hundred MB. I have 3 monitors connected to my GPU(s) and it never bothered me doing deep learning. If you train very large convolutional nets that are on the edge of the 12GB limit, only then would I think about using the integrated graphics.
  • Reply
  • Xardoz says
  • 2015-08-05 at 08:24
  • Thanks Tim.
  • It seems that my motherboard graphics capability (Asus Z97-P with an Intel i7-4790k) is not available if a Graphics card is installed.
  • And yes, I do need more than 12GB for training a massive NN! So I decided to buy a small graphics card just to run the display, as suggested in one of your comments above. Seems to work fine.
  • Regards
  • Reply
  • Mohamad Ivan Fanany says
  • 2015-07-31 at 09:03
  • Hi Tim, very nice sharing. I just would like to comment on the ‘silly’ parts (smile): the monitors. Since I only have one monitor, I just use NoMachine and put the screen in one of my virtual workspaces in Ubuntu to switch between the current machine and our deep learning servers. Surprisingly this is more convenient and energy efficient, both for the electricity and for our necks. I just hope this helps, especially those who only have a single monitor. Cheers.
  • Reply
  • Tim Dettmers says
  • 2015-07-31 at 09:15
  • Thanks for sharing your working procedure with one monitor. Because I got a second monitor early, I kind of never optimized the workflow on a single monitor. I guess when you do it well, as you do, one monitor is not so bad overall — and it is also much cheaper!
  • Reply
  • Vu Pham says
  • 2015-08-04 at 16:07
  • So, I did some research on deep learning hardware, and I assume the most appropriate parts list is:
  • Motherboard: X10DRG-Q – This is a dual-socket board which allows you to double the lanes of the CPU. It has 4 fully functional x16 PCIe 3.0 slots and an extra x4 PCIe 2.0 slot for a Mellanox card.
  • CPU: 2X E5-2623
  • Network card: Mellanox ConnectX-3 EN Network Adapter MCX313A-BCBT
  • Star of the show: 4x TitanX
  • Assuming the other parts are $1000, the total cost would be $7,585, half the price of the NVIDIA DevBox. My god, NVIDIA.
  • Reply
  • Tim Dettmers says
  • 2015-08-04 at 16:44
  • This sounds like a very good system. I was not aware of the X10DRG-Q motherboard; usually such mainboards are not available for private customers — this is a great board!
  • I do not know the exact topology of the system compared to the NVIDIA DevBox, but if you have two CPUs this means you will have an additional switch between the two PCIe networks, and this will be a bottleneck where you have to transfer GPU memory through CPU buffers. This makes algorithms complicated and prone to human error, because you need to be careful how you pass data around in your system, that is, you need to take into account the whole PCIe topology (on which network and switch the InfiniBand card sits, on which network the GPU sits, etc.). cuda-convnet2 has some 8-GPU code for a similar topology, but I do not think it will work out of the box.
  • If you can live with more complicated algorithms, then this will be a fine system for a GPU cluster.
  • Reply
  • Vu Pham says
  • 2015-08-05 at 10:05
  • I got it, so I’ll stick to the old plan then. Thank you anyway.
  • Reply
  • Vu Pham says
  • 2015-08-08 at 15:18
  • Hi Tim
  • Fortunately, Supermicro provided me with the X10DRG-Q mobo diagram, and it should also be a general diagram for other socket 2011 dual-socket mobos which have 4 or more PCIe x16 slots. The 2 CPUs are connected by 2 QPI (Intel QuickPath Interconnect) links. If CPU1 has 40 lanes, then 32 lanes go to 2 PCIe x16 slots, 4x to the 10-Gigabit LAN, and 4x to a x4 PCIe slot (x8 slot shape, which will be covered if you install a 3rd graphics card). The 2nd CPU also provides 32 lanes for PCI Express; the 8x then goes to the x8 slot at the top (nearest the CPU socket). Pretty complicated.
  • The point of building a perfect 4x16 PCIe 3.0 setup was that I thought the performance would be halved if the bandwidth goes from 16x down to 8x. Do you have any information on how much the performance differs for, say, a single Titan X on 16x 3.0 vs. 16x 2.0?
  • Reply
  • Tim Dettmers says
  • 2015-08-08 at 15:43
  • Yes, that sounds complicated indeed! A 16x 2.0 will be as fast as an 8x 3.0, so the bandwidth is also halved by stepping down to 2.0. I do not think there exists a single solution which is easy and at the same time cheap. In the end I think the training time will not be that much slower if you run 4 GPUs on 8x 3.0, and with that setup you would not run into any programming problems for parallelism and you will be able to use standard software like Torch7 with integrated parallelism — so I would just go for an 8x 3.0 setup.
  • If you want a less complicated system that is still faster, you can think about getting a cheap InfiniBand FDR card on eBay. That way you would buy 6 cheap GPUs and hook them all up via InfiniBand at 8x 3.0. But probably this will be a bit slower than straight 4x GTX Titan X at 8x 3.0 on a single board.
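  • The rough per-lane numbers behind the “16x 2.0 is as fast as 8x 3.0” statement (treat them as ballpark effective figures, not vendor specs): PCIe 2.0 delivers about 0.5 GB/s per lane and PCIe 3.0 about 1 GB/s per lane. A tiny sketch of the arithmetic:
```cpp
#include <cstdio>

int main() {
    // Approximate effective bandwidth per lane (after encoding overhead), in GB/s.
    const double gen2_per_lane = 0.5, gen3_per_lane = 0.985;

    printf("16x PCIe 3.0: ~%.1f GB/s\n", 16 * gen3_per_lane);   // ~15.8 GB/s
    printf(" 8x PCIe 3.0: ~%.1f GB/s\n",  8 * gen3_per_lane);   // ~7.9 GB/s
    printf("16x PCIe 2.0: ~%.1f GB/s\n", 16 * gen2_per_lane);   // ~8.0 GB/s
    return 0;
}
```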
  • Reply
  • Vu Pham says
  • 2015-08-04 at 16:14
  • I’m so sorry, the X3 version of the Mellanox card does not support RDMA, but the X4 does.
  • Reply
  • gac says
  • 2015-08-05 at 04:19
  • Hi Tim,
  • First of all, excellent blog! I’m putting together a gpu workstation for my research activities and have learned a lot from the information you’ve provided so …. thanks!!
  • I have a pretty basic question. So basic I almost feel stupid asking it but here goes …
  • Given your deep learning setup which has 3x GeForce Titan X for computational tasks, what are your monitors plugged in to?
  • I would like a very similar setup to yours (except I’ll have two 29″ monitors) and I was wondering if it’s possible to plug these into the Titan cards and have them render the display AND run calculations.
  • Or is it better to just have another, much cheaper, graphics card which is just for display purposes?
  • Reply
  • Tim Dettmers says
  • 2015-08-05 at 05:42
  • I have my monitors plugged into a single GTX Titan X and I experience no side effects from that other than a couple of hundred MB of memory that is needed for the monitors; the performance for CUDA compute should be almost the same (probably something like 99.5%). So no worries here, just plug them in where it works for you (on Windows, one monitor would also be an option I think).
  • Reply
  • pedropgusmao says
  • 2015-08-05 at 08:25
  • Hello Tim,
  • First of all thanks for always answering my questions and sorry for coming back with more
  • Do you think a 980 (4GB) is enough for training current neural nets (AlexNet, OverFeat, VGG), or would it be wise to go for a 980 Ti?
  • PS: I am a PhD student, time for me is cheaper than euros
  • Thanks again.
  • Reply
  • howtobeahacker says
  • 2015-08-08 at 08:19
  • Hi, I intend to plug 2 Titan X GPUs into my workstation. The spec of my workstation says that it is possible to have up to 2 NVIDIA K20 GPUs. In fact, the K20 and Titan X are the same size. However, now that I have the first Titan X, I measured that if I plug the second one in, there will only be a tiny space between the 2 GPUs. I wonder if this is safe for the cooling of the GPU system.
  • Hope to have your opinion.
  • Thanks
  • Reply
  • Tim Dettmers says
  • 2015-08-08 at 09:15
  • A very tiny space between GPUs is typical for non-Tesla cards and your cards should be safe. The only problem is that your GPUs might run slower because they reach their 80 degrees temperature limit earlier. If you run a Unix system, flashing a custom BIOS to your Titans will modify the fan regulation so that your GPUs should be cool (< 80 degrees C) at all times. However, this may increase the noise and heat inside the room where your system is located. Flashing a BIOS for better fan regulation will first and foremost increase the lifetime of your GPUs, but overall everything should be fine and safe without any modifications, even if you operate your cards at maximum temperature for some days without pause (I personally used the standard settings for a few years and all my GPUs are still running well).
  • Reply
  • Tim Dettmers says
  • 2015-08-31 at 10:48
  • Indeed, this will work very well if you have only one GPU. I did not know that there was an application which automatically prepares the xorg config to include the cooling settings — this is very helpful, thank you! I will include that in an update in the future.
  • Reply
  • Axel says
  • 2015-08-08 at 19:17
  • Hi Tim,
  • I’m a Caffe user, and since Caffe has recently added support for multiple GPUs, I have been wondering if I should go with a Titan X or with 2 GTX 980s. Which of these 2 configurations would you choose? I’m more inclined towards the 2 GTX 980s, but maybe there are some downsides with this configuration that I haven’t thought about.
  • Thanks!
  • Reply
  • Tim Dettmers says
  • 2015-08-09 at 05:02
  • This is relevant. I do not have experience with Caffe parallelism, so I cannot really say how good it is. So 2 GPUs might be a little bit better than I said in the Quora answer linked above.
  • Reply
  • Roelof says
  • 2015-08-09 at 20:42
  • Hi Tim,
  • Thanks a lot for your great hardware guide!
  • I’m planning to build a 3 x Titan X GPU setup, which will be more or less running on a constant basis: would you say that water cooling will make a big impact on performance (by keeping the temperatures always below the 80 degrees)?
  • As the machine will be installed remotely, where I don’t have easy access to it, I’m a bit nervous about installing a water cooling system in such a setup, with the risk of coolant leakage, so the “risk” has to be worth the performance gain.
  • Do you have any experience with water-cooled systems, and would you say that it would be a useful addition?
  • Also, would you advise a tightly fitted chassis, or a bigger one which allows better airflow?
  • Finally (so many questions :P), do you think 1500 watts with 92-94% efficiency at 100% load should suffice in case I use 4 Titan X GPUs, or would it be better to go for a 1600W PSU?
  • Reply
  • Tim Dettmers says
  • 2015-08-10 at 04:55
  • If you operate the computer remotely, another option is to flash the BIOS of the GPU and crank up the fan to max speed. This will produce a lot of noise and heat, but your GPUs should run slightly below 80 degrees, or at 80 degrees with little performance lost.
  • Water cooling is of course much superior, but if you have little experience with it, it might be better to just go with an air-cooled setup. I have heard that if installed correctly, water cooling is very reliable, so maybe this would be an option if somebody else who is familiar with water cooling helps you to set it up.
  • In my experience, the chassis does not make such a big difference. It is all about the GPU fans, and getting the heat out quickly (which is mostly towards the back and not through the case). I installed extra fans for better airflow within the case, but this only makes a difference of 1-2 degrees. What might help more are extra backplates and small attachable cooling pads for your memory (both about 2-5 degrees).
  • I used a 1600W PSU with 4 GTX Titans which need just as much power as a GTX Titan X and it worked fine. I guess 1500W would also work well and 92-94% efficiency is really good. I would try with the 1500W one and if it does not work just send it back.
  • Reply
  • Roelof says
  • 2015-08-10 at 16:57
  • Thanks for the detailed reponse, I’ve decided to go for:
  • – Chassis: Corsair Carbide Air 540
  • – Motherboard: ASUS X99-E WS
  • – Cpu: Intel(Haswell-e) Core i7 5930K
  • – Ram: 64GB DDR4 Kingston 2133Mhz
  • – Gpu: 3 x NVIDIA GTX TITAN-X 12GB
  • – HD1: 2 X 500GB SSD Samsung EVO
  • – HD2: 3 X 3TB WD Red in RAID 5
  • – PSU: Corsair AX1500i (1500Watt)
  • With a custom-built water cooling system for both the CPU and the 3 Titan X’s, which I hope will let me crank up these babies while keeping the temperature below 80 degrees at all times.
  • The machine is partly (at least the chassis is) inspired by NVIDIA’s recently released DevBox for Deep Learning (https://developer.nvidia.com/devbox), but for almost 1/2 of the price. Will post some benchmarks with the newer cuDNN v3 once it’s built and all set up.
  • Reply
  • Alex says
  • 2015-11-12 at 01:15
  • How did your setup turn out? I am also looking to either build a box or find something ready-made (if it is appropriate and fits the bill). I was thinking of scaling down the NVIDIA DevBox as well. I also saw these http://exxactcorp.com/index.php/solution/solu_detail/233 which are similar. Very expensive.
  • Why is there no mention of Main Gear https://www.maingear.com/custom/desktops/force/index.php anywhere? Are they no good? The price seems too good to be true. I have heard that they break down, but I have also heard that the folks at Main Gear are very responsive and helpful.
  • Thanks for any insight and thanks Tim for the great blog posts!
  • Reply
  • Florijan Stamenković says
  • 2015-08-11 at 12:34
  • Hi Tim!
  • We’ve already asked you for some advice, and it was helpful… We put together a dev box in the meantime, with 4 Titans inside; it works perfectly.
  • Now we are considering production servers for image tasks. One of them would be classification. Considering the differences between training and runtime (runtime handles a single image, forward prop only), we were wondering if it would be more cost-effective to run multiple weaker GPUs, as opposed to fewer stronger ones… We are reasoning that a request queue consisting of single-image tasks could be processed faster on two separate cards, by two separate processes, than on a single card that is twice as fast. What are your thoughts on this?
  • We’ve run very crude experiments, comparing classification speed of a single image on a Titan machine vs. 960M-equipped laptops. The results were more or less as we expected: the Titans are faster, but only about 2x, whereas they are 4x more expensive than a GTX 960 (which has significantly more GFLOPS than the 960M). In absolute terms, classification speed on a weaker card is acceptable; we’re wondering about behavior under heavy load.
  • F
  • Reply
  • Tim Dettmers says
  • 2015-08-11 at 14:01
  • Hi Florijan!
  • I think in the end this is a numbers game. Try to overflow a GTX 960M and a Titan with images, see how fast they go, and compare that with how fast you need to be. Additionally, it might make sense to run the runtime application on CPUs (it might be cheaper and more scalable to run them on AWS or something) and only run the training on GPUs. I think a smart choice will take this into account, along with how scalable and usable the solution is. Some AWS CPU spot instances might be a good solution until you see where your project is headed (that is, if a CPU is fast enough for your application).
  • Reply
  • Florijan Stamenković says
  • 2015-08-11 at 14:09
  • Tim,
  • Thanks for your reply. You’re right, it definitively is a numbers game, I guess we will simply need to stress-test.
  • We already tried to run our classifier on the CPU, but classification time was an order of magnitude slower than on the 960M, so that doesn’t seem like a good option, especially considering the price of a GTX 960 card.
  • We’ll do a few more tests at some point. If we find out anything interesting, I’ll post back here…
  • F
  • Reply
  • howtobeahacker says
  • 2015-08-12 at 05:09
  • Hi Tim,
  • Thanks for your responses. I read your posts and I remember an image of some software in Ubuntu to visualize the state of the GPUs, something similar to the Task Manager for CPUs. If you have information, please let me know.
  • Reply
  • howtobeahacker says
  • 2015-08-13 at 07:41
  • Hi Tim,
  • I have a minor question related to 6-pin and 8-pin power connectors. It is related to your sentence “One important part to be aware of is if the PCIe connectors of your PSU are able to support a 8pin+6pin connector with one cable”.
  • My workstation has one 8-pin cable that ends in TWO 6-pin connectors. Is it possible to plug these two 6-pin connectors in to power up a Titan X, which requires 6-pin and 8-pin power connectors? I think I will try it, because I want to plug in 2 Titan X GPUs and only this way can my workstation support two GPUs.
  • Thank you so much.
  • @An
  • Reply
  • Tim Dettmers says
  • 2015-08-13 at 08:13
  • I think this will depend somewhat on how the PSU is designed, but I think you should be able to power two GTX Titan Xs with one double 6-pin cable, because the design makes it seem that it was intended for just that. Why would they put two 6-pin connectors on a cable if you cannot use them? I think you can find better information if you look up your PSU and see if there is documentation, a specification, or something like that.
  • Reply
  • Peter says
  • 2015-08-13 at 16:22
  • Hi Tim,
  • Firstly, thanks for this article; it’s extremely informative (in fact your entire blog makes fascinating reading for me, since I’m very new to neural networks in general).
  • I want to get a more powerful GPU to replace my old GTX 560 Ti (a great little card, but 1GB of memory is really limiting and I presume it’s pretty slow these days too). Sadly I cannot really afford the GTX Titan X (as much as I’d like to, 1300 CAD is too damn high). The 980 Ti is also a bit on the high end, so I’m looking at the 980, since it’s about 200 CAD cheaper. My question is: how much performance am I gaining going from my old 560 Ti to a 980/980 Ti/Titan X? Is the difference in speed even that large? If it’s worth saving for the bigger card then I’ll just have to be patient.
  • I’m currently running Torch7 and an LSTM-RNN with batches of text, not images, but if I want to do image learning I assume I’d want as much RAM as possible?
  • Cheers
  • Reply
  • Tim Dettmers says
  • 2015-08-16 at 09:39
  • The speedup should be about 4x when you go from a GTX 560 Ti to a GTX 980. The 4GB of RAM on the GTX 980 might be a bit restrictive for convolutional networks on large image datasets like ImageNet. A GTX Titan X or GTX 980 Ti will only be 50% faster than a GTX 980. If you wait about 14-18 months you can get a new Pascal card which should be at least 12x faster than your GTX 560 Ti. I personally would value getting additional experience now as more important than getting less experience now and faster training in the future — or in other words, I would go for the GTX 980.
  • Reply
  • Peter says
  • 2015-08-17 at 15:36
  • How exactly would I be restricted by the 4GB of RAM? Would I simply not be able to create a network with as many parameters, or would there be other negative effects (compared to the 6GB of the 980 Ti)?
  • You’ve mentioned in the past that bandwidth is the most important aspect of the cards, and the 980 Ti has 50% higher bandwidth than the regular 980; would that mean it’s 50% faster too, or are there other factors involved?
  • Reply
  • Tim Dettmers says
  • 2015-08-17 at 16:56
  • Yes, that’s correct: if your convolutional network has too many parameters it will not fit into your RAM. Other factors besides memory bandwidth only play a minor role, so indeed, it should be about 50% better performance (not the 33% I quoted earlier; I edited this for correctness just now).
  • Reply
  • Tori says
  • 2015-08-18 at 00:38
  • Thank you so much for such informative article!
  • How would GTX Titan Z compare to GTX Titan X for the purpose of training a large CNN? Do you think it’s worth the money to buy a GTX Titan Z or is a GTX Titan X good enough? Thanks!
  • Reply
  • Carles Gelada says
  • 2015-08-27 at 17:35
  • I have been looking for an affordable CPU with 40 lanes without luck. Could you give me a link?
  • I am also curious about the actual performance benefit of 16x vs 8x. If the bottleneck is the DMA writes, will the performance be reduced by half?
  • Reply
  • mxia.mit@gmail.com says
  • 2015-09-01 at 21:18
  • Hey Tim,
  • Thank you so much for this great writeup, it’s been pivotal in helping me and my co-founder understand the hardware. We’re a duo from MIT currently working on a venture backed startup bringing deep learning to education, hoping to help at least improve, if not fix, the US education system.
  • Our first build aims to be cheap where it can be (since both of us are beginners and we need to be frugal with our funding) but future-proof enough for us to do harder things.
  • My current build consists of these parts:
  • Mobo: Asus X99-E WS SSI CEB LGA2011-3 Motherboard
  • CPU: Intel Core i7-5820K 3.3GHz 6-Core Processor
  • Video Card: EVGA GeForce GTX 960 4GB SuperSC ACX 2.0+ Video Card
  • PSU: EVGA 850W 80+ Gold Certified Fully-Modular ATX Power Supply
  • RAM: Corsair Vengeance LPX 16GB (2 x 8GB) DDR4-3000
  • Storage: Sandisk SSD PLUS 240GB 2.5″ Solid State Drive
  • Case: Corsair Air 540 ATX Mid Tower Case
  • Could you look over these and offer any critique? My logic was to have a mobo and CPU that can handle upgrading to better hardware later; things like the PSU, RAM, and the 960 I’m willing to replace later on.
  • Thank you in advance! Also is there a way we could exchange emails and chat more?
  • Would love any advice we can get from you while we build out our product.
  • Best,
  • Mike Xia
  • Reply
  • Tim Dettmers says
  • 2015-09-02 at 08:31
  • Looks good. The build is a bit more expensive due to the X99 board, but as you said, that way it will be upgradeable in the future which will be useful to ensure good speed of preprocessing the ever-growing datasets. You are welcome to send me an email. My email is firstname.lastname@gmail.com
  • Reply
  • Colin McGrath says
  • 2015-09-02 at 05:23
  • What are your opinions on RAID setups in a deep learning rig? Software-based RAID is pretty crappy in my experience and can cause a lot more problems than it solves. However, RAID controllers take a PCIe slot, and those will, fortunately/unfortunately, all be taken by 4x Gigabyte GTX 980 Ti cards. Is it worth running RAID with the software controller? Or is it better just to do full clone backups?
  • Reply
  • Tim Dettmers says
  • 2015-09-02 at 08:26
  • I do not think it is worth it. Usually, a common SATA SSD will be fast enough for most kinds of data; in some cases there will be a decrease in performance because the data takes too long to load, but compared to the effort and money spent on a (hardware) RAID system it is just not worth it.
  • Reply
  • Sascha says
  • 2015-09-05 at 16:44
  • Hi,
  • thanks a lot for all this information. After stumbling across a paper from Andrew Ng et al. (“Deep learning with COTS HPC systems”) my original plan was to also build a cluster (to learn how it is done). I wanted to go for two machines with a bunch of GTX Titans, but after reading your blog I settled on only one PC with two GTX 980s for the time being. My first thought after reading your blog was to actually settle for two 960s, but then I thought about the energy consumption you mentioned. Looking at the specifications of the NVIDIA cards, I figured the 980 was the most efficient choice currently (at least as long as you have to pay German energy prices).
  • As I am still relatively fresh to machine learning, I guess this setup will keep me busy enough for the next couple of months, probably until the Pascal architecture you mentioned is available (I read somewhere 2nd half of 2016). If not, then I guess I will buy another PC and move one of the 980s into it so that I can learn how to set up a cluster (my current goal is learning as much as possible, as fast as possible).
  • The configuration I went for is as follows:
  • CPU: Intel i7-5930K (I chose this one instead of the much cheaper 5820K as it has the 40 PCIe lanes you mentioned, which gives the additional flexibility of handling 4 graphics cards)
  • Mainboard: ASRock Fatal1ty X99 Professional (supports up to 4 graphics cards and has a M.2 slot)
  • RAM: 4×8 GB DDR4-3000
  • Graphics Card: 2x Zotac GTX 980 AMP! Edition
  • Hard Disk: Samsung SSD SM951 with 256 GB (thanks to M.2 it offers 2 GB/s of sequential read performance)
  • Power Supply: be quiet! BN205 with 1200 Watts
  • I hope that installing Linux on the SSD works, as I read that the previous version of this SSD caused some problems.
  • Thanks again
  • Sascha
  • Reply
  • Tim Dettmers says
  • 2015-09-06 at 06:35
  • Hi Sascha! Your reasoning is solid and it seems you have a good plan for the future. Your build is good, but as you say, the PCIe SSD could be a bit problematic to set up. Another fact to be aware of is that your GPUs will have a slower connection with that SSD, because the SSD takes away bandwidth from your GPUs (your GPUs will run at 16x/8x instead of 16x/16x). Overall the PCIe SSD would be much faster for common applications but slower when you use parallelism on two GPUs, so it might be better to go for a SATA SSD (if you do not use parallelism that much, a PCIe SSD is a solid choice). A SATA SSD will be slower than the PCIe one, but it should still be fast enough for any deep learning task. However, preprocessing will be slower on this SSD, and this is probably the main advantage of the PCIe SSD.
  • Reply
  • Sascha says
  • 2015-09-06 at 09:49
  • That is an interesting point you make regarding the M.2. I did not realise that this is how the board will distribute the lanes. I figured that as the M.2 only uses 4 lanes the two cards could each run with 16 and if I actually decided to scale up to a quad setup each card eventually would only get 8 lanes.
  • My first idea after reading the comment was to just try the SSD in the additional M.2 PCIe 2.0 slot, which is basically a SATA 6Gb/s connection, but that will not work: it will not fit because one has the Key B and the other the Key M layout.
  • Do you have an idea about what this actually means for real life performance in deep learning tasks (like x% slower)?
  • Greetings
  • Sascha
  • Reply
  • Tim Dettmers says
  • 2015-09-06 at 10:23
  • When I think about it again, I might be wrong about what I just said. How two GPUs and the PCIe SSD will work together depends highly on your motherboard, how the PCIe slots are wired, and how the PCIe switches are distributed. I think with a 40-lane CPU and a mainboard that supports a 16x/16x/8x layout, it should be possible to use 16 lanes for each of your GPUs and 8 lanes for your SSD; to use that setup you only need to make sure to plug everything into the right slot (your mainboard manual should state how to do this). I have not looked at your hardware in detail, but I think your hardware supports this.
  • If your motherboard does not support 16x/16x/8x, then your GPU parallelism will suffer from that. Convolutional nets will see a penalty of 5-15% depending on the architecture; recurrent networks may see little or no penalty (LSTMs) or a high penalty (20-50%) if they have many parameters, like vanilla RNNs.
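  • To make the lane arithmetic above concrete, here is a minimal Python sketch; the 40-lane budget and the per-device lane counts are assumptions, and the motherboard manual decides the actual wiring:
```python
# Rough PCIe lane-budget check for a 40-lane CPU (assumed numbers; the
# motherboard's slot wiring decides what is actually possible).
CPU_LANES = 40

def fits(allocation, budget=CPU_LANES):
    """allocation: dict mapping device name -> requested lanes."""
    used = sum(allocation.values())
    return used <= budget, used

# Layout discussed above: two GPUs at x16 plus the M.2 SSD at x8.
print(fits({"GPU0": 16, "GPU1": 16, "M.2 SSD": 8}))                 # (True, 40)
# A 16x/16x layout plus two more x8 devices exceeds the budget:
print(fits({"GPU0": 16, "GPU1": 16, "GPU2": 8, "M.2 SSD": 8}))      # (False, 48)
```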
  • Reply
  • vinay says
  • 2015-09-08 at 15:25
  • Does anyone know what the requirements for prediction clusters would be? Most articles focus on training aspects, but inference/prediction is also important and the compute demand for it is little discussed. Can anyone comment on compute demands for prediction? Also, what do you recommend for such tasks: CPU only, CPU+GPU, CPU+FPGA, etc.?
  • Thanks,
  • Vinay
  • Reply
  • Tim Dettmers says
  • 2015-09-08 at 17:15
  • Which solution is suitable depends on many factors. If you build a web application, how long do you want your user to wait for a prediction (response time)? How many predictions are requested per second in total (throughput)?
  • Prediction is much faster than training, but still a forward pass of about 100 large images (or similar large input data) takes about 100 milliseconds on a GPU. A CPU could do that in a second or two.
  • If you predict one data point at a time a CPU will probably be faster than a GPU (convolution implementations relying on matrix multiplication are slow if the batch sizes are too small), so GPU processing is good if you need high throughput in busy environments, and a CPU for single predictions (1 image should take only about 100 milliseconds with a good CPU implementation). Multiple CPU servers might also be an option, and usually they are easier to maintain and cheaper (AWS spot instances for example, which are also useful for GPU work). Keep in mind that all these numbers are reasonable estimates only and will differ from the real results; results from a testing environment that simulates the real environment will make it clear whether CPU servers or GPU servers are optimal.
  • I do not recommend FPGAs for such tasks since their interfaces are not easy to maintain over time and cloud solutions do not exist (as far as I know).
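  • As a back-of-the-envelope version of the reasoning above, here is a minimal Python sketch; the latency figures are the rough numbers quoted in this reply, not benchmarks:
```python
# Rough serving estimates using the numbers quoted above (assumptions, not
# measurements): ~100 ms per GPU forward pass over a batch of 100 images,
# ~100 ms per single image with a good CPU implementation.
def gpu_serving(batch_size=100, batch_latency_s=0.10):
    throughput = batch_size / batch_latency_s   # images per second
    response_time = 2 * batch_latency_s         # wait for a batch to fill, then run it
    return throughput, response_time

def cpu_serving(image_latency_s=0.10, n_servers=1):
    return n_servers / image_latency_s, image_latency_s

print("GPU:", gpu_serving())   # high throughput (~1000 img/s), ~0.2 s response time
print("CPU:", cpu_serving())   # low throughput (~10 img/s), ~0.1 s response time
```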
  • Reply
  • Colin McGrath says
  • 2015-09-08 at 18:39
  • I just want to thank you again Tim for the wonderful guide. I do have a couple of hardware utilization questions though. I am trying to figure out how to properly partition my space in ubuntu to handle my requirements. I dual boot Windows 10 (for work/school) and Ubuntu 14.04.3 (deep learning) with each having their own SSD boot drive and HDD storage drive. For starters here’s my setup:
  • – ASRock X99 WS-E
  • – 1x Gigabyte G1 980 ti
  • – 16GB Corsair Vengeance RAM 2133
  • – i7-5930k
  • – 2x Samsung 850 Pro 256GB SSDs (boot drives)
  • – 2x Seagate Barracuda 3TB HDDs (storage drives)
  • My Windows install is fine, but I want to be able to store currently unused data on the HDD, stage batches on the SSD, and then send the batches from SSD to RAM to fully leverage the IOPS gain of an SSD (a rough sketch of this staging idea follows the partition list below).
  • I currently have Ubuntu partitioned this way, however I’m not entirely sure this will fit my needs. I’m thinking I might want to allocate /home on the HDD due to how ubuntu handles the /home directory in the UI, but I’m unsure if that will be a problem with deep learning:
  • SSD (boot):
  • – swap area – 16GB
  • – / – 20GB
  • – /home – 20GB
  • – /var – 10GB
  • – /boot – 512MB
  • – /tmp – 10GB
  • – /var/log – 10GB
  • HDD
  • – /store 1TB
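  • A minimal Python sketch of the HDD-to-SSD-to-RAM staging described above (the paths and the .npy file name are hypothetical):
```python
# Sketch of staging a dataset shard from slow HDD storage to the SSD, then
# loading it into RAM. Paths and file names are made up for illustration.
import shutil
from pathlib import Path
import numpy as np

HDD_STORE = Path("/store/datasets")    # cold storage on the HDD
SSD_STAGE = Path("/home/user/stage")   # staging area on the SSD

def stage_and_load(name):
    SSD_STAGE.mkdir(parents=True, exist_ok=True)
    src, dst = HDD_STORE / name, SSD_STAGE / name
    if not dst.exists():               # pay the slow HDD read only once
        shutil.copy(src, dst)
    return np.load(dst)                # fast reads from the SSD into RAM

# batch = stage_and_load("train_shard_000.npy")
```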
  • Reply
  • Michael Holm says
  • 2015-09-23 at 21:22
  • Hello Tim,
  • Thank you for your article. The deep learning devbox (NVIDIA) has been touted as cutting edge for researchers in this area. Given your dual experience in both the hardware and algorithm sides, I would be grateful to hear your general thoughts on the devbox. I know it came out a few months after you wrote your article.
  • Thank you!
  • Reply
  • Tony says
  • 2015-09-25 at 18:29
  • Tim, thanks again for such a great article.
  • One concern that I have is that I also use triple monitors for my work setup. However, doesn’t the fact that you’re using triple monitors affect the performance of your GPU? Do you recommend buying a cheap $50 GPU for your triple monitor setup and then dedicating your Titan X or your more expensive card primarily to deep learning? I run recurrent neural nets.
  • Thanks!
  • Reply
  • Tim Dettmers says
  • 2015-09-25 at 19:30
  • Three monitors will use up some additional memory (300-600MB) but should not affect your performance greatly (< 2% performance loss). I recommend getting a cheap GPU for your monitors only if you are short on memory.
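  • If you want to see what your monitors actually cost on your card, here is a minimal sketch using the nvidia-smi tool that ships with the driver (query fields can vary by driver version); it also notes the usual way to pin training to one card:
```python
# Query per-GPU memory use via nvidia-smi to see what the desktop/monitors take.
import subprocess

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,name,memory.used,memory.total",
     "--format=csv,noheader,nounits"]).decode()
for line in out.strip().splitlines():
    idx, name, used, total = [x.strip() for x in line.split(",")]
    print("GPU {} ({}): {} MiB of {} MiB in use".format(idx, name, used, total))

# If you do add a cheap display GPU later, most CUDA-based libraries respect
# CUDA_VISIBLE_DEVICES, e.g. `CUDA_VISIBLE_DEVICES=1 python train.py` pins
# the deep learning work to the second card.
```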
  • Reply
  • Tony says
  • 2015-09-28 at 19:00
  • Thanks — that makes a lot of sense. I just thought it would affect your bandwidth (as that is usually the bottleneck). I’m currently running the 980 Ti — I know it has 336 GB/s. Good to know that it uses some memory though. Appreciate it.
  • Reply
  • ML says
  • 2015-09-28 at 17:02
  • Hello Tim, what about external graphic cards connected through Thunderbolt? Have you looked at those? Could that be a cheap solution without having to build/buy a new system?
  • Reply
  • Tim Dettmers says
  • 2015-09-28 at 17:33
  • I looked at some performance reviews and they state about 70-90% performance for gaming. For deep learning the only performance bottleneck will be transfers from host to GPU and from what I read the bandwidth is good (20GB/s) but there is a latency problem. However, that latency problem should not be too significant for deep learning (unless it’s a HUGE increase in latency, which is unlikely). So if I put these pieces of information together it looks as if an external graphics card via Thunderbolt should be a good option if you have an apple computer and have the money to spare for the suitable external adapter.
  • Reply
  • Safi says
  • 2015-10-04 at 22:54
  • Hi Tim,
  • First, thanks a lot for these interesting and useful topics. I am a PhD student; I work on evolutionary ANNs.
  • I want to start using GPUs; my budget can reach $150 max.
  • I found in my town a new GTX 750 and a GTX 650 Ti. Which one is better, and are they supported by cuDNN?
  • Thanks
  • Reply
  • Tim Dettmers says
  • 2015-10-05 at 07:45
  • A GTX 750 should be better, and both support cuDNN. However, I would also suggest that you have a look at AWS GPU instances. The instance will be a bit faster and may suit your budget well.
  • Reply
  • Greg says
  • 2015-10-06 at 21:08
  • Hi Tim..
  • Recently I have had a ton of trouble working with Ubuntu 14.04 … installing CUDA, Caffe, etc. Ubuntu has password-locked me out of my system twice, and getting all dependencies installed to get Caffe to build has been a real problem. It works sometimes … other times it doesn’t. Ubuntu 14.04 is clearly an unstable OS.
  • I would like your opinion, Tim, on moving from Linux to Windows for deep learning. What are your thoughts?
  • Thanks in advance…
  • -Greg
  • Reply
  • Tim Dettmers says
  • 2015-10-07 at 10:34
  • I can feel your pain — I have been there too! Ubuntu 14.04 is certainly not intuitive when you are switching from Windows, and a seemingly simple command can ruin your installation. However, I found that once you understand how everything is connected in Linux, things get easier, make sense, and you no longer run into errors which break your installation or even your OS. After this point, programming in Linux will be much more comfortable than in Windows due to the ease of compiling and installing any library. So it may be painful, but it is well worth it. You will gain a lot if you go through the pain-mile — keep it up!
  • Reply
  • Greg says
  • 2015-10-07 at 18:43
  • After Ubuntu 14.04 locked me out 3 times at boot with a login screen that rejected my password … I thought I’d try Ubuntu 15.04. I think the CUDA driver slammed Unity, resetting the root password to something other than the password I gave it. I searched the web and this is a common problem and there seems to be no fix.
  • I’m running an X99 motherboard, i7-5930K, 64 GB RAM, and one Titan X. I’ll get a second Titan X when I’m ready for it. I want to create my own NN and nodes, but for now I have a ton of learning to do and I need to follow what’s been done so far.
  • Do you use standard libraries and algorithms like Caffe, Torch 7 and Theano via Python? I feel I need to wade through everything to see how it works before using it. Nvidia Digits looks pretty simple working from the GUI but it also looks, from my limited experience, like it’s pretty limited.
  • Reply
  • Tim Dettmers says
  • 2015-10-07 at 18:49
  • Is this because of your X99 board? I never had any problems like that. As for the software, Torch7 and Theano (Keras and derivatives) work just fine for me. I have tried Caffe once and it worked, but I have also heard some nightmare stories about installing Caffe correctly. NVIDIA DIGITS will be just as you described: simple and fast, but if you want to do something more complex it will just be an expensive, fast PC with 4 GTX Titan X.
  • Reply
  • mxia.mit@gmail.com says
  • 2015-10-07 at 20:36
  • Just to tag onto this, I have an X-99 E board, and had some problems on the initial install when trying to boot into ubuntu’s live installer, nothing with the password though. After installing everything worked fine at the OS level. In case this is relevant, reflashing to the latest BIOS helped a lot, but probably won’t help your password problem.
  • Cheers and best of luck!
  • Mike
  • Reply
  • Greg says
  • 2015-10-08 at 05:13
  • Yes, I did the BIOS flash in the beginning.
  • Lastly, I kept testing and found the culprit … when installing CUDA I can’t install the 502 driver that it comes with or the Ubuntu system locks with an unknown password — no matter how many different ways I try to install the CUDA driver. I scoured the internet for a solution and there wasn’t one, and it looks like no one has put 2 and 2 together about the CUDA driver. It could be a combination of things, both hardware and software, but it definitely involves this driver, the X99 motherboard, a Titan X, and Ubuntu 14.04 and 15.04.
  • Thanks.
  • Reply
  • Brent Soto says
  • 2015-10-08 at 16:30
  • Hi Tim, The company that I buy my servers from (Thinkmate) recently sent me an e-mail advertising that they’ve been working with Supermicro to sell servers with support for Titan X. What do you think about this solution? I’ve had a lot of luck with Supermicro servers, and they offer 3 year warranty on the Titans and will match the price if found cheaper elsewhere. Here’s the link: http://www.thinkmate.com/systems/servers/gpx/gtx-titan-x
  • Reply
  • Tim Dettmers says
  • 2015-10-08 at 18:42
  • Hi Brent, I think in terms of the price, you could definitely do better on the 1U model with 4 GTX Titan X. A normal board with 1 CPU will not have any disadvantage compared to the 1U model for deep learning.
  • However, the 4U model is different because it can use 8 GTX Titan X with a fast CPU-to-CPU switch which makes parallelization of 8 GPUs easy and fast. There are only a few solutions available that are built like this and come with 8 GTX Titan X — so while the price is high, this is a rather unique and good solution.
  • Reply
  • Nghia Tran says
  • 2015-10-10 at 07:59
  • Hi Tim,
  • Thank you very much for all the writing. I am an Objective-C developer but completely new to deep learning and very interested in this area right now.
  • I have a Mac 3.1 and I would like to upgrade the graphics card to get CUDA so I can run Torch7, Lua and nn and learn this kind of programming. It does not matter whether it is a Mac card or a Windows card.
  • Which one would you recommend? GTX 780 Ti? GTX 960 2GB? GTX 980? Tesla M2090 (second hand)?
  • Look forward to your advice.
  • Reply
  • Tim Dettmers says
  • 2015-10-10 at 10:40
  • From the cards you listed the GTX 980 will be the best by far. Please also have a look at my GPU guide for more info how to choose your GPU.
  • Reply
  • Nghia Tran says
  • 2015-10-29 at 16:47
  • Thank you very much. I got a generous sponsor to build up a new ubuntu machine with 2 GTX 780 Ti. Should I use the GTX 980 in the new machine to yield better performance than a SLI GTX 780 Ti or let it stay in my Mac?
  • Reply
  • Tim Dettmers says
  • 2015-10-30 at 09:31
  • If you already have the two GTX 780 Ti I would stick with that and only change/add the GPU if you experience RAM shortage for one of your models.
  • Reply
  • Nghia Tran says
  • 2015-10-30 at 12:02
  • Thank you very much Tim. I am looking forward to your further writing.
  • By the way, do you have time to look at the neuro-synaptic chip from IBM yet? Really interested in your “deep analysis” on this as well.
  • Greg says
  • 2015-10-20 at 00:45
  • Hey Tim…
  • Do you have any suggestions for a tutorial for DL using Torch7 and Theano and/or Keras?
  • Thanks
  • Greg
  • Reply
  • BK says
  • 2015-10-21 at 16:50
  • Hi Tim,
  • Great post; In general all of the content on your blog has been fantastic.
  • I’m a little curious about your thoughts on other types of hardware for use in deep learning. I’ve heard a number of people suggest FPGAs to be potentially useful for deep learning (and parallel processing in general) due to their memory efficiency vs. GPUs. This is often mentioned in the context of Xeon Phi … what are your thoughts on this? If true, where does the usefulness lie, in the ‘training’ or ‘scoring’ part of deep learning (my perhaps incorrect understanding was that GPUs’ advantage was their use for training as opposed to scoring)?
  • My apologies for what I’m certain are sophomoric questions; I’m trying to wrap my head around these matters as someone new to the subject!
  • Regards,
  • BK
  • Reply
  • Tim Dettmers says
  • 2015-10-26 at 22:02
  • Nonsense, these are great questions! Keep them coming!
  • FPGAs could be quite useful for embedded devices, but I do not believe they will replace GPUs. This is because (1) their individual performance is still worse than an individual GPU and (2) combining them into sets of multiple FPGAs yields poor performance while GPUs provide very efficient interfaces (especially with NVLink which will be available at the end of 2016). GPUs will make a very big jump in 2016 (3D memory) and I do not think FPGAs will ever catch up from there.
  • Xeon Phi is potentially more powerful than GPUs, because it is easier to optimize them at the low level. However, they lack the software for efficient deep learning (just like AMD cards) and as such it is unlikely that we will see Xeon Phis being used for deep learning in the future (unless Intel creates a huge deep learning initiative that rivals NVIDIA’s).
  • Reply
  • BK says
  • 2015-10-28 at 15:29
  • Thanks for the response! That’s very interesting.
  • I wanted to follow up a little bit regarding software development for NVIDIA vs. Intel or AMD. I know how much more developed CUDA libraries are for deep learning than OpenCL. What frameworks can I actually run on an Intel or AMD architecture? Do Torch/Caffe/Theano only work on NVIDIA hardware? Once again, my apologies if I’m fundamentally misunderstanding something.
  • One last question, beyond the world of deep learning: what is the perception of Xeon Phi? It seems hard to find people who are talking with certainty about what its strengths/applications will be. Is there any consensus on this? What do you think makes the most sense as an application for Xeon Phi?
  • Many thanks!
  • -BK
  • Reply
  • Eric says
  • 2015-10-26 at 08:29
  • Tim,
  • Thank you for the many detailed posts. I am going with a one GPU Titan X water cooled solution based on information here. Does it still hold true that adding a second GPU will allow me to run a second algorithm but that it will not increase performance if only one algorithm is running? Best Regards – Eric
  • Reply
  • Tim Dettmers says
  • 2015-10-26 at 22:16
  • There are now many good libraries which provide good speedups for multiple GPUs. Torch7 is probably the best of them. Look for the Torch7 Facebook extensions and you should be set.
  • Reply
  • Eystein says
  • 2015-11-19 at 19:26
  • Hello! First off, I just want to say this website is a great initiative!
  • I’m going to use Kaldi for speech recognition next spring in my master’s thesis. Not knowing exactly what type of DNNs I’ll be implementing, I’m planning for an all-round solid budget GPU. Is the GTX 950 with 2 GB suitable (I haven’t seen it mentioned here)? It only requires a 350 W PSU, which is why I’m considering it. Also I have a Q6600 CPU and a motherboard that takes 4 GB RAM at most, so this is a bit constraining on the overall performance of this setup. And apologies if this is too general a question; I’m just now getting into the field.
  • Reply
  • Tim Dettmers says
  • 2015-11-21 at 11:18
  • The GTX 950 2GB variant might be a bit short on RAM for speech recognition if you use more powerful models like LSTMs. The cheapest solution might be to prototype on your CPU and use AWS GPU instances to run the model if everything looks good. This way you need no new computer/PSU and will be able to run large LSTMs and other models. If this does not suit you, a GTX 950 with 4GB of memory might be a good choice.
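  • To get a feeling for why 2GB is tight for LSTMs, here is a minimal Python sketch of a rough parameter/activation estimate; the layer sizes, sequence length, and overhead factors are assumptions, not a measurement:
```python
# Very rough memory estimate for a stacked-LSTM acoustic model (assumed sizes).
def lstm_params(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (hidden_size * input_size + hidden_size**2 + hidden_size)

def rough_gb(params, batch, seq_len, hidden, bytes_per_float=4):
    weights = 3 * params * bytes_per_float                 # weights + gradients + momentum
    acts = 8 * batch * seq_len * hidden * bytes_per_float  # gates, cell and output states
    return (weights + acts) / 1024**3

p = lstm_params(440, 1024) + lstm_params(1024, 1024)       # two stacked 1024-unit layers
print(round(rough_gb(p, batch=64, seq_len=500, hidden=1024), 2), "GB")
```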
  • Reply
  • Fusiller says
  • 2015-11-29 at 18:16
  • Just a quick note to say thank you and congrats for this great article.
  • Very nice of you to share your experience on the matter.
  • Regards.
  • Alex
  • Reply
  • Rohit Mundra says
  • 2015-12-22 at 02:48
  • Hey Tim,
  • Thanks for the great article; I have a more specific question though – I’m building an entry-level Kaggle-worthy system using an i7-5820K processor. Since I want to keep my GTX 960’s 4GB memory solely for deep learning, would you recommend I buy an additional (cheaper) graphics card for the display or not? I’m considering the GT 610 for this purpose since it’s cheap enough. Also, if I were to do this, where would I specify such a setting (e.g. use the GT 610 for display)?
  • Thanks again!
  • Rohit
  • Reply
  • Tim Dettmers says
  • 2015-12-22 at 14:06
  • For most datasets on Kaggle your GPU memory should be okay, and using another small GPU for your monitors will not do much. However, if you are doing one of the deep learning competitions and you find yourself short on memory, and you think you could improve your score by using a model that is a bit larger, then this might be worth it. So I would only consider this option if you really encounter problems where you are short on memory.
  • Also remember that the memory requirements of convolutional nets increase most quickly with the batch size, so going from a batch size of 128 to 96 or something similar might also solve memory problems (although this might also decrease your accuracy a bit; it all depends quite heavily on the data set and problem). Another option would be to use the Nervana Systems deep learning libraries, which can run models in 16-bit and thus halve the memory footprint.
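  • Here is a minimal Python sketch of that batch-size arithmetic; the feature-map shapes are made-up examples rather than a specific network:
```python
# Rough activation-memory estimate showing how batch size and 16-bit storage
# change the footprint (feature-map shapes are assumptions for illustration).
def activation_gb(batch_size, feature_maps, bytes_per_value=4):
    # feature_maps: list of (channels, height, width) per conv layer
    values = batch_size * sum(c * h * w for c, h, w in feature_maps)
    return 2 * values * bytes_per_value / 1024**3   # x2 for forward + backward, roughly

shapes = [(64, 112, 112), (128, 56, 56), (256, 28, 28), (512, 14, 14)]
for bs in (128, 96, 64):
    print(bs, round(activation_gb(bs, shapes), 2), "GB at 32-bit,",
          round(activation_gb(bs, shapes, bytes_per_value=2), 2), "GB at 16-bit")
```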
  • Reply
  • JB says
  • 2016-01-10 at 22:01
  • Tim,
  • First of all, thank you for writing this! This post has been extremely helpful to me.
  • I’m thinking about getting a GTX 970 now and upgrading to Pascal when it comes out. So, if I never use more than 3.5GB of VRAM at a time, then I won’t see performance hits, correct? I’m building my rig for deep reinforcement learning (mostly Atari right now), so my minibatches are small (<2MB), and so are my convnets (<2 million weights). Should I be fine until Pascal?
  • I’m trying to decide between these two budget builds: [Intel Xeon e5](http://pcpartpicker.com/p/dXbXjX) and [Intel i5](http://pcpartpicker.com/p/ktnHdC). I’m thinking about going with the Xeon, since it has all 40 PCIe lanes if I wanted to do more than two GPUs in the future, and it’s a beefier processor. However, I start grad school in the fall, so I’d have university hardware then, and I think I’d be more than fine with two GPUs for personal experiments in the future. (Or could 4 lanes be enough bandwidth for a GPU?) If I get the i5 I could upgrade the processor without having to upgrade the motherboard if I wanted. The processor just needs to be good enough to run (Atari) emulations and preprocess images right now. I can’t really imagine anything but the GPU being the bottleneck, right?
  • Thank you for the help. I’m trying to figure out something that will last me awhile, and I’m not very familiar with hardware yet.
  • Thanks again,
  • – JB
  • Reply
  • Tim Dettmers says
  • 2016-01-25 at 14:04
  • Hi JB,
  • the GTX 970 will perform normally if you stay below 3.5GB of memory. Since your mini-batches are small and you seem to have rather few weights, this should fit quite well into that memory. So in your case the GTX 970 should give you optimal cost/performance.
  • Reply
  • Alex Blake says
  • 2016-01-19 at 07:19
  • Hi Tim:
  • Thanks so much for sharing your knowledge!
  • I’ve seen you mention that Ubuntu is a good OS.
  • What is the best OS for deep learning?
  • What is a good alternative to Ubuntu?
  • I’d really appreciate your thoughts on this…
  • Reply
  • Tim Dettmers says
  • 2016-01-25 at 14:08
  • Linux-based systems are currently best for deep learning since all major deep learning software frameworks support Linux. Another advantage is that you will be able to compile almost anything without problems, while on other systems (Mac OS, Windows) there will always be some problems, or it may be nearly impossible to configure the system well.
  • Ubuntu is good because it is widely used, easy to install and configure, and its LTS versions receive long-term support, which makes it attractive for software developers who target Linux systems. If you do not like Ubuntu you can use Kubuntu or other *buntu variants; if you like a clean slate and want to configure everything the way you like, I recommend Arch Linux, but beware that it will take a while until you have configured everything in a way that suits you.
  • Reply
  • Lawrence says
  • 2016-02-06 at 22:36
  • Hi Tim,
  • Great website ! I am building a Devbox, https://developer.nvidia.com/devbox.
  • My machine has 4 Titan X cards, a Kingston Digital HyperX Predator 480 GB PCIe Gen2 x4 SSD, an Intel Core i7-5930K Haswell-E, and 64GB of G.SKILL RAM. I am using an ASUS RAMPAGE V Extreme motherboard. When I place the last Titan X card in the last slot, my SSD disappears from the BIOS. I am not sure if I have a PCIe conflict. Can the M.2 interfere with PCIE_X8_4? What should I do to fix this issue? Should I change the motherboard? Any advice?
  • Reply
  • Tim Dettmers says
  • 2016-02-07 at 11:46
  • Your motherboard only supports 40 PCIe lanes, which is standard, because CPUs only support a maximum of 40 PCIe lanes. Your 4 Titan X will run in 16x/8x/8x/8x lane mode. You might be able to switch the first GPU to 8x manually, but even then CPUs and motherboards usually do not support a 8x/8x/8x/8x/8x mode (usually two PCIe switches are supported for a single GPU, and a single PCIe switch supports two devices, so you can only run 4 PCIe devices in total). This means that there is probably no possibility to get your PCIe SSD working with 4 GPUs. I might be wrong. To check this it is best to contact your ASUS tech support and ask them if the configuration is possible or not.
  • Reply
  • Bobby says
  • 2016-02-19 at 07:07
  • Hi Tim,
  • Thank you for the wonderful guide.
  • Like Lawrence, I’m also building a GPU workstation using https://developer.nvidia.com/devbox as the guide. It mentions a “512GB PCI-E M.2 SSD cache for RAID”. I wonder how to set up this SSD as the cache for the RAID, since RAID 5 does not support this as far as I know. Have you done anything similar? Thank you very much.
  • Reply
  • Tim Dettmers says
  • 2016-02-19 at 16:00
  • Hi Bobby,
  • I have no experience with RAID 5, since usual datasets will not benefit from increased read speeds as long as you have an SSD. I think you will need to change some things in the BIOS and then set up a few things for your operating system with a RAID manager. I think you will be able to find a tutorial for your OS online so you can get it running.
  • Reply
  • Bobby says
  • 2016-02-21 at 01:12
  • Hi Tim,
  • It seems it’s not related to the RAID. I wonder how to set up an SSD as the cache for a normal HDD; setting it as the cache for a RAID should be similar. With this, I may not need to manually copy my dataset from HDD to SSD before an experiment. Thank you.
  • Freddy says
  • 2016-02-08 at 14:15
  • Hey Tim,
  • first of all, thank you very much for your great article. It helped me a lot to gain some insight into the hardware requirements for a DL machine. Over the past several years I have only worked with laptops (in my free time) as I had some good machines at work. Now I am planning to set up a system at home to start experimenting in my free time. After I read your post and many of the comments I started to create a build (http://de.pcpartpicker.com/p/gdNRQ7), and as you have looked over so many systems and given advice, I hoped that you could maybe do it once again.
  • I chose the 970 as a starter and will wait for the Pascal cards coming out later this year. I am also not planning to work with more than 2 GPUs at home in the future. As for the monitor, I already have one 24″ at home, so this will just be the second.
  • I don’t know, maybe you can look over it and give me some advice or your opinion.
  • Reply
  • Tim Dettmers says
  • 2016-02-09 at 14:20
  • Looks like a solid build for a GTX 970 and also after an upgrade to one or two Pascals this is looking very good.
  • Reply
  • Freddy says
  • 2016-02-09 at 15:48
  • Thanks for the time you are spending giving so many people advice. It was quite hard for me, after so many years of laptop use, to dive back into hardware specifics. You made it a lot easier with your post. Big thanks again!
  • Reply
  • viper65 says
  • 2016-02-22 at 22:20
  • Nice article!
  • What do you think about HBM? Apart from the size of the RAM, do you think that the Fury X has any advantage compared to the 980 Ti?
  • Reply
  • Tim Dettmers says
  • 2016-02-23 at 13:05
  • The Fury X definitely has the edge over the GTX 980 Ti in terms of hardware, though in terms of software AMD still lags behind. This will change quite dramatically once NVIDIA Pascal hits the market in a few months. HBM is definitely the way to go to get better performance. However, NVIDIA’s HBM offers double the memory bandwidth of the Fury X, and Pascal will also allow for 16-bit computation, which effectively doubles the performance further. So I would not recommend getting a Fury X; instead, wait for Pascal.
  • Reply
  • Bobby says
  • 2016-02-23 at 21:37
  • How soon do you think the flagship Pascal, the Titan X equivalent, will be on the market? I am not sure if I should wait. Thank you.
  • Reply
  • hroent says
  • 2016-08-12 at 02:34
  • Hi Tim — Thanks for this article, I’ve found it extremely useful (as have others, clearly).
  • You’re probably aware of this, but the new Titan X Pascal cards have very weak FP16 performance.
  • Reply
  • Tim Dettmers says
  • 2016-08-13 at 21:56
  • Yes the FP16 performance is disappointing. I was hoping for more, but I guess we have to wait until Volta is released next year.
  • Reply
  • viper65 says
  • 2016-02-23 at 18:15
  • Thank you. But considering the size of the memory and the brand, I am afraid the price of Pascal will be far beyond my budget.
  • Reply
  • Wajahat says
  • 2016-03-07 at 13:57
  • Hi Tim
  • Thanks a lot for your article. It answered some of my questions. I am actually new to deep learning and know almost nothing about GPUs, but I have realized that I need one. Can you comment on the expected speedup if I run convnets on a Titan X rather than an Intel Core i7-4770 at 3.4 GHz?
  • Even a vague figure would do the job.
  • Best Regards
  • Wajahat
  • Reply
  • Tim Dettmers says
  • 2016-03-07 at 14:04
  • It depends highly on the kind of convnet you want to train, but a speedup of 5-15x is reasonable. However, if you can wait a bit, I recommend you wait for the Pascal cards, which should hit the market in two months or so.
  • Reply
  • Chip says
  • 2016-03-16 at 11:05
  • Hi Tim,
  • Thanks for this excellent primer. I am trying to put a parts list together and have this so far (http://pcpartpicker.com/p/JnC8WZ), but it has 2 incompatibility issues. Basically, I want to work through the 2nd Data Science Bowl (https://www.kaggle.com/c/second-annual-data-science-bowl) as an exercise. I will likely work with a lot of medical image data. Also, I will use this system as an all-purpose computer too (for medical writing), so I’m wondering if I also need to add the USB, HDMI, and DVI connections (I currently also use an Eizo ColorEdge CG222W monitor). Also, I like the idea of 2 hard drives, one for Windows and one for Linux/Ubuntu (or I could partition?). Finally, I use a wireless connection, hence that choice. I would be most grateful if you could help with the 2 incompatibilities and any omissions, and check whether this system would generally be okay. Thank you in advance for your time.
  • Reply
  • Tim Dettmers says
  • 2016-03-18 at 13:22
  • You can resolve the compatibility issue by choosing a larger mainboard. A larger mainboard should give you better RAM voltage and also fixes the PCIe issue. Although the GTX 680 might be a bit limiting for training state of the art models, it is still a good choice to learn on the Data Science Bowl dataset. Once Pascal hits the market you can easily upgrade and will be able to train all state-of-the-art networks easily and quickly.
  • Reply
  • Chip says
  • 2016-03-20 at 05:15
  • Thank you for this response. I had the GTX 980 selected (in the pcpartpicker permalink), but I may well just wait for the Pascal that you suggested. I read this article (http://techfrag.com/2016/03/18/nvidia-pascal-geforce-x80-x80ti-gp104-gpu-supports-only-gddr5-memory/), however, and suppose I must admit I’m quite confused with the names, the relationship of “Pascal” to GeForce X80, X80Ti & Titan Specs, and also the concern with respect to GDDR5 vs. GDDR5X memory. Is it worth it to wait for one of the GeForce (which I assume is the same as Pascal?) rather than just moving forward with the GTX 980? Will one save money by way of sacrificing something with respect to memory? Please forgive my neophyte nature with respect to systems.
  • Reply
  • Tim Dettmers says
  • 2016-03-20 at 22:02
  • Pascal will be the new chip from NVIDIA which will be released in a few months. It should be designated as GTX 10xx. The xx80 refers to the most powerful consumer GPU model of a given series, e.g. the GTX 980 is the most powerful of the 900 series. The GTX Titan is usually the model for professionals (deep learning, computer graphics for industry and so forth).
  • And yes I would wait for Pascal rather than buy a GTX 980. You could buy a cheap small card and sell it once Pascal hits the market.
  • Reply
  • Phong says
  • 2016-03-17 at 23:55
  • You say GTX 680 is appropriate for convnets, however I see GTX 680 just has 2GB RAM which is inadequate for most convnets such as AlexNet and of course VGG variants.
  • Reply
  • Tim Dettmers says
  • 2016-03-18 at 13:15
  • There is also a 4GB GTX 680 variant which is quite okay. Of course a GTX 980 with 6GB would be better, but it is also way more expensive. However, I would recommend one GTX 980 over multiple GTX 680. It is just not worth the trouble to parallelize on these rather slow cards.
  • Reply
  • Chip says
  • 2016-03-20 at 06:41
  • “CPU and PCI-Express. It’s a trap!”
  • I have no idea what that is supposed to mean. Does that mean I avoid PCI express? Or just certain Haswells? What is the point here?
  • Reply
  • Tim Dettmers says
  • 2016-03-20 at 21:59
  • Certain Haswells do not support the full 40 PCIe lanes. So if you buy a Haswell, make sure it supports all 40 lanes if you want to run multiple GPUs.
  • Reply
  • Hehe says
  • 2016-03-20 at 12:44
  • Why is the AWS g2.8x not enough?
  • It says 60 GiB (approx. 64 GB) of GPU memory.
  • Thanks
  • Reply
  • Tim Dettmers says
  • 2016-03-20 at 22:00
  • The 60GB refers to the CPU memory that the AWS g2.8x has. The GPU memory is 4GB per card.
  • Reply
  • Yi Zhu says
  • 2016-03-26 at 08:34
  • Hi Tim,
  • Thanks for the great post. I am a graduate student and would like to put together a machine soon. If I put up a system with an i7-5930K CPU, an ASUS X99 Deluxe motherboard and two Titan X GPUs for now, will the Pascal GPUs be compatible with this configuration? Can I simply plug in a Pascal GPU when it is released? Thanks a lot.
  • Reply
  • Tim Dettmers says
  • 2016-03-27 at 15:30
  • As far as I understand there will be two different versions of the NVLink interface, one for regular cards and one for workstations. I think you should be alright with your hardware, although you might want to wait for a bit since Pascal will be announced soon and probably ship in May/June.
  • Reply
  • Chip Reuben says
  • 2016-03-27 at 21:48
  • Thanks for the great answers. Do you think that one Titan X with 12 GB of memory is better than, say, two GTX 980s, or two of the upcoming Pascals (xx80s)? I currently have a system designed with a motherboard that has the additional PCIe lanes, but (as I’ve been told by the Puget Systems people) adding a second GPU would slow things down by 2x. So I thought “just get the Titan with 12 GB of memory and be done with it.” Do you think that sounds okay? Or should I upgrade the motherboard? I’m thinking that the Titan may be more than I ever need, but unfortunately I do not know. Thank you for your great help and thorough work.
  • Reply
  • Razvan says
  • 2016-03-31 at 18:21
  • Hey Tim,
  • Awesome article. Was curious whether you have an opinion on the Tesla M40 as well.
  • Looks suspiciously similar to the Titan X.
  • Think the “best DL acceleration” claim might be a bit of a marketing gamble?
  • Cheers,
  • –Razvan
  • Reply
  • Tim Dettmers says
  • 2016-04-02 at 09:40
  • This post is getting slowly outdated and I did not review the M40 yet — I will update this post next week when Pascal is released.
  • To answer your question, the Titan X is still a bit faster with 336 GB/s while the M40 sports 288 GB/s. But the M40 has much more memory which is nice. But both cards will be quite slow compared to the upcoming Pascal.
  • Reply
  • Xiao says
  • 2016-04-04 at 03:48
  • Hi Tim,
  • Thanks for the post! Very helpful. Was just wondering what editor (monitor in the center) did you use in the picture showing the three monitors?
  • Reply
  • Tim Dettmers says
  • 2016-04-04 at 07:46
  • That is an AOC E2795VH. Unfortunately they are not sold anymore. But I think any monitor with a good rating will do.
  • Reply
  • David Laxer says
  • 2016-04-04 at 10:56
  • Hi,
  • Thanks for this post. Are there any Cloud solutions yet?
  • I used the Amazon g2.2xlarge as well as the g2.8xlarge as spot instances; however, the GPUs are old, don’t support the latest CUDA features, and spot prices have increased.
  • Reply
  • Tim Dettmers says
  • 2016-04-04 at 15:51
  • There are also some smaller providers for GPUs but their prices are usually a bit higher. Newer GPUs will also be available via the Microsoft Azure N-series sometime soon, and these instances will provide access to high-end GPUs (M60 and K80). I will look into this issue in the next week when I update my GPU blog post.
  • Reply
  • David Laxer says
  • 2016-04-06 at 18:10
  • Can you recommend a good box which supports:
  • 1. multiple GPUs for deep learning (say the new Nvidia GP100),
  • 2. additional GPU for VR headset,
  • 3. additional GPU for large monitor?
  • Thanks!
  • Reply
  • Matt says
  • 2016-04-05 at 04:49
  • Everyone seems to be using an Intel CPU, but they seem prohibitively expensive if actual clock speed or cache isn’t that important… Would an AMD cpu with 38 lane support work just as well paired with two GPUs?
  • Also, have you experimented with builds using two different GPUs?
  • Reply
  • Tim Dettmers says
  • 2016-04-05 at 20:09
  • Yes, an AMD CPU should work just as well with 2 GPUs as an Intel one. However, using two different GPUs will not work if they have different chipsets (GTX 980 + GTX 970 will not work); what will work is different vendors with the same chipset (an EVGA GTX 980 + an ASUS GTX 980 will work with no problems).
  • Reply
  • Matt says
  • 2016-04-05 at 20:48
  • I see – thanks! I’m considering just getting a cheaper gpu to at least get my build started and running and then picking up a Pascal gpu later. My plan was to use the cheaper gpu to drive a few monitors and use the Pascal card for deep learning. That kind of setup should be fine right? In other words, there is only an issue with two different cards if I try to use them both in training, but I’m essentially using just a single gpu for it
  • Reply
  • Steven says
  • 2016-04-08 at 05:23
  • Hi Tim,
  • This post was amazingly useful for me. I’ve never built a machine before and this feels very much like jumping in the deep end. There are two things I’m still wondering about:
  • 1. If I’m using my GPU(s) for deep learning, can I still run my monitor off of them? If not should I get some (relatively) cheap graphics card to run the monitor, or do something else?
  • 2. Do you have any opinion about Intel’s i7-4820K CPU vs. the i7-5820K CPU? There seems to be a speed vs. cache size & cores trade-off here. My impression is that whatever difference there is will be small, but the larger cache size should lead to fewer cache misses, which should be better. Is this accurate?
  • Thanks
  • Reply
  • Steven says
  • 2016-04-09 at 15:46
  • Was just reading through the Q/A’s here and saw your response to Rohit Mundra (2015-12-22) answered my first question.
  • Sorry for the repeat….
  • Reply
  • Tim Dettmers says
  • 2016-04-24 at 08:04
  • No problem, I am glad you made the effort to find the answer in the comment section. Thanks!
  • Reply
  • Chip Reuben says
  • 2016-04-08 at 19:16
  • My guess is that (if done right) the monitor functionality gets relegated to the integrated graphics capability of the motherboard. Just don’t try to stream high-res. video while training an algorithm.
  • Reply
  • Steven says
  • 2016-04-09 at 03:06
  • Ooops – I should have mentioned that the motherboard I’m using is an ASRock Fatal1ty X99 Professional/3.1 EATX LGA2011-3. It doesn’t have an integrated graphics chip.
  • Reply
  • Dorje says
  • 2016-04-09 at 16:41
  • Hi Tim, THANKS for such a great post! and all these responses!
  • I got a question:
  • What if I buy a TX 1 instead of buying a computer ?
  • I will do video or CNN image classification sorts of things.
  • Cheers,
  • Dorje
  • Reply
  • Tim Dettmers says
  • 2016-04-24 at 08:03
  • Hi Dorje,
  • I also thought about buying a TX1 instead of a new laptop, but then I opted against it. The overall performance of the TX1 is great for a small, mobile, embedded device, but not so great compared to desktop GPUs or even laptop GPUs. There might also be issues if you want to install new hardware because it might not be supported by the Ubuntu for Tegra OS. I think in the end the money is better spent on a small, cheap laptop and some credit for GPU instances on AWS. Soon there will also be high performance instances (featuring the new Pascal P100), so this would also be a good choice for the future.
  • Reply
  • Yi says
  • 2016-04-14 at 07:52
  • Hi Tim,
  • Thanks for the great post. Sorry to bother you again. I just want to ask something about the Coolbits option of the GPU cards. Right now I set it to 12 and I can manually control the fan speed. It works nicely. But I won’t check the temperature all the time and change the fan speed accordingly. So during training, what fan speed should I use? 50%, 80%, or an aggressive 90% maybe? Thanks a lot.
  • And if I keep the fan always running at 80% speed, will it reduce the lifespan of the card? Thanks.
  • Reply
  • Tim Dettmers says
  • 2016-04-24 at 07:56
  • The life expectancy of the card increases the cooler you keep it. So if you can, keep the fan at 100% at all times. However, this can of course cause problems with noise if the machine is near you or other people. For my desktop I keep the fan as low as possible while keeping the GPU below 80 degrees C, and when I leave the room I just set the fan speed to 100%.
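  • If you do not want to watch the temperature yourself, a minimal sketch of a simple fan curve is below; it assumes the Coolbits option is set as described above, an X session is running, and that your driver exposes the GPUFanControlState/GPUTargetFanSpeed attributes (names differ between driver versions, so verify with `nvidia-settings -q all`):
```python
# Simple fan curve: poll the temperature with nvidia-smi and set the fan with
# nvidia-settings. Requires Coolbits and an X session; the attribute names are
# an assumption to check against your driver version.
import subprocess, time

def gpu_temperature(gpu=0):
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(gpu), "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"])
    return int(out.decode().strip())

def set_fan(percent, gpu=0):
    subprocess.check_call(
        ["nvidia-settings",
         "-a", "[gpu:{}]/GPUFanControlState=1".format(gpu),
         "-a", "[fan:{}]/GPUTargetFanSpeed={}".format(gpu, percent)])

while True:
    set_fan(100 if gpu_temperature() >= 80 else 60)   # keep the card below ~80 C
    time.sleep(30)
```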
  • Reply
  • Spuddler says
  • 2016-06-11 at 17:41
  • Keep in mind that running your fans at 100% constantly will wear out the fans much faster – although that is better than a dead GPU chip. It can be difficult to find cheap replacement fans for some GPUs, so you should look for cheap ones on Alibaba etc. and have a few spares lying around in advance, since shipping from China takes weeks.
  • Also, when a fan stops running smoothly, you can usually just buy cheap “ball bearing oil” ($4 on eBay or so) and remove the sticker on the front side of the fan. There will be some tiny holes beneath into which you can simply squirt some of the oil, and most likely the fan will run as good as new. That has worked for me so far.
  • Reply
  • Raj says
  • 2016-04-18 at 15:12
  • Thanks for the great blog, I learned a lot.
  • For me, getting a 40-lane or even 28-lane CPU and motherboard is out of budget. In my country these parts are rare.
  • I am planning to get a 16-lane CPU. With this I can get a motherboard which has 2x PCIe 3.0 x16. I plan to use a single GPU initially. If I want to use 2 GPUs it has to be an x8/x8 configuration. With this configuration, is it practical to use 2 GPUs in the future?
  • My system will likely have i7 6700, Asus Z170-A and Titan X.
  • Cheers,
  • RK
  • Reply
  • Tim Dettmers says
  • 2016-04-18 at 19:55
  • Hi RK,
  • 16 lanes should still work well with 2 GPUs (but make sure the CPU supports x8/x8 lanes — I think every CPU does, but I have never used them myself). The transfer to the GPU will be slower, but the computation on the GPU should still be just as fast. You will probably see a performance drop of 0-5% depending on the data that you have.
  • Reply
  • Eduardo says
  • 2016-04-24 at 10:02
  • Hi, I am a Brazilian student, so everything is way too expensive for me. I will buy a GTX 960, start off with a single GPU, and expand later on. The problem is that Intel CPUs with 30+ lanes are WAY too expensive. So I HAVE to go with AMD, but the motherboards for AMD only have PCIe 2.0.
  • My question is: can I get good performance out of 2 x 960 GPUs on a PCIe 2.0 x16 motherboard? By good I mean equal to a single 960 at x16 on PCIe 3.0, maybe even a single GTX 980.
  • Reply
  • Tim Dettmers says
  • 2016-04-24 at 13:23
  • Hi, both an Intel CPU with 16 lanes or fewer (as long as your motherboard supports 2 GPUs) and an AMD CPU with PCIe 2.0 will be fine. You will not see large decreases in performance; it should be about 0-10% depending on the task and deep learning software.
  • If you are short on money it might also be an option to use AWS GPU instances. If you do not train every day this might be cheaper in the end. However, for tinkering around with deep learning a GTX 960 will be a pretty solid option.
  • Reply
  • Dorje says
  • 2016-04-24 at 15:29
  • Thank you very much, Tim.
  • I got a Titan X, hahaha~
  • Cheers,
  • Dorje
  • Reply
  • Lucian says
  • 2016-04-30 at 01:19
  • Hi Tim, great post!
  • Could you talk a bit about having different graphics cards in the same computer? As an extreme example, would having a Titan X, 980 Ti and a 960 be problematic?
  • Reply
  • DD Sharma says
  • 2016-05-05 at 01:39
  • Tim,
  • Any updates to your recommendations based on Skylake processors and especially Quadro GPUs?
  • Reply
  • Tim Dettmers says
  • 2016-05-08 at 15:00
  • Skylake is not needed and Quadro cards are too expensive — so no changes to any of my recommendations.
  • Reply
  • Daniel Rich says
  • 2016-05-07 at 06:23
  • So reading in this post that bandwidth is the key limiter makes me think the GTX 1080, with a bandwidth of 320 GB/s, will be slightly worse for deep learning than a 980 Ti. Does that sound right?
  • Reply
  • Tim Dettmers says
  • 2016-05-08 at 15:01
  • You cannot compare the bandwidth of a GTX 980 with the bandwidth of a GTX 1080 because the two cards use different chipsets. The GTX 1080 will definitely be faster.
  • Reply
  • Gilbert says
  • 2016-05-07 at 15:50
  • Hi, does the number of CUDA cores matter? The GTX 1080 is about to be released and it has about 2500 CUDA cores, whereas a GTX 980 Ti has about 2800 CUDA cores. Will this affect the speed of training? Or will the GTX 1080 in general be faster with its 8 teraflops of performance?
  • Reply
  • Tim Dettmers says
  • 2016-05-08 at 15:07
  • The number of cores does not matter really. It all depends how these cores are integrated with the GPU. The GTX 1080 will be much faster than the GTX Titan X, but it is hard to say by how much.
  • Reply
  • Jerry says
  • 2016-05-08 at 05:20
  • Hi Tim. Thanks for an excellent guide! I was wondering what your opinion is on Nvidia’s new graphics card – Nvidia Geforce GTX 1080. The performance is said to beat the Titan X and is proposed to be half the price!
  • Reply
  • Bob Sperry says
  • 2016-05-10 at 00:57
  • Hi Tim,
  • I suppose this is echoing Jerry’s question, but is there any reason to prefer a Titan X to a GTX 1080 or 1070? The only spec where the Titan X still seems to perform better is memory (12 GB vs. 8 GB).
  • I got a Titan X on Amazon about 2.5 weeks ago, so have about 10 days to return it for a full refund and try for a GTX 1080 or 1070. Is there any reason not to do this?
  • Reply
  • Tim Dettmers says
  • 2016-05-10 at 12:58
  • No deep learning performance data is available for the GTX 1000 series yet, but it is rather safe to say that these cards will yield much better performance. If you use 16-bit — and probably most libraries will change to that soon — you will see an increase of at least 2x in performance. I think returning your Titan X is a good idea.
  • Reply
  • Spuddler says
  • 2016-06-11 at 17:49
  • Just wanted to add that Nvidia artificially crippled the 16bit operation on the 1070/1080 GTX to abysmal speeds, so we can only hope they don’t do the same with the Pascal Titan card.
  • Reply
  • DD Sharma says
  • 2016-05-13 at 15:15
  • Hello Tim,
  • Comparing two cards for GPGPU (deep learning being an instance of GPGPU), what is more important: the number of cores or memory? For learning purposes and maybe some model development I am considering a low-end card (512 cores, 2GB) … will this seriously cripple me? Other than giving up performance gains, will it seriously be constraining? I checked the research work of folks from 5+ years ago and many in academia used hardware with even weaker specs and still got something done. Once I discover that I am doing something really serious I can go to the Amazon cloud, get an external GPU (connected via Thunderbolt 3), or build a machine.
  • Reply
  • Tim Dettmers says
  • 2016-05-26 at 11:12
  • Neither cores nor memory is important per se. Cores do not really matter. Bandwidth is most important and FLOPS second most important. You need a certain amount of memory to train certain networks; for state-of-the-art models you should have more than 6GB of memory.
  • Reply
  • Thomas R says
  • 2016-05-19 at 15:24
  • Hi Tim, did you connect your 3 monitors to the mainboard/CPU or to your GPU? Does this have an influence on the deep learning computation?
  • Reply
  • Tim Dettmers says
  • 2016-05-26 at 11:08
  • I connected them to two GPUs. It does not really affect performance (maybe 1-3% at most), but it does take up some memory (200-500MB). Overall this effect is negligible.
  • Reply
  • Greg says
  • 2016-05-28 at 08:35
  • Hey Tim…quick question. Do you have any opinion about the new GeForce GTX 1080s for deep learning?
  • Maybe you already gave your opinion and I have missed it.
  • Thanks,
  • Greg
  • Reply
  • Adrian Sarno says
  • 2016-06-09 at 18:48
  • Tim,
  • I’m looking for information on which GPU cards have support for convolutional layers. In particular, I was considering a laptop with the GTX 970, but according to your blog above it does not support convolutional nets. Would you mind explaining what that means in terms of features and also time performance? Is there a way to know from the specs whether a card is good for conv nets?
  • thanks in advance
  • Reply
  • Tim Dettmers says
  • 2016-06-11 at 16:16
  • Maybe I have been a bit unclear in my post. The GTX 970 supports convolutional nets just fine, but if you use more than 3.5GB of memory you will be slowed down. If you use 16-bit networks, though, you can still train reasonably sized networks. So a GTX 970 is okay for most non-research, non-I-want-to-get-into-the-Kaggle-top-5 use cases.
  • Reply
  • Epenko pentekeningen says
  • 2016-06-09 at 23:11
  • Question: For budgetary reasons I’m looking at an AMD CPU/board combination (4 cores), but that combination has no onboard video.
  • Can the GPU (a 4GB NVIDIA 960) which will be used for machine learning also be used at the same time as the video card (no 3D, of course)?
  • Does that work or do I need an extra video card? Thanks!
  • Reply
  • Tim Dettmers says
  • 2016-06-11 at 16:17
  • Yes, that will work just fine! This setup would be a great setup to get started with deep learning and get a feel for it.
  • Reply
  • Nizam says
  • 2016-06-10 at 11:42
  • This is the most informative blog about building a deep learning machine!
  • Thanks for that.
  • Now that NVIDIA’s 1080 and 1070 are launched, which is the better deal for us:
  • two 1070s or one 1080?
  • Everyone writes in the context of gamers.
  • I badly need this community’s voice here!
  • Reply
  • Adrian Sarno says
  • 2016-06-12 at 18:59
  • I have a laptop with an NVIDIA Quadro M3000M (4.0GB GDDR5, PCI-Express). I would like to use it for deep learning. I noticed that no one mentions Quadro cards in the context of deep learning; is there a design reason why these cards are not used for it?
  • PS: I tried to install Ubuntu (all its versions) and it fails to show the GNOME menu; it just shows the desktop background image.
  • Reply
  • Spuddler says
  • 2016-06-12 at 21:48
  • As far as I know, Quadro cards are usually optimized for CAD applications. You can use them for deep learning, but they will not be as cost-efficient as regular GeForce cards.
  • Your problem with Ubuntu not booting is a strange one; it does not really look like a graphics driver issue since you get a screen. Before googling for more difficult troubleshooting procedures I would try other Ubuntu 14.04 LTS flavours if I were you, like Xubuntu (Windows-like, lightweight), Kubuntu (Windows-like, fancy) or even Lubuntu (very lightweight). It may just be some arcane issue with Ubuntu’s GNOME desktop and your hardware.
  • Reply
  • Yasumi says
  • 2016-06-15 at 13:26
  • For deep learning on speech recognition, what do you think of the following specs?
  • It’s going to cost 2928USD. What are your thoughts on this?
  • – INTEL CORE I7-6800K UNLOCKED FOR OC(28lanes)(6 CORE/ 12 THREADS/3.8GHZ) NEW!
  • – XSPC RayStorm D5 Photon AX240 (240L)
  • – ASUS X99-E WS (ATX/4way SLI/8x Sata3/2xGigabit LAN/10xUSB3.0)
  • – 4 x GSKILL RipjawsV RED 2x8GB DDR4 2400mhz (CL15)
  • -ZOTAC GTX1080 8GB DDR5X 256BIT Founder’s Edition (1733/10000)-NEW
  • – SuperFlower Leadex Gold 650W(80+Gold/Full Modular)*5 Years Warranty
  • – CORSAIR AIR 540 BLACK WINDOW
  • – INTEL 540s 480GB 2.5″ Sata SSD (560/480)
  • Reply
  • Tim Dettmers says
  • 2016-06-16 at 16:29
  • This is a good build for a general computation machine. A bit expensive for deep learning, as the performance is mostly determined by the GPU. Using more GPUs and cheaper CPU/Motherboard/RAM would be better for deep learning, but I guess you want to use the PC also for something different than deep learning :). This would be a good PC for kaggle competitions. If you plan on running very big models (like doing research) then I would recommend a GTX Titan X for memory reasons.
  • Reply
  • Glenn says
  • 2016-06-16 at 00:30
  • Thanks for all the info. If I plan to use only one GPU for computation, then would I expect to need two GPUs in my system: one for computation and another for driving a couple of displays? Or can a single GPU be used for both jobs?
  • Reply
  • Tim Dettmers says
  • 2016-06-16 at 16:25
  • A single GPU is fine for both. A monitor will use about 100-300MB of your GPU memory and usually draw an insignificant amount (<2%) of performance. It is also the easier option, so I would just recommend to use a single GPU.
  • Reply
  • Adrian Sarno says
  • 2016-06-16 at 19:28
  • I haven’t been able to boot up this MSI laptop with any of the flavors of 14.04 (Lubuntu, Xubuntu, Kubuntu, Ubuntu). Could it be that the Skylake processor is not compatible with 14.04?
  • https://bugzilla.kernel.org/show_bug.cgi?id=109081
  • Looks like I will have to wait until a fix is created for the upstream Ubuntu versions or until NVIDIA updates CUDA to support 16.04. Is there anything else I can try?
  • Thanks!
  • Reply
  • Tim Dettmers says
  • 2016-06-18 at 22:34
  • Laptops with an NVIDIA GPU in combination with Linux are always a pain to get running properly, as it often also depends heavily on the other hardware in your laptop. I do not have any experience with this case, but you might be able to install 14.04 and then try to patch the kernel with what you need. Not easy to do, though.
  • Reply
  • Poornachandra Sandur says
  • 2016-06-17 at 07:22
  • Hi Tim Dettmers,
  • Your blog is awesome. I currently have a GeForce GTX 970 in my system; is that sufficient for getting started with convolutional neural networks?
  • Reply
  • Tim Dettmers says
  • 2016-06-18 at 22:37
  • A GTX 970 is an excellent option to explore deep learning. You will not be able to train the very largest models, but that is also not something you want to do while you explore. It is mostly about learning how to train small networks on common and easy problems, such as AlexNet and similar convolutional nets on MNIST, CIFAR10 and other small data sets, until you get a “feel” for training convolutional nets so that you can then go on to larger models and larger data sets (ResNet on ImageNet, for example). So everything is good.
  • Reply
  • peter says
  • 2016-06-19 at 23:15
  • Hello Tim:
  • Thanks for the great post. I built the following PC based on it.
  • CPU: i5 6600
  • Mother board: Z170-p
  • DDR4: 16g
  • GPU: nvidia 1080 founder edition
  • Power: 750W
  • However, after I installed 14.04, I can’t get CUDA 8.0 and the new driver to install (which, they claim, GTX 1080 users have to update to).
  • Could the problem be caused by the other components of the PC, like the motherboard?
  • Thanks!
  • Reply
  • Tim Dettmers says
  • 2016-06-23 at 15:51
  • I have heard that people have problems with Skylake under Ubuntu 14.04, but I am not sure if that is really the problem here. You can try upgrading to Ubuntu 16.04 because Skylake support is better under that version, but I am not sure if that will help.
  • Reply
  • Milan Ender says
  • 2016-06-21 at 00:12
  • Hey,
  • first of all thanks for the guide, helped me immensely to get some clarity in this puzzle!
  • Couple of questions as I’m a bit too impatient to wait for 1080/70 reviews on this topic:
  • As you stated, bandwidth, memory clock and memory size seem to be among the most important factors, so would it even make sense to put some more money into a solidly overclocked custom GPU? So far I’ll just pick the cheapest solidly cooled one (EVGA ACX 3.0 probably).
  • Also, my initial comparison of the GTX 1070 vs the GTX 1080 was heavily in favor of the 1080, based on the benchmarks from http://www.phoronix.com/scan.php?page=article&item=nvidia-gtx-1070&num=4 . Though the theoretical single-precision TFLOPS MIXBENCH results were slightly in favor of the 1070 (76.6 €/TFLOP for the 1080 vs 73.9 €/TFLOP for the 1070), the SHOC-on-CUDA results in terms of price efficiency were slightly in favor of the 1080, but more or less the same. However, the GDDR5X on the GTX 1080 seems to seal the deal for deep learning applications, I guess? Also, I found the 1080 around 6 Watt/TFLOP more cost efficient. Am I on the right track here? Maybe the numbers help some others here searching for opinions on that :).
  • Anyways after reading through your articles and some others I came up with this build:
  • http://pcpartpicker.com/list/LxJ6hq . Some comments would be very appreciated. I feel like the CPU is a bit overkill, but it was the cheapest with DDR4 RAM and 40 lanes. Maybe it is not needed, though I’m a bit unsure about that.
  • Best regards
  • Reply
  • Arun Das says
  • 2016-06-21 at 02:55
  • Wonderful guide ! Thank you !
  • Reply
  • gameeducationjournal.info says
  • 2016-06-24 at 22:16
  • When I initially left a comment I seem to have clicked on the “Notify me when new comments are added” checkbox, and now every time a comment is added I get four emails with the exact same comment. Is there a means by which you can remove me from that service? Many thanks!
  • Reply
  • Tim Dettmers says
  • 2016-06-25 at 21:32
  • That sounds awful. I will check what is going wrong there. However, I am unable to remove a single user from the subscription. See if you can unsubscribe yourself; otherwise please contact the Jetpack team. Apparently the data is stored by them, and the plugin that I use for this blog accesses that data, as you can read here. I hope that helps you. Thanks for letting me know.
  • Reply
  • Rikard Sandström says
  • 2016-06-27 at 11:42
  • Thank you for an excellent post, I keep coming back here for reference.
  • With regards to memory types, what role does GDDR5 vs GDDR5X play? Is this an important differentiator between offerings like 1080 and 1070, or is it not relevant for deep learning?
  • Reply
  • Simon says
  • 2016-07-04 at 15:49
  • Hi
  • The spec of the Asus X99-E WS shows that it has a PLX chip which provides an additional 48 PCIe lanes. Getting an i7-6850K with an X99-E WS theoretically gives you 88 PCIe lanes in total, which is still plenty to run 4 GPUs all at x16.
  • Does that hold for deep learning?
  • Thx for reply.
  • Reply
  • Tim Dettmers says
  • 2016-07-04 at 19:14
  • I am not exactly sure how this feature maps to the CPU and to software compatibility. From what I have heard so far, you can quite reliably access GPUs from very non-standard hardware setups, but I am not so sure whether the software would support such a feature. If the GPUs are not aware of each other on the CUDA level due to the PLX chip, then this feature will do nothing good for deep learning (it would probably be even slower than a normal board, because you would probably need to go through the CPU to communicate between GPUs).
  • But the idea of a PLX chip is quite interesting, so if you are able to find out more information about software compatibility, then please leave a comment here — that would not only help you and me, but also all these other people that read this blog post!
  • Reply
  • Simon says
  • 2016-07-07 at 12:34
  • In general I am looking for a cheaper way to assemble the setup without decreasing performance.
  • Does the NVIDIA Coolbits option make it possible to reduce how much the GPU heats up?
  • You wrote about “coolbits” on Ubuntu and the problem with headless cards.
  • Have you heard about DVI or VGA dummy plugs, i.e.
  • http://www.ebay.com/itm/Headless-server-DVI-D-EDID-1920×1200-Plug-Linux-Windows-emulator-dummy-/201087766664
  • I think it would be a good solution for a video card with no monitor attached, with no problems with Coolbits control.
  • Reply
  • Dante says
  • 2016-07-07 at 20:21
  • Tim,
  • Based on your guide I gather that choosing a less expensive hexa-core Xeon CPU with either 28 or 40 lanes will not cause a great drop in performance. Is that correct (for 1-2 GPUs)? Can you share your thoughts?
  • Great guides. Very helpful for folks getting into deep learning and trying to figure out what works best for their budget.
  • Dante
  • Reply
  • Tim Dettmers says
  • 2016-07-09 at 16:52
  • Yes that is very true. There is basically no advantage from newer CPUs in terms of performance. The only reason really to buy a newer CPU is to have DDR4 support, which comes in handy sometimes for non-deep learning work.
  • Reply
  • John says
  • 2016-07-07 at 23:19
  • Great article. What would you recommend for a laptop GPU setup rather than a desktop? I see a lot of laptop builds with a 980M or 970M GPU, but is it worth waiting for some variant of the 1080M/1070M/1060M?
  • Reply
  • Tim Dettmers says
  • 2016-07-09 at 16:55
  • A laptop with such a high-end graphics card is a huge investment, and you will probably use that laptop much longer than people use their desktops (with a desktop it is much easier to sell your GPU and upgrade). I would thus recommend waiting for the 1000M series. It seems it will arrive within a few months, and the first performance figures show that these chips are slightly faster than the GTX Titan X — that would be well worth the wait in my opinion!
  • Reply
  • sk06 says
  • 2016-07-09 at 10:28
  • Hi Tim,
  • Thanks for the excellent post. The user comments are also pretty informative. Kudos to all.
  • I recently started shifting my focus from conventional machine learning to deep learning. I work in the medical imaging domain, and my application has a dataset of 50000 color images (5000 per class, 10 classes, size 512×512). I have a system with a Quadro K620 GPU. I want to train state-of-the-art CNN architectures like GoogLeNet Inception V3, VGGNet-16 and AlexNet from scratch. Will the Quadro K620 be sufficient for training these models? If I have to go for a higher-end GPU, can you please suggest which card I should go for (GTX 1080, Titan X, etc.)? I want to generate the prototypes as fast as possible. Budget is not the primary concern.
  • Reply
  • Tim Dettmers says
  • 2016-07-09 at 17:01
  • A Quadro K620 will not be sufficient for these tasks. Even with very small batch sizes you will hit the limits pretty quickly. I recommend getting a Titan X on eBay. Medical imaging is a field with high-resolution images where any additional amount of memory can make a good difference. Your dataset is fairly small though and probably represents quite a difficult task; it might be good to split up the images to get more samples and thus better results (quarter them, for example, if the label information is still valid for the resulting images), which in turn would consume more memory. A GTX Titan X should be best for you.
  • Reply
  • David Selinger says
  • 2016-07-24 at 19:17
  • Hey there Tim,
  • Thanks for all the info!
  • I was literally pushing send on an email that said “ORDER IT” to my local computer build shop when nVidia announced the new Titan X Pascal.
  • Do you have any initial thoughts on the new architecture? Especially as it pertains to cooling the VRAM, which usually requires some sort of custom hardware (a cooling plate? my terminology is likely wrong here): will that add additional delay after purchasing the new hardware?
  • Thank you sir!
  • Reply
  • Tim Dettmers says
  • 2016-07-25 at 06:02
  • There should be no problems with cooling the GDDR5X memory with the normal card layout and fans. I know that for HBM2 NVIDIA actually designed the memory to be actively cooled, but HBM2 is stacked while GDDR5X is not. Generally GDDR5X is very similar to GDDR5 memory. It will consume less power but also offer higher density, so that on the bottom line GDDR5X should run at the same temperature level or only slightly hotter than GDDR5 memory — no extra cooling required. Extra cooling makes sense if you want to overclock the memory clock rate, but often you cannot get much more performance out of it relative to how much you need to invest in cooling solutions.
  • Overall the Pascal architecture seems quite solid. However, most features of the series are a bit crippled due to manufacturing bottlenecks (16nm, GDDR5X and HBM2 all need their own factories). You can expect that the next line of Pascal GPUs will step up the game by quite a bit. The GTX 11 series will probably feature GDDR5X/HBM2 for all cards and allow full half-precision float performance. So Pascal is good, but it will become much better next year.
  • Reply
  • David Selinger says
  • 2016-07-25 at 22:23
  • Cool thanks. That gave me something to chew on.
  • Last question (hopefully for at least a week : ) ): Do you think that a standard hybrid cooling closed-loop kit (like this one from Arctic: https://www.arctic.ac/us_en/accelero-hybrid-iii-140.html) will be sufficient for deep learning or is a custom loop the only way to go?
  • – VRM: heatsink + fan
  • – VRAM: Heatsink ONLY
  • – GPU: closed-loop water cooled
  • Obviously will have to confirm the physical fit once those specs become more available, but insofar as the approach, I was a little bit concerned about the VRAM.
  • The use case is convolutional networks for image and video recognition.
  • Thanks,
  • Selly
  • Reply
  • bmahak says
  • 2016-07-26 at 01:18
  • I want to build my own deep learning machine using a Skylake motherboard and CPU. I am planning not to use more than 2 GPUs (GTX 1080), starting with one GPU first and upgrading to a second one if needed.
  • Here is my setup on pcpartpicker: http://pcpartpicker.com/user/bmahak2005/saved/Yn9qqs
  • Please tell me what you think about it.
  • Thanks again for a great article .
  • HB.
  • Reply
  • Tim Dettmers says
  • 2016-08-04 at 06:30
  • The motherboard and CPU combo that you chose only supports 8x/8x speed for the PCIe slots. This means you might see some slowdown in parallel performance if you use both of your GPUs at the same time. The decrease varies between networks, with roughly 0-10% performance loss. Otherwise the build seems to be okay. Personally I would go with a few more watts on the PSU, just to have a safe buffer of extra watts.
  • Reply
  • drh1 says
  • 2016-07-31 at 04:50
  • Hi Tim,
  • Thanks for some really useful comments. I have a hardware question. I’ve configured a Windows 10 machine for some GPU computing (not DL) at the moment. I think the hardware issues overlap with your blog, so here goes:
  • The system has a GTX 980 Ti card and a K40 card on an ASUS X99 Deluxe motherboard. When the system boots up, the 980 (which also runs the display) is fine, but the K40 gives me “This device cannot start. (Code 10). Insufficient system resources exist to complete the API”. I have the most up-to-date drivers (354.92 for the K40, 368.81 for the 980).
  • Has anyone configured a system like this, and did they have similar problems? Any ideas will be greatly appreciated.
  • Reply
  • Tim Dettmers says
  • 2016-08-04 at 06:45
  • It might well be that your GPU driver is meddling here. There are separate drivers for Tesla and GTX GPUs, and you have the GTX variant installed, so the Tesla card might not work properly. I am not entirely sure how to get around this problem. You might want to configure the system as a headless (no monitor) server with Tesla drivers and connect to it using a laptop (you can use remote desktop on Windows, but I would recommend installing Ubuntu).
  • Reply
  • Michael Lanier says
  • 2016-08-03 at 16:33
  • How do the new NVIDIA 10xx compare? I followed through with this guide and ended up getting a GTX Titan. The bandwidth looks slightly higher for the Titan series. Does the architecture affect learning speeds?
  • Reply
  • Tim Dettmers says
  • 2016-08-04 at 06:51
  • The bandwidth is high for all Titans, but their performance differs from architecture to architecture; for example, Kepler (GTX Titan) is much slower than Maxwell (GTX Titan X) even though they have comparable bandwidth. So yes, the architecture does affect learning speed — quite significantly so!
  • Reply
  • Tim Dettmers says
  • 2016-08-06 at 18:09
  • There are several reasons:
  • – I led a team of 250 in an online community, and people often asked me for help and guidance. At first I sometimes lent support and sometimes I did not. Over time, however, I realized that not helping out can produce problems: it demotivates people from something which they really want to do but do not know how to do, and it produces defects in the social environment (when I do not help out, others take example from my actions and do the same), among other things. Once I started always lending a hand, I found that I did not lose as much time as I thought I would. Due to my vast background knowledge in this online community, it was often faster to help than to think about whether some question or request was worthy of my help. I now always help without a second thought, or at least start helping until my patience grows tired
  • – Helping people makes me feel good
  • – I was born with genes which make me smart and which make me understand some things more easily than others. I feel that I have a duty to give back to those who were less fortunate in the birth lottery
  • – I believe everybody deserves respect. Answering questions which are easy for me to answer is a form of respect
  • I hope that answers your question
  • Reply
  • Andrew says
  • 2016-08-15 at 06:59
  • You are an amazingly good person Tim. The world needs more people like you. Your actions encourage others to behave in a similar way which in turn helps build better online and offline communities. Thank you!
  • Reply
  • Arman says
  • 2016-08-07 at 22:02
  • Thanks for the great guide.
  • I had a question. What is the minimum build that you recommend for hosting a Titan X pascal?
  • Reply
  • Tim Dettmers says
  • 2016-08-08 at 06:48
  • For a single Titan X Pascal, and if you do not want to add another card later, almost any build will do. The CPU does not matter; you can buy the cheapest RAM and should have at least 16 GB of it (24 GB will be more than enough). For the PSU, 600 watts will do; 500 watts might be sufficient. I would buy an SSD if you want to train on large data sets or raw images that are read from disk.
  • Reply
  • Wajahat says
  • 2016-08-11 at 15:27
  • Hi Tim
  • Thanks a lot for your useful blog.
  • I am training CNN on CPU and GPU as well.
  • Although the weights are randomly initialized, I am setting the random seed to zero at the beginning of training. Still, I am getting different learnt weights on the CPU than on the GPU. The difference is not huge (e.g. -0.0009 vs -0.0059, or 0.0016 vs 0.0017), but there is a difference that I can notice. Do you have any idea how this could be happening? I know it is a very broad question, but what I want to ask is: is this expected or not?
  • I am using MatlabR2016a with MatConvNet 1.0 beta20 (Nvidia Quadro 410 GPU in Win7 and GTX1080 in Ubuntu 16.04), Corei7 4770 and Corei7 4790.
  • Exactly same data with same network architecture used.
  • Best Regards
  • Wajahat
  • Reply
  • Tim Dettmers says
  • 2016-08-13 at 21:45
  • This can well be true and normal. The same seed can produce different random numbers on CPU and GPU if different algorithms are used. Convolution on GPUs may also include some non-deterministic operations (cuDNN 4). When using unit tests to compare CPU and GPU computation, I also often see some difference in output given the same input, so I assume there are also small differences in floating-point computation (although very small). All this might add up to your result; comparing the two with an explicit tolerance, as in the sketch below, is the usual way to check this.
  • Reply
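  • As a practical note on checking such differences: rather than expecting bit-identical results, compare the two backends with an explicit tolerance. A small numpy sketch (the arrays are hypothetical stand-ins for values exported from the CPU and GPU runs):
```python
# Sketch: compare CPU and GPU outputs with a tolerance instead of exact
# equality (hypothetical arrays standing in for exported values).
import numpy as np

out_cpu = np.array([-0.0009, 0.0016, 0.0121], dtype=np.float32)
out_gpu = np.array([-0.0010, 0.0017, 0.0119], dtype=np.float32)

abs_diff = np.abs(out_cpu - out_gpu)
rel_diff = abs_diff / (np.abs(out_cpu) + 1e-8)
print("max abs diff:", abs_diff.max(), "max rel diff:", rel_diff.max())

# A loose tolerance acknowledges different random number algorithms, cuDNN
# non-determinism and accumulated floating-point rounding.
print(np.allclose(out_cpu, out_gpu, rtol=1e-2, atol=1e-4))
```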
  • Vasanth says
  • 2016-08-11 at 17:22
  • Hi Tim,
  • Many thanks for this post, and your patient responses. I had a question to ask: NVIDIA gave away the Tesla K40C (which is the workstation version of the K40, as I understand) as part of its Hardware Grant Program (I think they are giving the Titan X now, but they were giving Tesla K40Cs until recently). It’s not clear to me which workstations from standard OEMs like Dell/HP are compatible with a K40C. I have spoken to a few vendors about compatibility issues, but I don’t seem to get convincing, knowledgeable responses. I am concerned about buying a workstation which would later not be compatible with my GPU. Would it be possible for you to share any pointers you may have?
  • Thank you very much in advance.
  • Reply
  • Tim Dettmers says
  • 2016-08-13 at 21:51
  • The K40C should be compatible with any standard motherboard just fine. The compatibility that hardware vendors stress is often assumed for data centers where the cards run hot and need to do so permanently for many months or years. The K40 has a standard PCIe connector, and that is all that you need from your motherboard.
  • Reply
  • chanhyuk jung says
  • 2016-08-16 at 09:02
  • I just started learning about neural networks and I’m looking forward to studying them. I have a GT 620 with a dual-core Pentium G2020 clocked at 3.3 GHz and 8GB of RAM. Would it be better to buy a 1060 and two 8GB RAM sticks for the future?
  • Reply
  • Tim Dettmers says
  • 2016-08-17 at 06:15
  • Yes; the GT 620 does not support cuDNN, which is important deep learning software and makes deep learning much more convenient because it allows you more freedom in choosing your deep learning framework. You will have fewer troubles if you buy a GTX 1060. 16GB of RAM will be more than enough; I think even 8GB could be okay. Your CPU will be sufficient, no upgrade required.
  • Reply
  • sk06 says
  • 2016-08-17 at 12:54
  • Hi,
  • I just bought two Supermicro 7048GR-TR server machines with 4 Titan X cards in each machine. I’m confused about how to configure the servers: how many partitions to make, and how to utilize the 256GB SSD and the two 4TB hard drives in each machine. The servers will only be used for deep learning applications. Which deep learning framework should I use (TensorFlow, Caffe or Torch), considering there are two servers? I work in the medical imaging domain and recently started getting into deep learning. Please help me with your valuable suggestions.
  • Link for server configuration:
  • https://www.supermicro.com.tw/products/system/4u/7048/SYS-7048GR-TR.cfm
  • Thanks and Regards
  • sk06
  • Reply
  • Tim Dettmers says
  • 2016-08-18 at 03:37
  • The servers have a slow interconnect, that is, they only have gigabit Ethernet, which is a bit too slow for parallelism across machines. So you can focus on setting up each server separately. It depends on your dataset size, but you might want to dedicate the SSD to your datasets, that is, install the OS on the hard drive. If your datasets are < 200GB, you could also install the OS on the SSD for a smoother user experience. The frameworks all have their pros and cons. In general I would recommend TensorFlow, since it has the fastest-growing community.
  • Reply
  • sk06 says
  • 2016-08-24 at 05:10
  • Thanks for the suggestions. I tried training my application with 4 GPUs in the new server. To my shock, training AlexNet took 2.30 hrs with 4 GPUs, while it took 35 mins with a single GPU. I used Caffe for this. Please let me know where I am going wrong! The batch size and other parameter settings are the same as in the original paper.
  • Thanks and Regards
  • sk06
  • Reply
  • Ionut Farcas says
  • 2016-08-17 at 14:09
  • First of all, really nice blog and well-made articles.
  • Do you think that spending £240 more for a 1070 (2048 CUDA cores) instead of a 1060 (1280 CUDA cores) is worth it for a laptop? Does the complexity of the most commonly used deep learning algorithms require the extra 768 CUDA cores?
  • Thank you.
  • Reply
  • Tim Dettmers says
  • 2016-08-18 at 03:40
  • I am not sure how easy it is to upgrade the GPU in the laptop. If it is difficult, this might be one reason to go with the better GPU since you will probably also have it for many years. If it is easy to change, then there is not really a right/wrong choice. It all comes down to preference, what you want to do and how much money you have for your hardware and for your future hardware.
  • Reply
  • Arman says
  • 2016-08-26 at 17:49
  • Hi Tim,
  • I had a question about the new Pascal GPUs. I am debating between the GTX 1080 and the Titan X. The price of the Titan X is almost double the 1080’s. Excluding the fact that the Titan X has 4 GB more memory, does it provide a significant speed improvement over the 1080 to justify the price difference?
  • Thanks,
  • Reply
  • Juan says
  • 2016-09-05 at 00:24
  • Hi,
  • I am not Tim (obviously), but as far as I understood from his other post on GPUs (http://timdettmers.com/2014/08/14/which-gpu-for-deep-learning/), he states that for research-level work it actually does make a difference, especially when you are using video datasets. But, for example: “While 12GB of memory are essential for state-of-the-art results on ImageNet, on a similar dataset with 112x112x3 dimensions we might get state-of-the-art results with just 4-6GB of memory.”
  • Hope this can help you.
  • Reply
  • DarkIdeals says
  • 2016-09-10 at 05:55
  • If you can afford it, the TITAN X is DEFINITELY worth it over the 1080 in most cases. Not only does it have that 12GB of VRAM to work with, but it also has features like INT8 (the way I understand it, you can store floats as 8-bit integers, which helps efficiency etc., so potentially quite useful) and has 44 TOP units (kinda like ROPs, but not for graphics rendering; they are beneficial to deep learning though).
  • Basically the TITAN X is nearly identical to the $7000 Tesla P100, just without the double-precision FP64 capability and without HBM2 memory (the TITAN X uses GDDR5X instead; however, it’s not much of a difference, as the P100’s memory bandwidth even with HBM is only 540 GB/s, whereas the TITAN X is very close at 480 GB/s and hits 530 GB/s when you overclock the memory from 10,000MHz to 11,000MHz, so there is little difference really). Other than those things and the certified Tesla drivers, there is no real difference between the P100 and the TITAN X Pascal, which is very important, as the Tesla P100 is THE most powerful graphics card on the planet right now!
  • The important thing to mention is that double precision isn’t really important for the neural nets you deal with in deep learning; so for $1,200 you are getting the power of the $7,000 monster supercomputer chip of the Tesla P100, just without all the unnecessary server features that deep learning doesn’t use.
  • Also, in comparison to the GTX 1080, the TITAN X has a significant advantage in both memory capacity (12GB vs 8GB on the 1080) and memory bandwidth (530 GB/s when overclocked on the TITAN X vs 350 GB/s on the 1080 when overclocked, roughly a fifty percent increase), and it has a massive increase in CUDA cores, which is very beneficial (40% more, which when combined with the double memory capacity and 50% higher bandwidth easily nets you ~60% more performance in some scenarios over the 1080).
  • Hope this helps; the TITAN X is a GREAT chip for deep learning, the best currently available in my opinion. Which is why I bought two of them.
  • Reply
  • DarkIdeals says
  • 2016-11-09 at 01:32
  • (sorry for the long post but it is important to your decision so try to read it all if you have time)
  • Hey, correcting an error in my earlier post. Like I said, I wasn’t quite sure if I understood the INT8 functionality properly, and I was wrong about it. Apparently there was a typo on the spec pages of the Pascal TITAN X: it said “44 TOPs” and made me think it was an operation pipeline of sorts, similar to a “ROP”, which is responsible for displaying graphical images etc.
  • It actually was referring to INT8, which is basically just 8-bit integer support. The average GPU runs with 32-bit “full precision” accuracy, which is a measure of how much time and effort is put into each calculation made by the GPU. For example, with 32-bit it may only go out to 4 decimal points when calculating the physics of water in a 3D render, which is plenty good for things like video games and your average video editing and rendering project. But for things like advanced physics calculations by big universities that are trying to determine the fully accurate behavior of each individual molecule of H2O within a body of water, to see EXACTLY how it moves when wind blows, you would need “double precision”: a 64-bit calculation with much more accuracy, going to more decimal points before deciding that the calculation is “close enough” than 32-bit would.
  • Only special cards like Quadros and Teslas have high 64-bit performance; they usually have half the teraflops in 64-bit mode compared to 32-bit, so a Quadro P6000 (same GPU as the TITAN XP but with full 64-bit support) has 12 teraflops at 32-bit and ~6 teraflops in 64-bit mode. But there is also a 16-bit “half precision” mode for things requiring even less accuracy; INT8, to my understanding, is basically an “8-bit quarter precision” mode, with even less focus on total mathematical accuracy, and this is useful for deep learning, as some of the work done doesn’t require that much accuracy.
  • So, in other words, in 8-bit mode the TITAN X has “44 teraflops” of performance.
  • Reply
  • Tim Dettmers says
  • 2016-11-10 at 23:09
  • Your analysis is very much correct. For some games there are already elements which make heavy use of 8-bit integers. Before, however, it was not possible to do 8-bit integer computation directly; you had to first convert both numbers to 32-bit, do the computation, and then convert the result back. This was done implicitly by the GPU so that no extra programming was necessary. Now the GPU is able to do it on its own. However, software support is still quite limited, so you will not see 8-bit deep learning just yet. Probably in a year at the earliest would be my guess, but I am sure it will arrive at some point. The sketch below shows the basic quantization idea behind 8-bit inference.
  • Reply
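  • To make the 8-bit idea concrete, here is a rough numpy sketch of symmetric quantization: weights are stored as int8 plus one float scale and dequantized (or accumulated at higher precision) when computing. The scaling scheme is a simple one chosen for illustration, not NVIDIA’s actual INT8 path.
```python
# Rough sketch of symmetric 8-bit quantization (illustrative scheme only).
import numpy as np

w = np.random.randn(4, 4).astype(np.float32)    # original float32 weights
scale = np.abs(w).max() / 127.0                 # one scale for the whole tensor
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

w_dequant = w_int8.astype(np.float32) * scale   # back to float for comparison
print("max quantization error:", np.abs(w - w_dequant).max())
```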
  • Gilberto says
  • 2016-09-08 at 09:39
  • Hi Tim,
  • first of all thank you for sharing all these precious information.
  • I am new to neural network and python.
  • I want to test some ideas on financial time series.
  • I’m starting to learn python, theano, keras.
  • After reading your article, I decided to upgrade my old pc.
  • I know almost nothing about hardware so I ask you an opinion about it.
  • Current configuration:
  • – Motherboard: Gigabyte GA-P55A-UD3 (specification at: http://www.gigabyte.com/products/product-page.aspx?pid=3439#sp)
  • – Intel i5 2.93 GHz
  • – 8 Gb Ram
  • – GTX 980
  • – PSU power: 550watts
  • I may add:
  • – Ssd Hard Drive (I will install Ubuntu and use it only by command line – not graphical interface)
  • The power supply is powerful enough for the new card?
  • Does the motherboard support the new card?
  • Thank you very much,
  • Gilberto
  • Reply
  • Tim Dettmers says
  • 2016-09-10 at 03:52
  • The motherboard should work, but it will be a bit slower. The PSU is borderline; it might be a bit too little wattage or just right, it’s hard to tell.
  • Reply
  • Tim Dettmers says
  • 2016-09-17 at 16:33
  • It is a bit pricey and there are not many details about the motherboard. Also, the GPU might be a bit weak for research work.
  • I would also encourage you to buy the components and assemble them on your own. This may seem like a daunting task, but it is much easier than it looks. This way you get a high-quality machine that is cheap at the same time.
  • Reply
  • Tim Dettmers says
  • 2016-09-23 at 12:43
  • Your GPU has compute capability 2.1 and you need at least 3.0 for most libraries — so no, your computer does not support deep learning on GPUs (the sketch below shows one way to query the compute capability of a card). You could still run deep learning code on the CPU, but it would be quite slow.
  • Reply
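  • One way to check the compute capability of an installed card from Python is with PyCUDA (assuming a working CUDA install and the pycuda package); the table at https://developer.nvidia.com/cuda-gpus lists the same numbers per model. A sketch:
```python
# Sketch: query the compute capability of each CUDA device with PyCUDA.
import pycuda.driver as drv

drv.init()
for i in range(drv.Device.count()):
    dev = drv.Device(i)
    major, minor = dev.compute_capability()
    print("GPU %d: %s, compute capability %d.%d" % (i, dev.name(), major, minor))
```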
  • zac zhang says
  • 2016-09-22 at 11:40
  • Awesome! Thanks for sharing. Can you tell me how much it would cost to build such a cluster? Cheers!
  • Reply
  • Tim Dettmers says
  • 2016-09-23 at 12:45
  • Basically it is two regular deep learning systems together with InfiniBand cards. You can get InfiniBand cards and a cable quite cheaply on eBay, and the total cost for a 6-GPU, 2-node system would be about $3k for the systems and InfiniBand cards, plus an additional $6k for the GPUs (if you use the Pascal GTX Titan X), for a total of $9k.
  • Reply
  • Toqi Tahamid says
  • 2016-09-25 at 13:40
  • My current CPU is an Intel Core i3 2100 @ 3.1GHz and I have 4GB of RAM. My motherboard is a Gigabyte GA-H61M-S2P-B3 (rev. 1.0), which supports PCIe 2.0. Can I use a GTX 1060 in my current configuration, or do I need to change the board and the CPU? I want to keep the cost as low as possible.
  • Reply
  • Tim Dettmers says
  • 2016-09-26 at 12:54
  • You should be able to run a GTX 1060 just fine. The performance should be only 5-10% less than on an optimal system.
  • Reply
  • anon says
  • 2016-10-02 at 01:09
  • Hi Tim,
  • I just got 5 Dell Precision T7500s in an auction.
  • I haven’t received them yet, but the description mentions an NVIDIA Quadro 5000 installed.
  • Would it be worth replacing them, or are they enough for starting out?
  • The machines themselves have 12GB of DDR3 (ECC, I presume) RAM and a Xeon 5606, as described.
  • Reply
  • Tim Dettmers says
  • 2016-10-03 at 15:04
  • The Quadro 5000 has only a compute capability of 2.0 and thus will not work with most deep learning libraries that use cuDNN. Thus it might be better to upgrade.
  • Reply
  • anon says
  • 2016-10-04 at 19:59
  • Thanks.
  • I am thinking of going with a GTX 1060.
  • Is there any difference, though, between the EVGA, ASUS, MSI or NVIDIA versions?
  • These are the options I see when searching on eBay.
  • Reply
  • Gautam Sharma says
  • 2016-11-21 at 16:40
  • That shouldn’t matter much. Don’t go with the NVIDIA Founders Edition; it doesn’t have a good cooling system. Just go with the cheapest one, which is the EVGA. It is one of the most promising brands. I just ordered the EVGA one.
  • Reply
  • Tim Dettmers says
  • 2016-11-21 at 20:02
  • Please note that the EVGA GTX 1080 currently has cooling problems which are only fixed by flashing the BIOS of the GPU. This card may begin to burn without this BIOS update.
  • anon says
  • 2016-10-19 at 18:07
  • Hi Tim,
  • Could you recommend any Mellanox ConnectX-2 cards for GPU RDMA?
  • Some are just Ethernet (the MNPA19-XTR, for example), and I wonder if those can be flashed to support RDMA, or whether I should just buy a card which supports InfiniBand outright?
  • Reply
  • Ashiq says
  • 2016-10-19 at 19:06
  • Hi Tim
  • Thanks for the great article and your patience to answer all the questions. I just built a dev box with 4 Titan X Pascal and need some advice on air flow. For reference, here is the Part list: https://pcpartpicker.com/list/W2PzvV and the Picture: http://imgur.com/bGoGVXu
  • I loaded Windows first for stress testing the components and noticed the GPU temps reached 84C while the fans were still at 50%. The GPUs then started slowing down to lower/maintain the temp. With MSI Afterburner, I could specify a custom temp-vs-fan-speed profile and keep the GPU temps at 77C or below – pretty much what you wrote in the cooling section above.
  • There is no “Afterburner” for Linux, and apparently the BIOS of the Titan X Pascal is locked, so we can’t flash it with a custom temp setting. The only option left for me is to play with Coolbits, and I prefer not to attach 4 monitors to it (I already have two 30-inch monitors attached to a Windows computer that I use for everything; 6 monitors on the table would be too much).
  • I wonder if you have found any new way of emulating monitors for Xorg, as my preferred option would be to keep 3 of the GPUs headless?
  • Cheers
  • Ashiq
  • Reply
  • Tim Dettmers says
  • 2016-10-24 at 14:16
  • I did not succeed in emulating monitors myself. Some others claim that they got it working. I think the easiest way to increase the fan speed would be to flash the GPU with a custom BIOS. That way it will work in both Windows and Linux.
  • Reply
  • spuddler says
  • 2016-10-26 at 15:34
  • Not sure, but maybe there exist specific dummy plugs to help “emulate” monitors, if it’s not possible purely in software. At least DVI and HDMI dummy plugs worked for cryptocurrency miners back in the day.
  • Reply
  • Ashiq says
  • 2016-10-27 at 03:41
  • So I got it (virtual screens with Coolbits) working by following the clues from http://goo.gl/FvkGC7. Here (https://goo.gl/kE3Bcs) is my X server config file (/etc/X11/xorg.conf), and I can change all 4 fan speeds with nvidia-settings (one way to script this is sketched below).
  • Reply
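  • For reference, once Coolbits and the virtual screens are configured, fan speeds can also be scripted through nvidia-settings. A small Python wrapper is sketched below; the attribute names (GPUFanControlState, GPUTargetFanSpeed) are the ones recent drivers use, but verify them on your system (for example with nvidia-settings -q fans) before relying on this.
```python
# Sketch: set a fixed fan speed per GPU via nvidia-settings. Requires Coolbits
# to be enabled in xorg.conf; attribute names may differ between driver versions.
import subprocess

def set_fan_speed(gpu_index, fan_index, percent):
    subprocess.check_call([
        "nvidia-settings",
        "-a", "[gpu:%d]/GPUFanControlState=1" % gpu_index,
        "-a", "[fan:%d]/GPUTargetFanSpeed=%d" % (fan_index, percent),
    ])

for i in range(4):   # four Titan X Pascals, one fan each (an assumption)
    set_fan_speed(i, i, 80)
```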
  • Shashwat Gupta says
  • 2016-10-21 at 14:26
  • Hey, I wanted to ask if the NVIDIA Quadro K4000 would be a good choice for running convolutional nets?
  • Reply
  • Tim Dettmers says
  • 2016-10-24 at 14:08
  • A K4000 will work, but it will be slow and you cannot run big models on large datasets such as ImageNet.
  • Reply
  • Arthur says
  • 2016-10-24 at 22:24
  • Great hardware guide. Thank you for sharing your knowledge.
  • Reply
  • Prasanna Dixit J says
  • 2016-10-28 at 06:38
  • This is a good overview of the hardware that matters for DL. I would like your view on the OpenPOWER-NVIDIA combo, and on the economics of setting up an ML/DL lab.
  • Reply
  • Tim Dettmers says
  • 2016-11-07 at 11:21
  • I think that non-consumer hardware is not very economically efficient for setting up an ML/DL lab. However, beyond a certain number of GPUs, traditional consumer hardware is no longer an option (NVIDIA will not sell you consumer-grade GPUs in bulk, and there might also be problems with reliability). I would recommend getting as much traditional, cheap, consumer hardware as possible and mixing it with some HPC components like cheap Mellanox InfiniBand cards and switches from eBay.
  • Reply
  • Poornachandra Sandur says
  • 2016-10-30 at 09:47
  • Hi Tim,
  • Thank you for sharing your knowledge; it was very beneficial for understanding the concepts in DL.
  • I have a question:
  • How do I feed custom images into a CNN for object recognition using Python? Please give some pointers on this.
  • Reply
  • Tim Dettmers says
  • 2016-11-07 at 11:25
  • You will need to rescale custom images to a specific size so that you can feed your data into a CNN. I recommend looking at the ImageNet examples of common libraries (Torch7, TensorFlow) to understand the data loading process. You will then need to write an extension which resizes your images to the proper dimensions, for example 1080×1920 -> 224×224 (a minimal sketch of that resizing step follows below).
  • Reply
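  • A minimal sketch of that resizing step, using Pillow and numpy (my own choice of libraries; the folder name is a placeholder and framework-specific normalization is omitted):
```python
# Sketch: load custom images, resize them to the network's input size and
# stack them into one batch array (Pillow + numpy).
import glob
import numpy as np
from PIL import Image

def load_batch(pattern, size=(224, 224)):
    images = []
    for path in sorted(glob.glob(pattern)):
        img = Image.open(path).convert("RGB").resize(size, Image.BILINEAR)
        images.append(np.asarray(img, dtype=np.float32) / 255.0)
    return np.stack(images)            # shape: (N, 224, 224, 3)

batch = load_batch("my_images/*.jpg")  # hypothetical folder of custom images
print(batch.shape)
```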
  • Alisher says
  • 2016-11-08 at 06:36
  • Firstly, I am very thankful for your post. It is very nice and very helpful.
  • One thing I wanted to point out is that you can feed the images into the network (in Caffe) as they are. I mean, if you have a 1080×1920 image, there is no need to reshape it to 224×224. But this does not mean that feeding the image as-is performs better; I think this could be a standalone research topic.
  • Secondly, I am planning to buy a desktop PC, and since I am a (beginner) deep learning researcher I am going to do a lot of experiments on ImageNet and other large-scale datasets. Do you suggest buying a gaming PC directly, or would it be a wiser choice to build my own PC?
  • I was considering to buy Asus ROG G20CB P1070.
  • Thank you very much in advance!
  • Regards,
  • Reply
  • Tim Dettmers says
  • 2016-11-10 at 23:03
  • Building your own PC would be the better choice in the long term. It can be daunting at first, but it is often easier than assembling IKEA furniture, and unlike IKEA furniture there is a multitude of resources on how to do it step by step. After you have built your first desktop, building the next ones will be easy and rewarding, and you will save a lot of money to boot!
  • Reply
  • panovr says
  • 2016-11-01 at 02:53
  • Great article, and thanks for sharing!
  • I want to configure my working layout like yours: “Typical monitor layout when I do deep learning: Left: Papers, Google searches, gmail, stackoverflow; middle: Code; right: Output windows, R, folders, systems monitors, GPU monitors, to-do list, and other small applications.”
  • Do I need extra configuration in addition to connecting 3 monitors to the motherboard? Is there any additional hardware needed for this 3-monitor configuration?
  • Thanks!
  • Reply
  • Tim Dettmers says
  • 2016-11-07 at 11:27
  • No extra configuration is required other than the normal monitor configuration for your operating system. Your GPU needs to have enough connectors and support 3 monitors (most modern GPUs do).
  • Reply
  • Hrishikesh Waikar says
  • 2016-11-04 at 07:58
  • Hi Tim ,
  • Wonderful article. However, I am about to buy a new laptop. What do you think about the idea of a gaming laptop for deep learning, with an NVIDIA GTX 980M or GTX 1060/1070?
  • Reply
  • Tim Dettmers says
  • 2016-11-07 at 11:33
  • Definitely go for the GTX 10 series GPUs for your laptop, since these are very similar to full desktop GPUs. They are probably more expensive though. Another option would be to buy a cheap, light laptop with long battery life and a separate desktop to which you connect remotely to run your deep learning work. The latter option is what I use, and I am quite fond of it.
  • Reply
  • Alisher says
  • 2016-11-08 at 06:42
  • I am very happy that I thought the same as you. I bought a MacBook Air, which is very portable, and I am going to buy a desktop with better specifications to do my experiments on.
  • I had a question, but I have asked it in a previous comment.
  • Thank you again for the very useful information.
  • Regards,
  • Reply
  • Shahid says
  • 2016-11-10 at 10:51
  • I am confused between two options:
  • 1) A 2nd Generation core i5, 8GB DDR3 RAM and a GTX 960 for $350.
  • 2) A 6th Generation core i3, 16GB DDR3 RAM and a GTX 750Ti for $480.
  • Can you please comment? I expect to upgrade my GPU after a few months.
  • Reply
  • Tim Dettmers says
  • 2016-11-10 at 23:28
  • A difficult choice. If you upgrade your GPU in a few months, then it depends on whether you use your desktop only for deep learning or also for other tasks. If you use your machine regularly, I would spend the extra money and go for option (2). If you want to do almost exclusively deep learning with the machine, (1) is a good, cheap choice. The choice also depends on whether you buy the 2GB or 4GB variant of each GPU. In terms of speed, (1) will be about 33-50% faster, but speed is not too important when you start out with deep learning, especially if you upgrade the GPU eventually.
  • Reply
  • Shahid says
  • 2016-11-11 at 06:56
  • Thank you Tim, you really inspire me! Actually I took the Udacity SDCND course, and here is the list of a few projects I want to accomplish on a local machine:
  • 1. Road Lane-Finding Using Cameras (OpenCV)
  • 2. Traffic Sign Classification (Deep Learning)
  • 3. Behavioral Cloning
  • 4. Advanced Lane-Finding (OpenCV)
  • 5. Vehicle Tracking Project (Machine Learning and Vision)
  • So my work is solely related to computer vision and deep learning. I also have the option of a GTX 1060 6GB with that Core i3 (option 2). Of course, I expect to code the GPU versions of the OpenCV tasks. Do you think this third option would be sufficient to accomplish these projects in an average amount of time? Thank you again.
  • Reply
  • Gautam Sharma says
  • 2016-11-21 at 16:52
  • Hi Shahid. I’m in the same boat as you; I have also signed up for the SDCND. I have an old PC with a Core i3 and 2GB of RAM. I am adding an additional 8GB of RAM and buying a GTX 1060 6GB. This is a really powerful GPU which will perform great for our work associated with the SDCND.
  • Reply
  • JP Colomer says
  • 2016-11-24 at 22:47
  • Hi Tim,
  • Thank you for this excellent guide.
  • I was wondering, now that the new 1000 series and Titan X came out, what are your updated suggestions for GPUs (no money, best performance, etc)?
  • Reply
  • JP Colomer says
  • 2016-12-05 at 07:19
  • Thank you, Tim. I ended up buying a GTX 1070.
  • Now, I have to purchase the MOBO. I’m deciding between a GIGABYTE GA-X99P-SLI and a Supermicro C7X99-OCE-F.
  • Both support 4 GPUs but it seems that there is not enough space for a 4th GPU on the Supermicro. Any experience with these MOBOs?
  • This is my draft https://pcpartpicker.com/list/6tq8bj
  • Reply
  • Tim Dettmers says
  • 2016-12-13 at 12:18
  • Indeed, the Supermicro motherboard will not be able to hold a 4th GPU. I also have a Gigabyte motherboard (although a different one) and it worked well with 4 GPUs (while I had problems with an ASUS one), but I think in general most motherboards will work just fine. So seems like a good choice.
  • Reply
  • Mor says
  • 2016-11-28 at 19:16
  • Hi Tim,
  • I am willing to buy full hardware for deep learning;
  • my budget is about $15,000.
  • I don’t have any experience with this, and when I tried to check things out it was too complicated for me to understand.
  • Can you help me? Maybe recommend companies or anything else that suits my budget and would still be good enough to work with?
  • Thanks a lot
  • Reply
  • Tim Dettmers says
  • 2016-11-29 at 16:03
  • If I were you, I would put together a PC on pcpartpicker.com with 4 GPUs and then build it myself. This is the cheapest option. If that is too difficult, then I would look for companies that sell deep learning desktops. They basically sell the same hardware, but at a higher price.
  • Reply
  • Gordon says
  • 2016-12-01 at 13:21
  • Thank you very much for writing this! – knowing something about how to evaluate the hardware is something I have been struggling to get my head around.
  • I have been playing with TensorFlow on the CPU on a pretty nice laptop (fast i7 with lots of RAM and an SSD but ultimately dual core so slow as hell).
  • I want to try something on the GPU to see if it is really hundreds of times faster, but I am worried about investing too much too soon, as I have not had a desktop in ages. Having read this post and the comments, I have the following plan:
  • Use an existing FreeNAS server I have as a test bed and buy a relatively low-end GPU – a GTX 960 4096MB:
  • https://www.overclockers.co.uk/msi-geforce-gtx-960-4096mb-gddr5-pci-express-graphics-card-gtx-960-4gd5t-oc-gx-319-ms.html
  • The FreeNAS box has a crappy Celeron core 2 3.2 dual core and only 8GB of RAM:
  • http://ark.intel.com/products/53418/Intel-Celeron-Processor-G550-2M-Cache-2_60-GHz
  • I will buy the graphics card and an SSD to install an alternative OS on. I *may* upgrade the RAM and processor too, as all of these items will benefit the FreeNAS box anyway (I also run Plex on it).
  • If this goes well and I develop further, I will look at a whole new setup later with an appropriate motherboard, CPU, etc., but in the meantime I can learn how to identify where my specific bottlenecks are likely to be.
  • From what you have said here I think there will be several slow parts to my system, but I am probably going to get 80-90% of the speed of the graphics card, the main restriction being that the CPU only supports PCIe 2.0 – everything else, while not ideal or scalable, can probably feed that GPU fast enough.
  • I have 2 questions (if you have time – sorry for the long comment, but I wanted to make my situation clear):
  • 1. Do you see anything drastically wrong with this approach? No guarantees obviously; I could spend more money now if I am just shooting myself in the foot, but I would rather save it for the next system once I am fully committed and have more experience.
  • 2. I chose the GPU based on RAM, number of CUDA cores and the NVIDIA compute capability rating (which reminds me of the Windows performance rating – a bit vague but better than nothing). The other one I was considering was this, £13 more so also a fine price IMHO:
  • https://www.overclockers.co.uk/palit-geforce-gtx-1050ti-stormx-4096mb-pci-express-gddr5-graphics-card-gx-03t-pl.html
  • It has fewer cores (768 vs 1024) but a smaller manufacturing process, a higher clock speed (1290MHz vs 1178MHz), and I *think* a higher rating, assuming that the Ti is just better (it seems to mean unlocked): 6.1 vs 5.2:
  • https://developer.nvidia.com/cuda-gpus#collapse4
  • Basically, is the drop in cores really made up for to such an extent that this significantly higher rating from NVIDIA is accurate? Noting that I am probably going to be happy enough either way – feel free to just say “either is probably fine”.
  • Alternatively, is there something else in the sub-£150-ish range that you would suggest, given that the whole thing may be replaced by a Titan X or similar (hopefully cheaper after Christmas) if this goes well? I did consider just getting something like this: much less RAM but still more cores than 2, and it would allow me to figure out how to get code running on the GPU:
  • https://www.overclockers.co.uk/asus-geforce-gt-710-silent-1024mb-gddr3-pci-express-graphics-card-gx-396-as.html
  • Reply
  • Gordon says
  • 2016-12-02 at 10:49
  • Got the 1050 Ti (well, another variation of it); I figured they would be similar regardless, so I might as well trust NVIDIA’s rating.
  • https://www.amazon.co.uk/gp/product/B01M66IJ55/
  • Also got 32GB of RAM and a quad-core i5 that supports PCIe 3.0, as they were all cheap on eBay (SSD too, of course).
  • Looks like I can mount my ZFS pool in Ubuntu, so I will probably just take FreeNAS offline for a while and use this as a file and Plex server too (very few users anyway), and this way my RAID array will be local should I want to use it.
  • Reply
  • Tim Dettmers says
  • 2016-12-02 at 20:18
  • That sounds solid. With that you should easily get started with deep learning. The setup sounds good if you want to try out some deep learning on Kaggle.com for example.
  • Reply
  • Tim Dettmers says
  • 2016-12-02 at 20:16
  • Upgrading the system bit by bit may make sense. Note that the CPU and RAM will make no difference to deep learning performance, but might be interesting for other applications. If you only use one GPU, PCIe 2.0 will be fine and will not hurt performance. The GTX 960 and GTX 1050 Ti are on a par in terms of performance, so pick whichever is more convenient / cheaper for you.
  • Reply
  • Tim Dettmers says
  • 2016-12-13 at 12:14
  • Hi Om,
  • I am really glad that you found the resources of my website useful — thank you for your kind words!
  • The thing with the NVIDIA Titan X (Pascal) and the GTX 1080 is that they use different chips which cannot communicate in parallel. So you would be unable to parallelize a model on these two GPUs. However, you would be able to run different models on each GPU, or you could get another GTX 1080 and parallelize on those GPUs.
  • Note that using an Ubuntu VM can cause some problems with GPU support. The last time I checked it was hardly possible to get GPU acceleration running through a VM, but things might have changed since then. So I urge you to check whether this is possible before you go down this route (a quick way to check whether a framework sees the GPU is sketched below).
  • Best,
  • Tim
  • Reply
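  • A quick way to check whether a framework actually sees the GPU from inside a (virtual) machine is to list its devices, here with TensorFlow (assuming a GPU-enabled TensorFlow install; device name formats vary slightly between versions):
```python
# Sketch: list the devices TensorFlow can see; an empty GPU list means the
# VM or driver setup is not exposing the GPU.
from tensorflow.python.client import device_lib

devices = device_lib.list_local_devices()
gpus = [d.name for d in devices if d.device_type == "GPU"]
print("Visible GPUs:", gpus)
```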
  • Table Salt says
  • 2016-12-08 at 15:56
  • Hi Tim, thanks for the excellent posts, and keep up the good work.
  • I am just beginning to experiment with deep learning and I’m interested in generative models like RNNs (probably models like LSTMs, I think). I can’t spend more than $2k (maybe up to $2.3k), so I think I will have to go with a 16-lane CPU. Then I have a choice of either a single Titan X Pascal or two 1080s. (Alternatively, I could buy a 40-lane CPU, preserving upgradability, but then I could only buy a single 1080). Do you have any advice specific to RNNs in this situation? Is model parallelism a viable option for RNNs in general and LSTMs in particular?
  • Thank you!
  • Reply
  • Tim Dettmers says
  • 2016-12-13 at 12:24
  • I think you can apply 75% of state-of-the-art LSTM models to different tasks with a GTX 1080; for the other 25% you can often create a “smarter” architecture which uses less memory and achieves comparable results. So I think you should go for 16 lanes and two GTX 1080s. Make sure your CPU supports two GPUs in an 8x/8x setting.
  • Reply
  • Nader says
  • 2016-12-11 at 16:10
  • Should I buy a GTX 1080 now, or wait for the Ti which is supposedly coming out next month?
  • Reply
  • Tim Dettmers says
  • 2016-12-13 at 12:36
  • The GTX 1080 Ti will be better in every way. Make sure, however, to preorder it or something; otherwise all cards might be bought up quickly and you will have to fall back to the GTX 1080. Another strategy might be to wait a bit longer for the GTX 1080 Ti to arrive and then buy a cheap GTX 1080 from eBay. I think these two choices make sense if you can wait a month or two.
  • Reply





