Facebook Trains ImageNet in 1 Hour

tanilama · on June 13, 2017

A practically interesting paper. Some insights:

1. Larger batches require a larger learning rate. This paper shows that the learning rate can even scale linearly with the batch size, which leads to extremely large learning rates and batch sizes.

2. Larger batches make the initial phase of learning difficult, so this paper proposes a warm-up period in which, during the initial epochs, the learning rate grows gradually from a smaller value to a larger one.
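A rough sketch of those two points, not the paper's code. The function names, the base rate of 0.1 per 256 images, and the 5-epoch linear ramp are assumptions chosen for illustration:

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling: the learning rate grows in proportion to the batch size."""
    return base_lr * batch_size / base_batch


def warmup_lr(epoch, batch_size, warmup_epochs=5, base_lr=0.1, base_batch=256):
    """Ramp linearly from base_lr to the scaled rate over the first epochs.

    `epoch` may be fractional, so the rate can be updated every iteration.
    """
    target = scaled_lr(batch_size, base_lr, base_batch)
    if epoch >= warmup_epochs:
        return target
    # Assumed starting point: the small-batch rate, which is known to train stably.
    return base_lr + (target - base_lr) * epoch / warmup_epochs


if __name__ == "__main__":
    for epoch in (0, 1, 2.5, 5, 10):
        print(epoch, warmup_lr(epoch, batch_size=8192))
```

Under these assumed defaults, a batch of 8192 images ends up with a rate 32 times larger than a batch of 256, and the ramp only affects the first few epochs.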

But if you are not Google/Facebook/Amazon/Microsoft, the experimental setting is unrealistic for you. The best AWS instances don't come with a 50 Gbit network. For now, the rest of us will still stick to at most 8 GPUs on a single node, even if your soul screams distributed :/

jamesblonde · on June 13, 2017

I disagree that this is not an architecture for others. We are a research lab, and we're building a distributed cluster. Cheap infiniband. Lots of gtx1080Tis. It's not that expensive to have a 40-GPU cluster with 10 4U servers (about 100K Euro).

gm-conspiracy · on June 13, 2017

What mobo/CPU are you using? How much RAM in each server?

Can you provide some more detailed specs?

Thanks!

jamesblonde · on June 13, 2017

CPU is not that important. A motherboard that has PCIe 3.0 and 7 PCIe x16 slots is ok for at least 3 GPUs. Each GPU typically takes two slots. Then, a >1400 Watt PSU. Here is a relatively cheap box that also has 8 disk slots (for hadoop): https://exxactcorp.com/index.php/solution/solu_detail/320

Retric · on June 13, 2017

Some people are using water cooling setups to get 7 GPU's on a 7 PCI slot MB. You need to cut off the DVI that would otherwise occupy another slot, but overall it's not that time consuming and reduces your networking needs.

EX: https://www.youtube.com/watch?v=9hsQmcSwGv0

jamesblonde · on June 13, 2017

You can see the potential scale-out you can get for Tensorflow in this issue:

https://github.com/tensorflow/tensorflow/issues/2916

malux85 · on June 13, 2017

Are you going to rent access to your cluster? :)

jamesblonde · on June 13, 2017

To users in Sweden, at www.hops.site. It provides Jupyter/Hadoop/Spark/Tensorflow and GPUs (as a resource in YARN).

cs702 · on June 13, 2017

The approach could prove useful in a single machine too -- for example, in cases where increasing batch size might improve the efficiency of shuffling data back and forth from main memory to GPU memory. This provides an easy recipe for trying larger batch sizes to increase GPU usage.

nl · on June 13, 2017

10-node GPU clusters are relatively common at, e.g., universities.

dgacmu · on June 13, 2017

Concur. I threw 2x GTX 980s in one of our testbeds two years ago just to make it more generally useful, giving us a 20 GPU cluster. Next time we do it, I'll aim for 4x1080ti or the equivalent. It's painful to power and cool those in lower-end university machine rooms, but it's possible. 10x 4x1080ti machines is only about 15kW, which is high but not insane. It's also only about $60k USD. Also something that a pool of faculty can afford.

I've been experimenting with really low-end builds for that design recently. https://goo.gl/photos/6bDLjJqGAwhG7hGP9

That's a consumer/gaming motherboard (asrock supercarrier), which doesn't have enough PCIe bandwidth to support the cards, but part of what we're researching are ways to reduce synchronization bandwidth. I wouldn't recommend that route as a general approach, though - not flexible enough for future uses. The 8x 1080ti Supermicro build posted a few days ago is probably a better choice: https://news.ycombinator.com/item?id=14508928

The problem is that one student can easily tie up the entire cluster for half the duration of her Ph.D. Machine learning people have voracious appetites for compute. :)

jamesblonde · on June 13, 2017

Nice. We have added support for GPUs to YARN in Hops Hadoop, and users have CPU and GPU quotas in our version of YARN. Jobs are launched as either Tensorflow on Spark or Distributed Tensorflow on YARN. If a user exceeds their quota, we currently allow the job to complete, but give them a ticking off. We could also abort jobs on quota violations, but as you alluded to, after 5 days of training, aborting that job would be a world of pain for us.

Seanny123 · on June 13, 2017

tl;dr they found a clever way to spread the training across 256 GPUs by synchronising the stochastic gradient descent

eggie5 · on June 13, 2017

it is a very interesting idea, considering how GD updates are inherently a serial operation. I've always wondered how they pulled it off in spark... I've also read about hogwild which allows parallel SGD on sparse datasets...

randyrand · on June 13, 2017

this seems like the trivial, most obvious way to parallelize training across GPUs. Not, imo, clever.

The important bit here is that they've shown that large mini-batch sizes can still maintain accuracy if you scale the learning rate with the batch size and warm it up gradually.

opportune · on June 13, 2017

Just because it's simple to explain at a high level doesn't make it trivial.

Plenty of theoretically trivial solutions to problems are absolute pains to implement. I mean there are entire companies that at their core solve relatively "trivial" problems but employ huge numbers of engineers. Just because the core concept is simple to explain doesn't mean it's easy.

randyrand · on June 13, 2017

Someone asked the other day how to parallelize gpu learning. I had never thought about the problem before, but still came up with and gave this as the most obvious way.

https://news.ycombinator.com/item?id=14510146

willvarfar · on June 13, 2017

Lots of clever things are simply self-evident in hindsight.

randyrand · on June 13, 2017

Someone asked 5 days ago on HN how to parallelize gpu learning. I had never thought about the problem before, but still came up with and gave this as the most obvious way. Took me maybe 20 seconds to think of.

https://news.ycombinator.com/item?id=14510146

So the fact that someone with little experience can come up with this 'clever' technique, means either I'm really clever or it's not that clever. I'll go with the latter.

Seanny123 · on June 13, 2017

very personally, clever means "well, that would have taken me a long time to figure out and code properly"

antirez · on June 13, 2017

Trivia: the Pieter in the paper is the one of the Redis fame.

pietern · on June 13, 2017

Hi Salvatore! :D

And there's even a tiny Redis dependency (optional though) in the code to generate these results. In particular the collective communication library needs a rendezvous phase where all nodes connect to their peers. Using Redis for this is one of the options. See: https://github.com/facebookincubator/gloo/tree/master/gloo/r...

antirez · on June 15, 2017

Hey Pieter! Wow cool :-) Thanks for the info. See you soon!

jamesblonde · on June 13, 2017

Some observations:

* for synchronous model-based distributed training to scale linearly, the time required to broadcast the model must be much smaller than the time required for a worker (GPU) to process a batch

* it's not strict synchronous training, as when gradients are computed at a worker, they are transmitted to all workers - so the driver doesn't have to send models to all 32 workers at the same time (8 GPUs per worker makes 256 GPUs in total).

* there are extremely large batch sizes (8192)

* it's a good network (50 Gb Ethernet, albeit not infiniband)

So, the time each worker spends training is much higher than the time spent broadcasting the model (which is quite small, ~100 MB, I think) to the workers in each iteration. For larger models with smaller batch sizes, this relationship would break down. The interesting contribution here is that you can have massive batch sizes, and Facebook provided a heuristic for adjusting the learning rate so that training converges with such massive batch sizes.
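A back-of-envelope version of that argument, as a sketch only. It uses the ~100 MB model size and 50 Gbit link mentioned above; the per-iteration compute time is a made-up placeholder, and the estimate ignores latency, protocol overhead, and the extra traffic of a real all-reduce:

```python
def transfer_seconds(model_bytes, link_gbit_per_s):
    """Time to push model_bytes over one link at the given line rate."""
    bits = model_bytes * 8
    return bits / (link_gbit_per_s * 1e9)


def comm_fraction(model_bytes, link_gbit_per_s, compute_seconds):
    """Share of one iteration spent communicating, if nothing overlaps."""
    comm = transfer_seconds(model_bytes, link_gbit_per_s)
    return comm / (comm + compute_seconds)


if __name__ == "__main__":
    model_bytes = 100e6       # ~100 MB, as guessed in the comment above
    link = 50                 # Gbit/s
    compute_seconds = 0.25    # hypothetical time to process one local batch
    print(transfer_seconds(model_bytes, link))
    print(comm_fraction(model_bytes, link, compute_seconds))
```

Shrinking `compute_seconds` (smaller local batches) or growing `model_bytes` pushes the fraction up, which is the breakdown described above.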

pietern · on June 13, 2017

Re: your second point, it is strictly synchronous, though since there are 8 GPUs per process (and thus 1 process per machine), the gradient reduction is done in 3 phases. First the gradients are reduced within the process, then across processes/machines, and then broadcast within each process.
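The three phases can be simulated in plain Python to see what each GPU ends up with. This is a toy model with nested lists standing in for machines and GPUs, not the Gloo or Caffe2 implementation, and the function name is made up:

```python
def hierarchical_allreduce(grads):
    """grads[m][g] is the gradient vector (list of floats) on GPU g of machine m.

    Returns the same nested shape, with every GPU holding the global sum.
    """
    # Phase 1: reduce within each machine (one partial sum per machine).
    local_sums = [[sum(vals) for vals in zip(*machine)] for machine in grads]
    # Phase 2: reduce across machines (this is the step that crosses the network).
    global_sum = [sum(vals) for vals in zip(*local_sums)]
    # Phase 3: broadcast the result back to every GPU inside each machine.
    return [[list(global_sum) for _ in machine] for machine in grads]


if __name__ == "__main__":
    grads = [
        [[1.0, 2.0], [3.0, 4.0]],   # machine 0, two GPUs
        [[5.0, 6.0], [7.0, 8.0]],   # machine 1, two GPUs
    ]
    print(hierarchical_allreduce(grads))
```

Only one vector per machine crosses the network in phase 2, which is why the per-machine reduction comes first.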

jamesblonde · on June 13, 2017

I misphrased that point, agreed. It's not classic driver-driven synchronous training, as you would do in tensorflow. It's using all-reduce (not available in tensorflow yet, i think).

quadrature · on June 13, 2017

relevant paper from facebook https://research.fb.com/publications/ImageNet1kIn1h/

boulos · on June 13, 2017

This is the URL that jonbaer originally submitted (sadly at an awkward time). I meant to send mail about it (there's no "Really! I vouch for this!" for low point stories that languish), but I see the result worked out anyway.

breatheoften · on June 13, 2017

I'd always conceptualized decreasing batch size as a performance/memory optimization to deal with the fact that datasets don't all fit into memory and to reduce overall training time. You look at batch_size samples and compute the sum of the gradient of the errors to update the network weights so as to reduce the error -- shouldn't a larger batch_size inherently provide more information about the optimal direction of the update?

It seems to my naive view like it should be "nice" from an accuracy perspective to look at more samples before making an adjustment to the network weights ...?

In general, does changing the batch_size hyperparameter make a lot of difference on different problems ...? Does the right value for batch size tend to be problem specific and/or network architecture specific?

IanCal · on June 13, 2017

> shouldn't a larger batch_size inherently provide more information about the optimal direction of the update?

Not necessarily, since a batch gradient output (as I understand it, and at least used to code it) all gets averaged together.

Consider standing in a valley with two equal hills either side of you. If you were to try one direction and see that climbing that way helps, you'd take a step that way. Then the next step would keep taking you up that hill.

Now, if you batched together two direction tests, what would happen? You'd average together your left and right and end up with moving nowhere. Having both at the same time doesn't give you better information about how you move if you only see the result after averaging.
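The valley picture in a few lines, as a toy illustration: two hypothetical samples whose gradients point in exactly opposite directions.

```python
def average_gradient(per_sample_grads):
    """Batch gradient as the plain mean of the per-sample gradients."""
    return sum(per_sample_grads) / len(per_sample_grads)


left_hill, right_hill = -1.0, 1.0

# One sample at a time: each step has a clear direction.
one_at_a_time = [average_gradient([left_hill]), average_gradient([right_hill])]

# Both samples in one batch: the directions cancel out.
batched = average_gradient([left_hill, right_hill])

if __name__ == "__main__":
    print(one_at_a_time, batched)
```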

This interestingly maps to something we see in humans, though I'm struggling to find a decent paper on it (from the PRISM lab in Birmingham, UK if anyone else has any luck; I think the person doing the research might have been called Chris). Simple adaptation tasks, in this case learning to control a joystick that has a clockwise/anticlockwise force applied to it, don't work well if you try to learn both one thing and its opposite straight away. However, sleeping in between learning each left you able to do both well. Perhaps these were early results, though.

Batch tradeoffs:

https://stats.stackexchange.com/questions/164876/tradeoff-ba...

https://arxiv.org/abs/1609.04836

jononor · on June 13, 2017

If this is the case, could one get improved learning by mixing large and small batches?

nullc · on June 13, 2017

[Warning: far from an expert here]

No, batches also help you escape local minima.

sxyuan · on June 13, 2017

[Another non-expert commenting]

Your comment didn't make sense to me at first, but I think I get it now. Even if you were able to fit the entire dataset into memory, batches are still a good idea, because optimizing on the entire dataset is non-convex and will likely lead you into a local minimum. However, what is a local minimum for one batch may not be a local minimum for the next batch, which helps you escape.

This explains why optimization gets harder with very large batch sizes - the gradients for different batches become more similar (as they resemble the "global" gradient more closely), so you become more susceptible to local minima. I think this also explains why the learning rate scaling helps - it increases the variance across gradients, and helps you escape local minima.
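The claim that gradients from larger batches look more alike can be checked with a small simulation. This is a sketch under a strong assumption: per-sample gradients are modelled as a true gradient of 1.0 plus unit Gaussian noise, which is not how a real network behaves.

```python
import random
import statistics


def batch_gradient_variance(batch_size, trials=2000, seed=0):
    """Variance of the averaged gradient across many random mini-batches."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Each sample's gradient: true gradient 1.0 plus unit-variance noise.
        batch = [rng.gauss(1.0, 1.0) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.pvariance(means)


if __name__ == "__main__":
    for size in (4, 16, 64):
        print(size, batch_gradient_variance(size))
```

In this model the variance falls roughly as 1/batch_size, so each update from a large batch carries less of the noise that the comment above credits with helping to escape local minima.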

breatheoften · on June 13, 2017

This explanation of why larger batches might make optimization harder makes a fair amount of sense to me. This comment on Quora helps a bit as well - with a slightly different approach to characterizing the trade off of batch size/accuracy/runtime performance.

https://www.quora.com/Intuitively-how-does-mini-batch-size-a...

I wonder if rather than computing a single gradient for a large batch you could simultaneously compute a gradient for the batch and for several subsets of the batch -- then pick or combine the gradient subset(s) that most differ from the full batch result. Not sure if that would work out to a computational efficiency gain.

Are there any optimizers that dynamically scale the batch size up/down based on an online metric?

oerpli · on June 13, 2017

One of the better articles I found about this topic when I've been learning for my exam: http://sebastianruder.com/optimizing-gradient-descent/index....

qeternity · on June 13, 2017

I can't seem to find it anywhere but what is the interconnect between servers being used? NVLink is used internally for GPU to GPU communication within a single box...correct? But this sounds like it takes a cluster of 32 of their 8 GPU Big Basin boxes.

pietern · on June 13, 2017

Interconnect between the servers is 50 Gbit Ethernet (see section 4 of the paper).

Ferofluid · on June 13, 2017

Something similar to Infiniband

EvgeniyZh · on June 13, 2017

Here is another paper on large batches https://arxiv.org/abs/1705.08741 It is better written and has more information IMHO

rlv-dan · on June 13, 2017

Does anyone know if there is any "consumer grade" image training kit out there? I'm thinking of software that you can train on your own images to put them into categories.

spuz · on June 13, 2017

Yes, you can use Tensorflow and Google's "inception" image recognition model to do this. The model by default is trained on the Imagenet database of images/categories, but Tensorflow allows you to retrain the last layer of the model on your own images to produce your own categorisation. Since you are only retraining the last layer of the model, you can easily do it within about 20 minutes on a laptop. See the tutorial here: https://www.tensorflow.org/tutorials/image_retraining

jonbaer · on June 13, 2017

https://www.clarifai.com/custom-training

personjerry · on June 13, 2017

How much time would it have taken using Torch instead of Caffe2? (I still don't understand which I should use...)

technics256 · on June 13, 2017

It depends on your application. For most general purposes caffe2 will be fine. For research and pushing the limits pytorch is your best bet.

horsecaptin · on June 13, 2017

Looks like Nvidia is doing a proper content marketing push to compete against AMD.

Eridrus · on June 13, 2017

Interestingly enough I think the people who gain the most from this paper are cloud providers. Very few orgs will buy 256 GPUs, but with linear scaling, renting them makes a lot of sense.

shezi · on June 13, 2017

There are "only" 32 GPUs in the cluster, with 8 workers per GPU.

dgacmu · on June 13, 2017

They had 32 servers in the cluster, each with 8 P100 GPUs. Each GPU was one "worker" in their parlance.

("How to train ResNet-50 in one hour on two million dollars of hardware." :-)