1. Larger batches require a larger learning rate. This paper shows that the learning rate can even scale linearly with the batch size, which leads to extremely large learning rates and batch sizes.
2. Larger batches make the initial phase of training difficult, so the paper proposes a warm-up period: during the first few epochs, the learning rate grows gradually from a small value to the larger target value.
But if you are not Google/Facebook/Amazon/Microsoft, the experimental setup is unrealistic for you. The best AWS instances don't come with a 50 Gbit network. For now, the rest of us will still stick to at most 8 GPUs on a single node, even if your soul screams distributed :/
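To make the two points concrete, here is a rough sketch of a learning-rate schedule with linear scaling plus warm-up. The function name `lr_at` is made up, and the defaults (reference rate 0.1 for a batch of 256, 5 warm-up epochs) are assumptions for illustration; check the paper for the exact values. It also leaves out any later learning-rate decay.

```python
def lr_at(epoch, batch_size, base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Learning rate at a (possibly fractional) epoch.

    Hypothetical helper: linear scaling with batch size plus linear warm-up.
    """
    # Linear scaling rule: multiply the reference rate by the batch-size ratio.
    target_lr = base_lr * batch_size / base_batch
    if epoch >= warmup_epochs:
        return target_lr
    # Warm-up: ramp linearly from the reference rate up to the scaled rate.
    return base_lr + (target_lr - base_lr) * epoch / warmup_epochs


for epoch in (0, 1, 2.5, 5, 10):
    print(epoch, round(lr_at(epoch, batch_size=8192), 3))
```

With a batch of 8192 the target rate is 32 times the reference rate, which is why starting at that rate from the first iteration can be unstable.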
I disagree that this is not an architecture for others. We are a research lab, and we're building a distributed cluster. Cheap infiniband. Lots of gtx1080Tis. It's not that expensive to have a 40-GPU cluster with 10 4U servers (about 100K Euro).
CPU is not that important.
A motherboard with PCIe 3.0 and 7 PCIe x16 slots is OK for at least 3 GPUs, since each GPU typically takes two slots. Then add a PSU of more than 1400 watts.
Here is a relatively cheap box that also has 8 disk slots (for hadoop):
https://exxactcorp.com/index.php/solution/solu_detail/320
Some people are using water cooling setups to get 7 GPUs on a 7-slot PCIe motherboard. You need to cut off the DVI connector that would otherwise occupy another slot, but overall it's not that time consuming and it reduces your networking needs.
The approach could prove useful in a single machine too -- for example, in cases where increasing batch size might improve the efficiency of shuffling data back and forth from main memory to GPU memory. This provides an easy recipe for trying larger batch sizes to increase GPU usage.
Concur. I threw 2x GTX 980s in one of our testbeds two years ago just to make it more generally useful, giving us a 20 GPU cluster. Next time we do it, I'll aim for 4x1080ti or the equivalent. It's painful to power and cool those in lower-end university machine rooms, but it's possible. 10x 4x1080ti machines is only about 15kW, which is high but not insane. It's also only about $60k USD. Also something that a pool of faculty can afford.
That's a consumer/gaming motherboard (asrock supercarrier), which doesn't have enough PCIe bandwidth to support the cards, but part of what we're researching are ways to reduce synchronization bandwidth. I wouldn't recommend that route as a general approach, though - not flexible enough for future uses. The 8x 1080ti Supermicro build posted a few days ago is probably a better choice: https://news.ycombinator.com/item?id=14508928
The problem is that one student can easily tie up the entire cluster for half the duration of her Ph.D. Machine learning people have voracious appetites for compute. :)
Nice. We have added support for GPUs to YARN in Hops Hadoop, and users have CPU and GPU quotas in our version of YARN. Jobs are launched as either Tensorflow on Spark or Distributed Tensorflow on YARN. If a user exceeds her quota, we currently let the job complete but give her a ticking off. We could also abort jobs on quota violations, but as you alluded to, after 5 days of training, aborting that job would be a world of pain for us.
It is a very interesting idea, considering how GD updates are inherently a serial operation. I've always wondered how they pulled it off in Spark... I've also read about Hogwild, which allows parallel SGD on sparse datasets...
Just because it's simple to explain at a high level doesn't make it trivial.
Plenty of theoretically trivial solutions to problems are absolute pains to implement. I mean there are entire companies that at their core solve relatively "trivial" problems but employ huge numbers of engineers. Just because the core concept is simple to explain doesn't mean it's easy.
Someone asked 5 days ago on HN how to parallelize gpu learning. I had never thought about the problem before, but still came up with and gave this as the most obvious way. Took me maybe 20 seconds to think of.
So the fact that someone with little experience can come up with this 'clever' technique means either I'm really clever or it's not that clever. I'll go with the latter.
And there's even a tiny Redis dependency (optional though) in the code to generate these results. In particular the collective communication library needs a rendezvous phase where all nodes connect to their peers. Using Redis for this is one of the options. See: https://github.com/facebookincubator/gloo/tree/master/gloo/r...
* for synchronous model-based distributed training to scale linearly, the time required to broadcast the model must be much smaller than the time required for a worker (GPU) to process a batch
* it's not strict synchronous training, as when gradients are computed at a worker, they are transmitted to all workers - so the driver doesn't have to send models to all 32 workers at the same time (8 GPUs per worker makes 256 GPUs in total).
* there are extremely large batch sizes (8192)
* it's a good network (50 Gb Ethernet, albeit not infiniband)
So, the time each worker spends training is much higher than the time spent broadcasting the model to the workers on each iteration, since the model is quite small (~100 MB, I think). For larger models with smaller batch sizes, this relationship would break down. The interesting contribution here is that you can have massive batch sizes, and Facebook provided a heuristic for adjusting the learning rate so that training still converges with such massive batch sizes.
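A back-of-the-envelope version of that argument, as a sketch only. The ~100 MB model size and 50 Gb/s link come from the comment above; the per-batch compute times are hypothetical placeholders, and the helper names are made up. It ignores latency, protocol overhead, the extra traffic of a real all-reduce, and any overlap of communication with backprop.

```python
def transfer_seconds(model_bytes, link_bits_per_second):
    """Idealized time to push one copy of the model over one link."""
    return model_bytes * 8 / link_bits_per_second


def comm_fraction(model_bytes, link_bits_per_second, compute_seconds):
    """Share of one iteration spent communicating, assuming no overlap."""
    t = transfer_seconds(model_bytes, link_bits_per_second)
    return t / (t + compute_seconds)


# ~100 MB model over a 50 Gb/s link.
print(transfer_seconds(100e6, 50e9))

# Hypothetical compute times per batch: the longer each worker computes,
# the smaller the share of time lost to communication.
for compute_seconds in (0.01, 0.1, 1.0):
    print(compute_seconds, comm_fraction(100e6, 50e9, compute_seconds))
```

A bigger model raises the transfer time and a smaller per-worker batch lowers the compute time, so both push the communication share up, which is the breakdown described above.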
Re: your second point, it is strictly synchronous, though since there are 8 GPUs per process (and thus 1 process per machine), the gradient reduction is done in 3 phases. First the gradients are reduced within each process, then across processes/machines, and finally the result is broadcast within each process.
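The three phases can be simulated in plain Python. This is only an illustration of the data flow, not the Gloo or NCCL API; `hierarchical_allreduce` is a made-up name, and a real implementation would run phase 2 over the network with a ring or similar algorithm.

```python
def hierarchical_allreduce(grads):
    """Simulate a 3-phase gradient sum.

    grads[m][g] is the gradient (a list of floats) held by GPU g on machine m.
    Returns the same nested shape with every GPU holding the global sum.
    """
    def vec_sum(vectors):
        return [sum(vals) for vals in zip(*vectors)]

    # Phase 1: reduce within each machine (one partial sum per machine).
    local = [vec_sum(machine) for machine in grads]
    # Phase 2: all-reduce across machines (every machine gets the global sum).
    total = vec_sum(local)
    # Phase 3: broadcast the result to every GPU inside each machine.
    return [[list(total) for _ in machine] for machine in grads]


# 2 machines x 2 GPUs, each GPU holding a 2-element gradient.
grads = [[[1.0, 2.0], [3.0, 4.0]],
         [[5.0, 6.0], [7.0, 8.0]]]
print(hierarchical_allreduce(grads))
```

Dividing the result by the total number of GPUs would give the averaged gradient. The point of the hierarchy is that only one partial sum per machine has to cross the slower inter-machine network.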
I misphrased that point, agreed. It's not classic driver-driven synchronous training, as you would do in TensorFlow. It's using all-reduce (not available in TensorFlow yet, I think).
This is the URL that jonbaer originally submitted (sadly at an awkward time). I meant to send mail about it (there's no "Really! I vouch for this!" for low point stories that languish), but I see the result worked out anyway.
I'd always conceptualized decreasing batch size as a performance/memory optimization to deal with the fact that datasets don't all fit into memory and to reduce overall training time. You look at batch_size samples and compute the sum of the gradient of the errors to update the network weights so as to reduce the error -- shouldn't a larger batch_size inherently provide more information about the optimal direction of the update?
It seems to my naive view like it should be "nice" from an accuracy perspective to look at more samples before making an adjustment to the network weights ...?
In general, does changing the batch_size hyperparameter make a lot of difference on different problems ...? Does the right value for batch size tend to be problem specific and/or network architecture specific?
> shouldn't a larger batch_size inherently provide more information about the optimal direction of the update?
Not necessarily, since the gradients in a batch (as I understand it, and at least as I used to code it) all get averaged together.
Consider standing in a valley with two equal hills either side of you. If you were to try one direction and see that climbing that way helps, you'd take a step that way. Then the next step would keep taking you up that hill.
Now, if you batched together two direction tests, what would happen? You'd average together your left and right and end up with moving nowhere. Having both at the same time doesn't give you better information about how you move if you only see the result after averaging.
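A toy version of the valley analogy, just to show the cancellation. The surface `height` and the helper `probe` are made up for illustration and have nothing to do with a real network.

```python
def height(x):
    return x * x  # valley floor at x = 0, equal hills on both sides


def probe(x, direction, step=1.0):
    """Signed move suggested by one test step: go that way if it climbs."""
    gain = height(x + direction * step) - height(x)
    return direction * step if gain > 0 else 0.0


moves = [probe(0.0, -1), probe(0.0, +1)]
single = moves[0]                   # act on one probe: step left
batched = sum(moves) / len(moves)   # average both probes first: no move
print(single, batched)
```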
This interestingly maps to something we see in humans, though I'm struggling to find a decent paper on it (from the PRISM lab in Birmingham, UK, if anyone else has any luck; I think the person doing the research might have been called Chris). Simple adaptation tasks, in this case learning to control a joystick that has a clockwise/anticlockwise force applied to it, don't work well if you try to learn both one thing and its opposite straight away. However, sleeping in between learning each left you able to do both well. Perhaps these were early results, though.
Your comment didn't make sense to me at first, but I think I get it now. Even if you were able to fit the entire dataset into memory, batches are still a good idea, because optimizing on the entire dataset is non-convex and will likely lead you into a local minimum. However, what is a local minimum for one batch may not be a local minimum for the next batch, which helps you escape.
This explains why optimization gets harder with very large batch sizes - the gradients for different batches become more similar (as they resemble the "global" gradient more closely), so you become more susceptible to local minima. I think this also explains why the learning rate scaling helps - it increases the variance across gradients, and helps you escape local minima.
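The "gradients become more similar" part is easy to check on a toy model. The sketch below assumes each per-sample gradient is a true value of 1.0 plus unit Gaussian noise, which is a big simplification of a real network; the function name is made up.

```python
import random
import statistics


def minibatch_gradient_variance(batch_size, trials=2000, seed=0):
    """Variance of the mean of `batch_size` noisy per-sample gradients."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Toy per-sample gradients: true gradient 1.0 plus unit Gaussian noise.
        batch = [1.0 + rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.variance(means)


small = minibatch_gradient_variance(8)
large = minibatch_gradient_variance(64)
print(small, large)  # the larger batch gives a much less noisy estimate
```

In this toy setting the variance of the averaged gradient falls roughly as 1/batch_size. The update step is the learning rate times that gradient, and multiplying a random quantity by k multiplies its variance by k squared, so scaling the learning rate linearly with the batch size more than restores the noise in each step.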
This explanation of why larger batches might make optimization harder makes a fair amount of sense to me. This comment on Quora helps a bit as well - with a slightly different approach to characterizing the trade off of batch size/accuracy/runtime performance.
I wonder if rather than computing a single gradient for a large batch you could simultaneously compute a gradient for the batch and for several subsets of the batch -- then pick or combine the gradient subset(s) that most differ from the full batch result. Not sure if that would work out to a computational efficiency gain.
Are there any optimizers that dynamically scale the batch size up/down based on an online metric?
I can't seem to find it anywhere but what is the interconnect between servers being used? NVLink is used internally for GPU to GPU communication within a single box...correct? But this sounds like it takes a cluster of 32 of their 8 GPU Big Basin boxes.
Does anyone know if there are any "consumer grade" image training kits out there? I'm thinking of software that you can train on your own images to put them into categories.
Yes, you can use Tensorflow and Google's "inception" image recognition model to do this. The model by default is trained on the Imagenet database of images/categories, but Tensorflow allows you to retrain the last layer of the model on your own images to produce your own categorisation. Since you are only retraining the last layer of the model, you can easily do it within about 20 minutes on a laptop. See the tutorial here: https://www.tensorflow.org/tutorials/image_retraining
Interestingly enough I think the people who gain the most from this paper are cloud providers. Very few orgs will buy 256 GPUs, but with linear scaling, renting them makes a lot of sense.