What algorithms can transformers learn? a study in length generalization H Zhou, A Bradley, E Littwin, N Razin, O Saremi, J Susskind, S Bengio, ... arXiv preprint arXiv:2310.16028, 2023 | 85 | 2023 |
Tensor programs IIb: Architectural universality of neural tangent kernel training dynamics G Yang, E Littwin International Conference on Machine Learning, 11762-11772, 2021 | 65 | 2021 |
The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon V Thilak, E Littwin, S Zhai, O Saremi, R Paiss, J Susskind arXiv preprint arXiv:2206.04817, 2022 | 47 | 2022 |
Stabilizing transformer training by preventing attention entropy collapse S Zhai, T Likhomanenko, E Littwin, D Busbridge, J Ramapuram, Y Zhang, ... International Conference on Machine Learning, 40770-40803, 2023 | 46 | 2023 |
Biometric authentication techniques DS Prakash, LE Ballard, JV Hauck, F Tang, E Littwin, PKA Vasu, G Littwin, ... US Patent 10,929,515, 2021 | 36 | 2021 |
The multiverse loss for robust transfer learning E Littwin, L Wolf Proceedings of the IEEE Conference on Computer Vision and Pattern …, 2016 | 34 | 2016 |
On infinite-width hypernetworks E Littwin, T Galanti, L Wolf, G Yang Advances in neural information processing systems 33, 13226-13237, 2020 | 30 | 2020 |
Transformers learn through gradual rank increase E Abbe, S Bengio, E Boix-Adsera, E Littwin, J Susskind Advances in Neural Information Processing Systems 36, 2024 | 28 | 2024 |
Tensor programs IVb: Adaptive optimization in the infinite-width limit G Yang, E Littwin arXiv preprint arXiv:2308.01814, 2023 | 18 | 2023 |
The loss surface of residual networks: Ensembles and the role of batch normalization E Littwin, L Wolf arXiv preprint arXiv:1611.02525, 2016 | 15 | 2016 |
Regularizing by the variance of the activations' sample-variances E Littwin, L Wolf Advances in Neural Information Processing Systems 31, 2018 | 12 | 2018 |
When can transformers reason with abstract symbols? E Boix-Adsera, O Saremi, E Abbe, S Bengio, E Littwin, J Susskind arXiv preprint arXiv:2310.09753, 2023 | 10 | 2023 |
Collegial ensembles E Littwin, B Myara, S Sabah, J Susskind, S Zhai, O Golan Advances in Neural Information Processing Systems 33, 18738-18748, 2020 | 9 | 2020 |
Adaptive Optimization in the ∞-Width Limit E Littwin, G Yang The Eleventh International Conference on Learning Representations, 2023 | 7 | 2023 |
Spherical embedding of inlier silhouette dissimilarities E Littwin, H Averbuch-Elor, D Cohen-Or Proceedings of the IEEE Conference on Computer Vision and Pattern …, 2015 | 7 | 2015 |
On random kernels of residual architectures E Littwin, T Galanti, L Wolf Uncertainty in Artificial Intelligence, 897-907, 2021 | 6 | 2021 |
LiDAR: Sensing Linear Probing Performance in Joint Embedding SSL Architectures V Thilak, C Huang, O Saremi, L Dinh, H Goh, P Nakkiran, JM Susskind, ... arXiv preprint arXiv:2312.04000, 2023 | 5 | 2023 |
Vanishing gradients in reinforcement finetuning of language models N Razin, H Zhou, O Saremi, V Thilak, A Bradley, P Nakkiran, J Susskind, ... arXiv preprint arXiv:2310.20703, 2023 | 5 | 2023 |
What Algorithms can Transformers Learn H Zhou, A Bradley, E Littwin, N Razin, O Saremi, J Susskind, S Bengio, ... A Study in Length Generalization, 1-39, 2023 | 5 | 2023 |
Learning representation from neural Fisher kernel with low-rank approximation R Zhang, S Zhai, E Littwin, J Susskind arXiv preprint arXiv:2202.01944, 2022 | 5 | 2022 |