Annotated Bibliography

As our research group reads and discusses scientific papers in MLsec, we add an entry to this bibliography. We also curate a “top 5” list.

Top 5 Papers

Gilmer 2018 — Adversarial Examples

Gilmer, Justin, Ryan P. Adams, Ian Goodfellow, David Andersen, and George E. Dahl. “Motivating the Rules of the Game for Adversarial Example Research.” arXiv preprint arXiv:1807.06732 (2018).

Great use of realistic scenarios in a risk analysis. Hilariously snarky.

  • Representation

Jetley 2018 — On generalization and vulnerability

Jetley, Saumya, Nicholas A. Lord, and Philip H. S. Torr. “With Friends Like These, Who Needs Adversaries?” 32nd Conference on Neural Information Processing Systems. 2018.

Excellent paper. Driven by theory and demonstrated by experimentation: generalization in DCNs trades off against vulnerability.

  • Attack-Lit-Pointers

Papernot 2018 — Building Security In for ML (IT stance)

Papernot, Nicolas. “A Marauder’s Map of Security and Privacy in Machine Learning.” arXiv preprint arXiv:1811.01134 (2018).

Tainted only by an old school IT security approach, this paper aims at the core of #MLsec but misses the mark. Too much ops and not enough security engineering.

  • MLsec

Shumailov 2020 — Energy DoS attacks against NNs (uses GAs)

Shumailov, Ilia, Yiren Zhao, Daniel Bates, Nicolas Papernot, Robert Mullins, Ross Anderson. “Sponge Examples: Energy-Latency Attacks on Neural Networks.” arXiv preprint arXiv:2006.03463 (2020).

Excellent paper, very clear and well-stated. Availability attacks against DNNs. Makes use of GAs to evolve attack input. Energy consumption is the target.

  • MLsec
  • Attack-Lit-Pointers
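
For readers curious about the GA angle, here is a minimal, hypothetical sketch (not the authors’ code) of how one might evolve sponge inputs; estimate_energy, random_input, and mutate are stand-ins for whatever energy proxy and input representation you have.

```python
import random

def evolve_sponge_inputs(estimate_energy, random_input, mutate,
                         pop_size=50, generations=100):
    """Toy genetic algorithm: keep the inputs that make the target model
    burn the most energy (or latency), mutate the survivors, repeat.
    All three callables are hypothetical stand-ins."""
    population = [random_input() for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness = measured or estimated energy cost of one forward pass.
        population.sort(key=estimate_energy, reverse=True)
        survivors = population[: pop_size // 2]
        children = [mutate(random.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=estimate_energy)
```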

Yuan 2018 — Adversarial Examples

Yuan, Xiaoyong, Pan He, Qile Zhu, Xiaolin Li. “Adversarial Examples: Attacks and Defenses for Deep Learning.” arXiv preprint arXiv:1712.07107 (2018).

A solid paper with a stunning set of references. A good way to understand the adversarial example landscape.

  • Attack-Lit-Pointers

Other Papers

Antorán 2020 — Uncertainty

Antorán, J., Umang Bhatt, Tameen Adel, Adrian Weller, and José Miguel Hernández-Lobato. “Getting a CLUE: A Method for Explaining Uncertainty Estimates.” ICLR 2020 Workshop paper (2020).

Representation helps with the why of uncertainty. Little relevance to security. Error bars.

  • Engineering

Arora 2018 — Multiple Meanings

Arora, Sanjeev, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. “Linear algebraic structure of word senses, with applications to polysemy.” Transactions of the Association for Computational Linguistics 6 (2018): 483-495.

Structured representations that capture distributed sub-features (micro-topics) through ML. Beyond word2vec and GloVe, adding “semantics.”

  • Representation

Barreno 2010 — Fundamental work in MLsec

Barreno, Marco, Blaine Nelson, Anthony D. Joseph, J.D. Tygar. “The security of machine learning.” Machine Learning, 81:2, pp. 121-148 (November 2010).

Solid but dated work with lots of fundamentals. Made harder to grasp by mixing two issues: ML FOR security and security OF ML. Untangling these things is critical. (Also see their 2006 paper.)

  • MLsec

Bellamy (IBM) 2018 — IBM User Manual

Bellamy, Rachel, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, Seema Nagar, Karthikeyan Natesan Ramamurthy, John Richards, Diptikalyan Saha, Prasanna Sattigeri, Moninder Singh, Kush R. Varshney, and Yunfeng Zhang. “AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias” arXiv preprint arXiv:1810.01943 (2018).

Kind of like reading a mashup of a user manual and a marketing glossy. Nothing at all about making actual bias decisions. Bag of tools described.

  • Engineering

Biggio 2018 — Biggio on Adversarial Machine Learning

Biggio, Battista, and Fabio Roli. “Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning.” arXiv preprint arXiv:1712.03141 (2018).

Myopia abounds. This is basically a review paper. (Very defensive of prior work by the author.)

  • Attack-Lit-Pointers

Buchanan 2020 — National Security Policy

Buchanan, Ben. “A National Security Research Agenda for Cybersecurity and Artificial Intelligence.” CSET Policy Brief (2020).

Good work with some basic confusion between security OF ML (what BIML does) and ML FOR security. ML is not a magic force multiplier. The #MLsec section is OK but too heavy on adversarial examples.

  • Policy

Carlini 2019 — Memorization and Data Leaking

Carlini, Nicholas, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. “The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks.” arXiv preprint arXiv:1802.08232 (2019).

Clear, cogent and fairly simple. Great results. Protecting secrets in ML data.

  • Attack-Lit-Pointers
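
The core trick, as we read it: plant random “canary” secrets in the training data, then measure how strongly the trained model prefers the true canary over alternative candidates. A rough sketch of that exposure measurement follows; model_log_perplexity is a hypothetical callable and the candidate set is a stand-in.

```python
import math

def exposure(model_log_perplexity, canary, candidates):
    """Sketch of the exposure metric (our paraphrase):
    exposure = log2(|candidate space|) - log2(rank of the true canary),
    where rank is by model log-perplexity (lower = more memorized).
    `candidates` must include the canary itself."""
    ranked = sorted(candidates, key=model_log_perplexity)  # most memorized first
    rank = ranked.index(canary) + 1
    return math.log2(len(candidates)) - math.log2(rank)
```

Higher exposure means the canary (and, by analogy, a real secret) is easier to extract from the trained model.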

Chen 2017 — Backdoor attacks coined

Chen, Xinyun, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. “Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning.” arXiv preprint arXiv:1712.05526 (2017).

A badly written and loosely constructed paper that introduces the (poorly chosen) “backdoor” terminology. The work is about data poisoning attacks.

  • Attack-Lit-Pointers

Christiansen 2016 — Language Representation and Structure

Christiansen, Morten H., and Nick Chater. “The Now-or-Never bottleneck: A fundamental constraint on language.” Behavioral and Brain Sciences 39 (2016).

Too much psychology and not enough ML. This paper is about context in language representation, including look-ahead and structured patterns. The main question: how big is your buffer?

  • AI-Philosophy
  • Representation

Dai 2019 — Transformer-XL

Dai, Zihang, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. “Transformer-XL: Attentive language models beyond a fixed-length context.” arXiv preprint arXiv:1901.02860 (2019).

Getting past fixed-length context through various kludges. Recursive feedback to represent previous state.

  • Language-Processing

De Deyne 2020 — Psych Rep Grounding

De Deyne, Simon, Danielle Navarro, Guillem Collell, and Andrew Perfors. “Visual and Affective Grounding in Language and Mind.” PsyArXiv preprint q97f8 (2020).

Too much insider psych gobbledygook in this paper. Lots of results, very poorly presented. An important subject best approached another way.

  • Representation
  • Psychology

Devlin 2018 — BERT (transformers) and pre-training

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).

Input windows and representation. Precomputing leads to transfer attacks.

  • Language-Processing

Dhariwal 2020 — Music generation

Dhariwal, Prafulla, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever. “Jukebox: A Generative Model for Music.” arXiv preprint arXiv:2005.00341 (2020).

Generating music with a very weird model. Training a model on raw audio. Also see https://openai.com/blog/jukebox/

  • Music-Processing

Eniser 2020 — Adversarial Image Defense

Eniser, Hasan Ferit, Maria Christakis, and Valentin Wüstholz. “RAID: Randomized Adversarial-Input Detection for Neural Networks.” arXiv preprint arXiv:2002.02776 (2020).

This paper describes a very narrow defense against adversarial image input. Experiments are very arbitrary and lack focus. One interesting note is that the defense leverages activation patterns.

  • Attack-Lit-Pointers

Eykholt 2018 — Physical Attacks on Vision

Eykholt, Kevin, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. “Robust physical-world attacks on deep learning visual classification.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1625-1634. 2018.

Tape-on-the-stop-sign paper. Fairly naive attacks on non-robust representations, meant to be psychologically plausible in that humans won’t notice. Many “empirical” settings.

  • Attack-Lit-Pointers

Gamaleldin 2018 — Adversarial Reprogramming

Elsayed, Gamaleldin F., Ian Goodfellow, and Jascha Sohl-Dickstein. “Adversarial Reprogramming of Neural Networks.” arXiv preprint arXiv:1806.11146 (2018).

A very interesting paper well worth a read, though the work is very weird. The idea of reprogramming existing ML tech stacks in an adversarial fashion is powerful. Given a Turing complete language construct, all kinds of terrible shenanigans could result. Imagine ransomware running on photo recognition ML machines.

  • Attack-Lit-Pointers

Goodman 2019 — Wagner on Adversarial Testing

Goodman, Dan, and Tao Wei. “Cloud-based Image Classification Service Is Not Robust To Simple Transformations: A Forgotten Battlefield.” arXiv preprint arXiv:1906.07997 (2019).

Naive experiment on cloud services using well-known methods. Real result: hints at structured noise vs statistical noise as attack type. Representation matters.

  • Attack-Lit-Pointers

GPT-3 2020 — GPT-3 Launch Paper

Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. “Language Models are Few-Shot Learners” arXiv preprint arXiv:2005.14165 (2020).

Autoregressive language model that predicts the next token. Memorization?! Astounding results. Section 6 is a basic treatment of MLsec issues by Ariel Herbert-Voss. A little too much ass-covering on the bias front, but well worth thinking about.

  • MLsec
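
For readers new to the model family, “autoregressive” just means repeatedly predicting the next token given everything generated so far. A generic greedy-decoding sketch (not OpenAI’s API; next_token_distribution is a hypothetical stand-in):

```python
def generate(next_token_distribution, prompt_tokens, max_new_tokens=50):
    """Greedy autoregressive decoding sketch. `next_token_distribution`
    is a hypothetical callable returning a {token: probability} dict
    for the next position given the tokens so far."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        tokens.append(max(dist, key=dist.get))  # pick the most likely next token
    return tokens
```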

Graves 2014 — RNN Handwriting Generation

Graves, Alex. “Generating Sequences With Recurrent Neural Networks.” arXiv preprint arXiv:1308.0850 (2014).

Engineering tract documenting an auto-regressive model and various kludges. Reads like a thesis. Kludge heavy.

Gu 2019 — BadNets: Classic Data Poisoning

Gu, Tianyu, Brendan Dolan-Gavitt, Siddharth Garg. “BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain” arXiv preprint arXiv:1708.06733 (2019).

A paper about Trojan functionality. Solidly written and easy to understand. This is classic data poisoning.

  • Attack-Lit-Pointers

Henderson 2018 — Hacking Around with ML

Henderson, Peter, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. “Deep Reinforcement Learning that Matters.” arXiv preprint arXiv:1709.06560 (2018).

We tweaked lots of things and found some stuff. Things matter. How you measure stuff also matters.

  • Representation

Hinton 2015 — Review

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. “Deep learning.” Nature 521, no. 7553 (2015): 436.

This review from Nature covers the basics in an introductory way. Some hints at representation as a thing. Makes clear that more data and faster CPUs account for the resurgence.

  • Review
  • AI-Philosophy

Hoffmann 2019 — Fairness Politics

Hoffmann, Anna. “Where fairness fails: data, algorithms, and the limits of antidiscrimination discourse.” Information, Communication & Society 22, no. 7 (2019): 900-915.

This paper is all problems and no solutions couched in high academic blather. A (very negative) overview of politics and ML/AI for an audience of insiders.

  • Review
  • AI-Philosophy

Jacobsen 2019 — Adversarial Examples

Jacobsen, Jörn-Henrik, Jens Behrmann, Richard Zemel, and Matthias Bethge. “Excessive Invariance Causes Adversarial Vulnerability.” arXiv preprint arXiv:1811.00401v2 (2019).

Mathematical explanation of adversarial vulnerability space. Includes home-brew network and analysis set.

  • Representation

Jagielski 2018 — Data Poisoning

Jagielski, Matthew, Alina Oprea, Battista Biggio, Chang Liu, Cristina Nita-Rotaru, and Bo Li. “Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning.” arXiv preprint arXiv:1804.00308 (2018).

A solid introduction to the data poisoning subfield. This is a critical category of ML attacks. See the BIML ML attack taxonomy.

  • Attack-Lit-Pointers

Jha 2019 — (Weak) Adversarial Defense

Jha, Susmit, Sunny Raj, Steven Lawrence Fernandes, Sumit Kumar Jha, Somesh Jha, Gunjan Verma, Brian Jalaian, and Ananthram Swami. “Attribution-driven Causal Analysis for Detection of Adversarial Examples.” arXiv preprint arXiv:1903.05821 (2019).

Treating pixels in an image as very small “features,” this work tries to kill important features that drive too much of the output (in some sense weakening the natural representation). This kind of masking makes the networks perform poorly. Pretty dumb.

  • Attack-Lit-Pointers

Jin 2020 — Adversarial Text

Jin, Di, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. “Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment.” arXiv preprint arXiv:1907.11932 (2020).

A cute but not very profound paper. Focuses on attack category #1 (adversarial examples) approached through text processing. BERT is an important language processing model and serves as the target. Low detectability plays a role in the attack model.

  • Attack-Lit-Pointers

Kazemi 2019 — Time

Kazemi, Seyed, Rishab Goel, Sepehr Eghbali, Janahan Ramanan, Jaspreet Sahota, Sanjay Thakur, Stella Wu, Cathal Smyth, Pascal Poupart, Marcus Brubaker. “Time2Vec: Learning a Vector Representation of Time” arXiv preprint arXiv:1907.05321 (2019).

Very abstract treatment of time represented as a learned periodic vector. More engineering than ML.

  • Representation
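
For reference, our paraphrase of the paper’s representation: one learned linear component plus k learned periodic components (sine in the paper), with the frequencies and phase shifts learned from data.

```latex
% Time2Vec of a scalar time \tau (our paraphrase; \omega_i and \varphi_i are learned):
\mathrm{t2v}(\tau)[i] =
\begin{cases}
\omega_i \tau + \varphi_i, & i = 0,\\
F(\omega_i \tau + \varphi_i), & 1 \le i \le k,
\end{cases}
\qquad F = \sin \ \text{in the paper.}
```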

Kilbertus 2018 — Learning and Causality

Kilbertus, Niki, Giambattista Parascandolo, Bernhard Schölkopf. “Generalization in anti-causal learning” arXiv preprint arXiv:1812.00524 (2018).

A vague position paper that is more philosophy than anything else. Emphasizes the importance of generation (and causal models). Representation issues around continuity are explored.

  • Representation

Koh 2017 — Influence Functions

Koh, Pang Wei and Percy Liang. “Understanding Black-box Predictions via Influence Functions” arXiv preprint arXiv:1703.04730 (2017).

Understanding adversarial inputs. Getting the “same” result through diverse paths. Influence functions, representation, and positive/negative data points.

  • Representation
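
The central quantity, paraphrased from the paper: the effect of upweighting a training point z on the loss at a test point, computed from a second-order approximation around the learned parameters.

```latex
% Influence of upweighting training point z on the test loss (our paraphrase):
\mathcal{I}_{\mathrm{up,loss}}(z, z_{\mathrm{test}})
  = -\,\nabla_{\theta} L(z_{\mathrm{test}}, \hat{\theta})^{\top}
     H_{\hat{\theta}}^{-1}\,
     \nabla_{\theta} L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta}).
```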

Krizhevsky 2012 — Convolutional Nets (ReLU)

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” Advances in Neural Information Processing Systems. 2012.

Elegant series of hacks to reduce overfitting. A bit of hand waving. Reference to CPU speed and huge data sets. Depth is important, but nobody knows why.

  • Time
  • Representation

Kurita 2020 — Transfer attacks (backdoors)

Kurita, Keita, Paul Michel, Graham Neubig. “Weight Poisoning Attacks on Pre-trained Models” arXiv preprint arXiv:2004.06660 (2020).

Transfer attacks (one of the six BIML attack categories). Very basic results. Fairly obvious. Simple. Nice. Clear. (The only bug is poor terminology…misuse of “backdoor” which has crept into the MLsec literature.)

  • Attack-Lit-Pointers

Lake 2015 — Cogsci

Lake, Brenden, Ruslan Salakhutdinov, Joshua Tenenbaum. “Human-level concept learning through probabilistic program induction.” Science, vol. 350, no. 6266 (2015): 1332-1338.

Representation, models, and one-shot learning. A study promoting BPL.

  • AI-Philosophy
  • Representation

Lake 2017 — Recurrent Net Weakness

Lake, Brenden, and Marco Baroni. “Still not systematic after all these years: On the compositional skills of sequence-to-sequence recurrent networks.” (2018).

Naive micro domain with misleading maps into human semantics (movement). An artificial attack angle, with structure as the weapon.

  • AI-Philosophy
  • Representation

Lake 2020 — Concepts

Lake, Brenden, and Gregory L. Murphy. “Word meaning in minds and machines.” arXiv preprint arXiv:2008.01766 (2020).

Super clear (maybe obvious) treatment of fluid concepts a la dughof. Getting past the bag of words.

  • AI-Philosophy
  • Representation

Legg 2007 — Universal Intelligence Definition

Legg, Shane, and Marcus Hutter. “Universal Intelligence: A Definition of Machine Intelligence.” arXiv preprint arXiv:0712.3329 (2007).

This is as much a philosophy paper as it is an ML paper. Well worth a read, especially if you are not familiar with philosophy of mind and how it pertains to AI. Defines a (non-computable) measure of intelligence and then tries to move that to something useful.

  • AI-Philosophy
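
The definition in question, roughly: expected cumulative reward across all computable environments, weighted by simplicity (Kolmogorov complexity), which is exactly what makes it non-computable.

```latex
% Legg-Hutter universal intelligence of an agent \pi (our paraphrase):
\Upsilon(\pi) = \sum_{\mu \in E} 2^{-K(\mu)}\, V_{\mu}^{\pi}
% E: set of computable environments; K(\mu): Kolmogorov complexity of \mu;
% V_{\mu}^{\pi}: expected cumulative reward of agent \pi in environment \mu.
```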

Marcus 2018 — AI Perspective on ML

Marcus, Gary. “Deep learning: A critical appraisal.” arXiv preprint arXiv:1801.00631 (2018).

General overview tainted by an old-school AI approach. Makes clear that representation is essential yet overlooked. Some failure conditions noted, at a philosophical level.

  • AI-Philosophy

McGuffie 2020 — Terrorism policy BS

McGuffie, Kris, and Alex Newhouse. “The Radicalization Risks of GPT-3 and Advanced Neural Language Models.” Technical Report (2020).

A lightly reasoned paper that claims that GPT-3 capabilities (which are apparently assumed to have passed the Turing Test) will lead to more radicalization. Grab your pearls.

  • Policy

Merrill 2020 — RNN Theory

Merrill, William, Gail Weiss, Yoav Goldberg, Roy Schwartz, Noah A. Smith, Eran Yahav. “A Formal Hierarchy of RNN Architectures” arXiv preprint arXiv:2004.08500 (2020).

A CS theory paper that combines two lines of research: rational recurrence and sequential NNs as automata. Continuous inputs may be a problem.

  • Representation

Mitchell 2019 — Model Cards

Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. “Model Cards for Model Reporting.” arXiv preprint arXiv:1810.03993 (2019).

A mix of sociology and political correctness with engineering transparency. Human-centric models emphasized.

  • Engineering

Mnih 2013 — Atari

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. “Playing Atari with deep reinforcement learning.” arXiv preprint arXiv:1312.5602 (2013).

An application of convolutional nets where the game representation has been shoved through a filter. Some questions remain open regarding randomness in the games (making them very hard to learn): not dice rolling for turns, but random behavior that is cryptographically unpredictable. This paper made a bigger splash than it likely warranted.

  • Games

Oh 2018 — Reversing NNs through queries

Oh, Seong Joon, Max Augustin, Bernt Schiele, and Mario Fritz. “Towards Reverse-Engineering Black-Box Neural Networks.” arXiv preprint arXiv:1711.01768 (2018).

A goofy, clever, interesting paper that complements Wang. Well-written but not too deep.

  • Engineering

Peters 2018 — ELMo

Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. “Deep contextualized word representations.” arXiv preprint arXiv:1802.05365 (2018).

Important seminal work on ELMo. Some echoes of SDM and highly distributed representation power.

  • Representation

Phillips 2011 — Racism

Phillips, P. Jonathon, Fang Jiang, Abhijit Narvekar, Julianne Ayyad, and Alice J. O’Toole. “An other-race effect for face recognition algorithms.” ACM Transactions on Applied Perception (TAP) 8, no. 2 (2011): 14.

This paper is pretty stupid. The result is simply “when your data are racist, your system will be too,” which is beyond obvious to anyone who knows how ML works. This is what happens when psych people write about ML instead of CS people.

  • Sociology

Quinn 2017 (also mm17) — Dog Walker

Quinn, Max H., Erik Conser, Jordan M. Witte, and Melanie Mitchell. “Semantic Image Retrieval via Active Grounding of Visual Situations.” arXiv preprint arXiv:1711.00088 (2017).

Building up representations with a hybrid Copycat/NN model. Hofstadterian model. Time as an essential component in building up a representation.

  • AI-Philosophy
  • Representation

Rahwan 2019 — Towards a Study of Machine Behavior

Rahwan, Iyad, Manuel Cebrian, Nick Obradovich, Josh Bongard, Jean-François Bonnefon, Cynthia Breazeal, Jacob W. Crandall, Nicholas A. Christakis, Iain D. Couzin, Matthew O. Jackson, Nicholas R. Jennings, Ece Kamar, Isabel M. Kloumann, Hugo Larochelle, David Lazer, Richard McElreath, Alan Mislove, David C. Parkes, Alex ‘Sandy’ Pentland, Margaret E. Roberts, Azim Shariff, Joshua B. Tenenbaum & Michael Wellman. “Machine behavior.” Nature 568 (2019): 477-486.

Social science on machines. Very clear treatment. Trinity of trouble hinted at. Good analogs for security. Is ML code/data open source or not?

  • Sociology

Ribeiro 2020 — ATCG for NNs

Ribeiro, Marco T., Tongshuang Wu, Carlos Guestrin, Sameer Singh. “Beyond Accuracy: Behavioral Testing of NLP models with CheckList” arXiv preprint arXiv:2005.04118 (2020).

Very basic approach to black-box ATCG that begins to ask WHAT exactly should be tested and how to get past accuracy. Obvious and fairly shallow from a testing perspective.

  • Engineering

Schmidhuber 2010 — Creativity

Schmidhuber, Jürgen. “Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010).” Technical Report (2010).

Post-facto justification of “the thing I built.” An overview with interesting mappings to aesthetics and self-motivation. The lossless compression angle is weird. Flirts with innocent crackpotism.

  • Creativity

Sculley 2015 — Software Engineering Would Help

Sculley, David, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. “Hidden technical debt in machine learning systems.” In Advances in neural information processing systems, pp. 2503-2511. 2015.

Random kludges built of interlocked pieces and parts are a bad idea. This applies to ML as well. Light on analysis and misdirected in focus.

  • Engineering

Sculley-ccard 2014 — Technical Debt

Sculley, D., Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. “Machine learning: The high interest credit card of technical debt.” (2014).

A diatribe against deadlines and just making stuff work. Naive criticism of flaws.

  • Engineering
  • Representation

Sculley 2018 — Empirical Rigor in ML Research

Sculley, D., Jasper Snoek, Ali Rahimi, and Alex Wiltschko. “Winner’s Curse? On Pace, Progress, and Empirical Rigor.” ICLR 2018 Workshop paper (2018).

Argues for a scientific approach. General and pretty obvious.

  • Engineering

Shwartz-Ziv 2017 — Representation

Shwartz-Ziv, Ravid, and Naftali Tishby. “Opening the black box of Deep Neural Networks via Information.” arXiv preprint arXiv:1703.00810 (2017).

An opaque paper on representation. Pretty far afield from security.

  • Representation

Shankar 2020 — Microsoft on MLsec

Ram Shankar Siva Kumar, Magnus Nyström, John Lambert, Andrew Marshall, Mario Goertzel, Andi Comissoneru, Matt Swann, Sharon Xia “Adversarial Machine Learning — Industry Perspectives” arXiv preprint arXiv:2002.05646 (2020).

Microsoft’s first stab at Threat Modeling for ML. Problems with nomenclature are par for the course for Microsoft (e.g., “adversarial ML” should be “MLsec”). This is a solid start but needs deeper thought. More emphasis on design would help. Also see the related BIML blog entry.

  • Engineering
  • MLsec

Shu 2020 — Disentanglement

Shu, Rui, Yining Chen, Abhishek Kumar, Stefano Ermon, Ben Poole “Weakly Supervised Disentanglement with Guarantees” arXiv preprint arXiv:1910.09772 (2020).

A complex paper on representation. Worth a close reading.

  • Representation

Silver 2017 — AlphaGo

Silver, David, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert et al. “Mastering the game of Go without human knowledge.” Nature 550, no. 7676 (2017): 354.

AlphaGo trains itself by playing itself. Surprising and high-profile results. Monte Carlo tree search seems to underlie the results (which representations are amenable to that kind of search?). Unclear how general these results are or if they only apply to certain games with fixed rules and perfect knowledge.

  • Games

Slack 2020 — Adversarial Classifiers

Slack, Dylan, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. “Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods” arXiv preprint arXiv:1911.02508 (2020).

Adversarial classifiers with a focus on ML bias including racism and sexism in black box models.

  • Attack-Lit-Pointers

Springer 2018 — Sparse Coding is Good

Springer, Jacob M., et al. “Classifiers Based on Deep Sparse Coding Architectures are Robust to Deep Learning Transferable Examples.” arXiv preprint (2018).

Important theory, but silly experiment. Hints at the importance of context, concept activation, and dynamic representation. Explores limits of transfer attacks WRT representation

  • Representation

Stretcu 2020 — Curriculum Learning

Stretcu, Otilia, Emmanouil Antonios Platanios, Tom Mitchell, Barnabás Póczos. “Coarse-to-Fine Curriculum Learning for Classification.” ICLR 2020 Workshop paper (2020).

Ties to error-making, confusion matrices, and representation

  • Representation

Sundararajan 2017 — Explaining Networks

Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. “Axiomatic Attribution for Deep Networks” arXiv preprint arXiv:1703.01365 (2017).

A strangely-written paper trying to get to the heart of describing why a network does what it does. Quirky use of mathematical style. Hard to understand and opaque.

  • Representation
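
For reference, the attribution the paper axiomatizes (integrated gradients), in our paraphrase: feature i’s share of the output F is the path integral of the gradient along the straight line from a baseline x' to the input x.

```latex
% Integrated gradients attribution for feature i, relative to baseline x'
% (our paraphrase of the paper's definition):
\mathrm{IG}_i(x) = (x_i - x'_i)\int_{0}^{1}
  \frac{\partial F\!\big(x' + \alpha\,(x - x')\big)}{\partial x_i}\, d\alpha
```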

Tenenbaum 2011 — Review

Tenenbaum, Joshua B., Charles Kemp, Thomas L. Griffiths, and Noah D. Goodman. “How to Grow a Mind: Statistics, Structure, and Abstraction.” Science, vol. 331, no. 6022 (2011): 1279-1285.

A practical philosophy of AI paper focused on bridging the usual symbolic vs sub-symbolic chasm. Overfocus on HBMs, but worth a read to understand the role that structure plays in intelligence and representation.

  • AI-Philosophy

Udrescu 2020 — AI Feynman

Udrescu, Silviu-Marian, Max Tegmark. “AI Feynman: a Physics-Inspired Method for Symbolic Regression” arXiv preprint arXiv:1905.11481 (2020).

Crazy. Interesting use of NN to find simplicity for physics equations. NN vs GA battles.

  • Representation

Vaswani 2017 — BERT precursor

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention Is All You Need.” 31st Conference on Neural Information Processing Systems. 2017.

BERT precursor.

  • Attack-Lit-Pointers

Wallace 2020 — Attacking Machine Translation

Wallace, Eric, Mitchell Stern, and Dawn Song. “Imitation Attacks and Defenses for Black-box Machine Translation Systems.” arXiv preprint arXiv:2004.15015 (2020).

Attacking machine translation: (1) distill the model by query (cloning), (2) use the distilled version as a whitebox, (3) a defense that fails. (Attacks Bing and SYSTRAN. Real systems!) See the sketch below for the cloning step.

  • Attack-Lit-Pointers
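
A hypothetical sketch of the cloning step (step 1), not the authors’ code: label a monolingual corpus by querying the black-box translation API, then train an imitation model on those pairs so it can stand in as a whitebox target. All three arguments are stand-ins.

```python
def clone_translation_model(blackbox_translate, monolingual_corpus, imitation_model):
    """Distill-by-query sketch. `blackbox_translate`, `monolingual_corpus`,
    and `imitation_model` are hypothetical; any seq2seq trainer would do."""
    pairs = [(src, blackbox_translate(src)) for src in monolingual_corpus]  # label by query
    imitation_model.train(pairs)  # supervised imitation of the black box
    return imitation_model  # usable as a whitebox for gradient-based attacks
```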

Wang 2020 — Stealing Hyperparameters

Wang, Binghui, and Neil Gong. “Stealing Hyperparameters in Machine Learning.” arXiv preprint arXiv:1802.05351 (2020).

Fairly trivial, poorly motivated threat model. The notion of getting “free cycles” is not too impressive. Good history overview of ML security.

  • Attack-Lit-Pointers

Wang 2018 — Transfer Learning Attacks

Wang, Bolun, Yuanshun Yao, Bimal Viswanath, Haitao Zheng, and Ben Y. Zhao. “With great training comes great vulnerability: Practical attacks against transfer learning.” In 27th USENIX Security Symposium (USENIX Security 18), pp. 1281-1297. 2018.

Attacks against transfer learning in cascaded systems. If the set of all trained networks is small, this work holds water. “Empirical” settings. Some NNs highly susceptible to tiny noise. Good use of confusion matrix. Dumb defense through n-version voting.

  • Attack-Lit-Pointers

Witty 2019 — Causal Inference

Witty, Sam, Alexander Lew, David Jensen, Vikash Mansinghka. “Bayesian causal inference via probabilistic program synthesis” arXiv preprint arXiv:1910.14124 (2019).

Interesting paper about a toy problem. Not much of a tutorial. Doesn’t really stand on its own…so more of a teaser.

  • AI-Philosophy

Videos and Popular Press

Ian Goodfellow gives a talk about adversarial examples. Lots of solid thinking about the subject.

James Mickens (Harvard University), 27th USENIX Security Symposium: “Q: Why Do Keynote Speakers Keep Suggesting That Improving Security Is Possible? A: Because Keynote Speakers Make Bad Life Decisions and Are Poor Role Models.”

Ali Rahimi’s talk at NIPS (NIPS 2017 Test-of-Time Award presentation)

Ingredients of Intelligence (video): Brenden Lake explains why he builds computer programs that seek to mimic the way humans think. Brenden Lake, NYU, March 26, 2018 | EmTech Digital


Douglas R. Hofstadter, “The Shallowness of Google Translate” from the Atlantic Monthly