⛓️🧠 Thoughts on chain of thought
As a data scientist, I have spent many years tuning machine learning models by adjusting hyper-parameters, analyzing metrics, creating learning schedules, and even experimenting with initializations and dropout rates to assess a model's performance and ability to generalize. However, with the introduction of multi-billion-parameter large language models and the field of prompt engineering, I now find myself enjoying an entry point to machine learning that feels more like teaching than tuning.
Thinking before you speak
Prompt engineering involves crafting carefully constructed inputs, or prompts, that condition a model's response and can encourage a generic language model to perform a specific task, rather than just optimizing for a single performance metric (like predicting the price of a house, or whether a tweet is mean). These large language models can even attempt to reason about and explain their decisions, drawing on common sense and real-world knowledge (when given access to it) to provide more nuanced and sophisticated responses.
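To make this concrete, here is a minimal sketch of how a prompt conditions a generic model into a task-specific one; the helper function and the exact wording are my own illustration, not any particular library's API:

```python
# Illustrative only: steering a generic language model toward a specific
# task (tweet sentiment) purely through how the input is phrased.

def build_prompt(tweet: str) -> str:
    """Condition a generic model to act as a 'is this tweet mean?' classifier."""
    return (
        "Classify the following tweet as MEAN or NOT MEAN.\n"
        f"Tweet: {tweet}\n"
        "Classification:"
    )

prompt = build_prompt("I love this library!")
print(prompt)
```

No weights are updated here; the "training" happens entirely in the input text, which is what makes this feel more like teaching than tuning.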
My untested intuition draws a parallel to optimization. Auto-regressive language models generate new tokens based on previous ones, answering the question "what should I say next given what I said?" By requesting that the model generate "thoughts" first, we encourage it to "think before speaking," producing higher-order information before attempting to arrive at the correct answer. If speaking one's thoughts is analogous to gradient descent, then chain of thought (CoT) — thinking before speaking — is similar to approximating momentum for a few cycles before taking the next step.
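The "think before speaking" idea can be sketched as the difference between two prompts for the same question; the example question and helper names are hypothetical, and the "Let's think step by step" phrasing follows the well-known zero-shot CoT trick:

```python
# Illustrative sketch: a direct prompt vs. a chain-of-thought prompt.
# The CoT version asks for intermediate "thoughts" before the final answer.

QUESTION = "If a train travels 60 miles in 1.5 hours, what is its average speed?"

def direct_prompt(question: str) -> str:
    # The model must commit to an answer immediately.
    return f"Q: {question}\nA:"

def cot_prompt(question: str) -> str:
    # The model is nudged to emit reasoning tokens first, conditioning
    # its eventual answer on its own intermediate "thoughts."
    return f"Q: {question}\nA: Let's think step by step."

print(direct_prompt(QUESTION))
print(cot_prompt(QUESTION))
```

In the optimization analogy, the extra reasoning tokens play the role of the accumulated momentum term: cheap intermediate state that shapes the direction of the final step.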