An Exploration into Reinforcement Learning

June 7, 2026

I've been looking into and learning a lot about Reinforcement Learning lately, which is a method of Machine Learning that doesn't rely on copying original human work. Modern "AI" systems like LLMs and the various image/video generation tools rely on a Machine Learning technique known as "Supervised Learning". In Supervised Learning, the model is provided with a (usually quite large) corpus of labeled data: a set of inputs (like prompts, for example), and the outputs the machine is expected to produce in that scenario (such as a response, or an image).

Supervised Learning is, by its very nature, entirely dependent on the sum of everything that already exists in order to produce results. These types of machines can never engage in a single act of creation. They can never make something new, or discover something that they weren't directly shown during training. As it happens, they're also quite bad at learning in general. They shut their minds off entirely once they're no longer in training. Once deployed, they cannot learn. In the case of an LLM, it might be able to simulate learning by retaining something you told it in its context window, but once the conversation gets long enough, that info will be dropped, and it never gets integrated into the model itself. It has no ability to learn new things in a long-term way once it's no longer training.

Contrast that with the way Reinforcement Learning (RL) works. An RL agent has no substantive distinction between training and deployment. Sure, you can go in and artificially disable learning after you feel you've "got it right" (especially if you're deploying into a high risk application and don't want it to act unpredictably down the road), but it's not the natural state of things for it to do that. RL agents are always learning, and they learn through experience, not by being fed a list of "right" and "wrong" answers.

This simple fact seems (to me) like a very good argument for saying that Reinforcement Learning is the only way forward for AI systems. I also happen to feel as though this kind of system would be a much more ethical technology than the current fad of AI tech being sold to us right now. For one thing, it wouldn't require Big Tech companies to collect all kinds of data on us as a means to train them. Additionally, the machine wouldn't be regurgitating remixed versions of the creative outputs of artists, authors, musicians and the like. It's not a technology that is suited to mimicry. Reinforcement Learning AIs aren't really capable of becoming plagiarism bots, at least in the sense that they wouldn't be reproducing the work of others without providing citations.^[1] They're also probably not going to be very good at language for quite a long time. After all, language is merely the expression of a number of deep, very complicated intellectual processes that are going on inside us. It's a tool for communicating those intellectual concepts to one another. LLMs are only able to get away with language because they took the easy way out. They skip the thinking, and go straight to reproducing words in orders that seem right based on their training data. LLMs are not aware of the words they're typing. It's the Chinese Room thought experiment. LLMs are given a set of tokens (numbers), which correspond to words or half-words. Your prompt is a string of numbers to the model, and it responds with the string of numbers that seems right.

If a Reinforcement Learning agent were to perform language, it would not do so like this. It would learn language as a tool for communication regarding whatever task was at hand. It would learn to communicate in the same way that biological organisms in nature did.^[2] The agents would exist in the world, and then slowly develop an ability to communicate threats or describe resources. Over time, the language would become more sophisticated. Having it learn a human language like English or what have you might be more difficult of a task, although I don't see any reason why it couldn't be persuaded to use noises humans understand instead of whatever medium it would have done otherwise.

I hope so far that I've convinced you of the ways that Reinforcement Learning is uniquely separate from what we've lately come to understand "AI" to be. We think of AI today as lazy, plagiaristic slop-bots that steal work from real humans, memorizing data in their weights. We also think of it as being unoriginal, and incapable of dealing with a novel problem in a novel way. It is only able to solve new problems by applying old methods. This is all true of the short-sighted approach that current offerings have taken. But there is a whole field of Machine Learning which seeks to produce machines that are not like this.

I would also like to say right here up front that Reinforcement Learning doesn't seem to me (a very much non-expert) to be well-suited to the task of doing difficult things like writing essays or screenplays or other tasks that our bosses seem to be frothing at the mouth at, in hopes of firing all of the world's workforce. I (again, just an undergraduate college student) feel like this kind of tech would be way better as a tool for handling truly boring, data-driven tasks. It might one day become capable enough to maybe perform tasks that require certain amounts of very basic intuition, but even that seems to be a stretch, given the enormous computational constraints we're already starting to discover with LLMs. I don't think a sober economy is going to be willing to put up a massive data center in every city in America to fund a technology that isn't sellable as "we're building god". In any case, I do find this field to be really interesting. I think it would be fun to make some little digital robot creatures that could actually do interesting stuff, instead of just kind of ruining the internet.

So how does it work?

Well, that was a long-winded way for me to introduce the concept, and defend it against some (possibly quite reasonable) ethical concerns. I hope I've done it justice. In this section, I'd like to go over a bit of how it actually works. Fair warning, there's gonna be a bit of computer math involved, but I hope you'll be able to push through it; I find this stuff really quite interesting.

A lot of my knowledge on this topic comes from Richard Sutton and Andrew Barto's textbook, Reinforcement Learning: An Introduction, which has a free PDF you can legally share with your friends, thanks to being licensed under CC BY-NC-ND 2.0. I hope that this blog post is much more to the point than the textbook, with fewer technical detours. I'm no expert, so take everything with a grain of salt, but all of what I'm talking about I have successfully implemented for a simple agent (which I discuss at the end).

Guessing rewards

The basic principle of Reinforcement Learning (in my opinion) is an algorithm that tries to guess the reward it will receive for taking various actions available to it. Rewards are expressed as a numerical value. The most intuitive way to try to guess rewards is to just keep track of an average of rewards you've received for each action. Use the past as a way to try to predict the future. If we define a Quality function $Q(a)$ to represent our predicted reward we will receive for taking some action $a$ , we can express it as:

$Q(a) = \frac{\text{(sum of rewards)}}{\text{(\# of attempts)}} = \frac{\displaystyle\sum_{i=1}^{N} R_i}{N},$ where $R_i$ is the reward received at some step $i$ .

However, there is a slightly easier way to handle calculating a rolling average. When taking one data point $R_i$ at a time, and trying to calculate the average $Q(a)$ after some number of data points $n$ , you can use the following assignment^[3]:

$Q_\text{new}(a) = Q_\text{old}(a) + \frac{1}{n+1} [R_i - Q_\text{old}(a)]$

Where $n$ is the number of datapoints included in $Q_\text{old}(a)$ . You'll notice that if there have been no datapoints yet (i.e., $n=0$ ), then the value of $Q_\text{old}(a)$ has no impact on $Q_\text{new}(a)$ . In the literature, this assignment is often rewritten further as^[4]:

$Q(a) \leftarrow Q(a) + \alpha [R_i - Q(a)]$

With the assumption^[5] that $\alpha = \frac{1}{n+1}$ . As it happens, this single, simple-seeming equation forms basically the entire basis of Reinforcement Learning. We still have no mechanism for actually picking actions, but this $Q$ function is our way of ranking actions against each other. More complexity is eventually introduced to this function in order to account for cause-and-effect, but for now, we can operate with just this.
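To make the running-average update concrete, here is a minimal Python sketch. It checks that the one-at-a-time update with $\alpha = \frac{1}{n+1}$ lands on the same number as the plain "sum divided by count" average, and it also shows the constant-$\alpha$ variant. The function names and the sample rewards are my own placeholders, not anything from the textbook.

```python
import random

def incremental_mean(rewards):
    """Running average via Q <- Q + (1/(n+1)) * (R - Q), where n counts prior samples."""
    q = 0.0  # arbitrary start; the first reward overwrites it because n = 0
    for n, r in enumerate(rewards):
        q += (r - q) / (n + 1)
    return q

def constant_alpha_mean(rewards, alpha, q0=0.0):
    """Same update with a fixed step size, which weights recent rewards more heavily."""
    q = q0
    for r in rewards:
        q += alpha * (r - q)
    return q

# Made-up rewards: noisy samples around 1.0
rng = random.Random(0)
rewards = [rng.gauss(1.0, 0.5) for _ in range(1000)]
batch = sum(rewards) / len(rewards)

print("batch average:      ", batch)
print("incremental average:", incremental_mean(rewards))
print("constant alpha=0.1: ", constant_alpha_mean(rewards, alpha=0.1))
```

The first two numbers should agree up to floating-point rounding. The third generally won't, since a constant step size never stops favoring the most recent rewards.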

Example: K-Armed Bandit

Suppose you're a Reinforcement Learning agent. In front of you are a number of slot machines ("one-armed bandits"). Each slot machine returns a random value centered on a given (unknown) setpoint. Some machines will have higher setpoints than others. You want to choose a course of action such that you maximize your long-term yield from the machines after some number of turns. You can only pull the arm of one machine at a time. We refer to this collection of machines as a "K-armed bandit", where $k$ represents the number of machines total. Basically, the strategy you want to employ is to discover the machine with the highest setpoint, so as to maximize your expected yield.

Let's ignore for now how you would pick an action, and instead just focus on how you might keep track of your $Q(a)$ function.

You start by initializing all your $Q$ values at any arbitrary value (it will be overwritten soon, since $n=0$ ). Then, when you pull on some machine $a=0,1,2,\ldots$ you update your quality function by including that reward $R$ from action $a$ in your average for $Q(a)$ . Here's how that would look as an algorithm:

$\begin{aligned} &\text{//initialize values}\newline &\text{for }a = 0, 1, 2, \ldots, k-1: \newline &\quad Q(a) \leftarrow 0\newline &\quad N(a) \leftarrow 0\newline &\text{//start playing}\newline &\text{loop:}\newline &\quad a \leftarrow \text{(pick some action)}\newline &\quad R \leftarrow \text{takeAction\_getReward}(a)\newline &\quad \text{//increment }N\text{ for this action}\newline &\quad N(a) \leftarrow N(a) + 1\newline &\quad \text{//update the new running average of rewards for }a\newline &\quad Q(a) \leftarrow Q(a) + \frac{1}{N(a)}[R - Q(a)] \end{aligned}$

Picking an action

The only thing we really have to do after getting an up-to-date guess about the value of each action $a$ is to determine which one to pick. In most circumstances, we should just pick whatever is highest. In the literature, they refer to this as "exploiting" your prior knowledge/experience, and it's a good thing to do. However, if you exploit at all times even when you've only just started out, you run the risk of settling into a sub-optimal routine. Something that's good, but not the best possible.
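Here is a tiny Python sketch of how pure exploitation can get stuck. It assumes two hypothetical machines with fixed payouts (no randomness at all, just to keep the point obvious), and an agent that always pulls whichever arm currently looks best.

```python
def greedy_run(payouts, steps):
    """Always exploit: pull the arm with the highest current estimate."""
    Q = [0.0] * len(payouts)
    N = [0] * len(payouts)
    for _ in range(steps):
        a = max(range(len(Q)), key=lambda i: Q[i])  # ties go to the lowest index
        R = payouts[a]                               # fixed payout, for clarity
        N[a] += 1
        Q[a] += (R - Q[a]) / N[a]
    return Q, N

Q, N = greedy_run(payouts=[1.0, 2.0], steps=100)
print(Q, N)  # [1.0, 0.0] [100, 0]
```

The agent tries the first machine, sees a reward of 1, and never touches the second machine again, even though it pays double.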

Better than this would be to exploit most of the time, and pick a random action with some very small probability $\varepsilon$ , where $0<\varepsilon<1$ . This taking of a random action is called "exploring". We can then pick our action based on the following:

$\text{action} = \begin{cases} \text{(Exploit), with } p = 1-\varepsilon \newline \text{(Explore), with } p = \varepsilon \end{cases}$

So that within our code, we can assign our action as:

$\text{action} \leftarrow \begin{cases} \text{argmax}_a Q(a)\text{, with } p = 1-\varepsilon \newline \text{random(A), with } p = \varepsilon \end{cases}$

Where $\text{argmax}_a Q(a)$ simply returns the $a$ such that $Q(a)$ has the largest value. If $Q$ is an array, and $a$ is the index corresponding to the value, then $\text{argmax}_a Q(a)$ just returns the index of the largest value in the array. Additionally, $\text{random(A)}$ returns a random action in the set of all possible actions, $A$ . So to update our k-armed bandit algorithm, we can express it as:

$\begin{aligned} &\text{//initialize values}\newline &\text{for }a = 0, 1, 2, \ldots, k-1: \newline &\quad Q(a) \leftarrow 0\newline &\quad N(a) \leftarrow 0\newline &\text{//start playing}\newline &\text{loop:}\newline &\quad a \leftarrow \begin{cases} \text{argmax}_a Q(a)\text{, with } p = 1-\varepsilon \newline \text{random(A), with } p = \varepsilon \end{cases}\newline &\quad R \leftarrow \text{takeAction\_getReward}(a)\newline &\quad \text{//increment }N\text{ for this action}\newline &\quad N(a) \leftarrow N(a) + 1\newline &\quad \text{//update the new running average of rewards for }a\newline &\quad Q(a) \leftarrow Q(a) + \frac{1}{N(a)}[R - Q(a)] \end{aligned}$

You'll notice that the only undefined function now is the $\text{takeAction\_getReward}(a)$ function. This function will be wired directly into the agent's external environment. It is responsible for making the agent actually pull the levers. In more complicated applications, it may be the function responsible for actually sending electrical signals to the hardware of a robot, or for sending API requests in order to take actions in an online chess or poker game.
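For anyone who'd rather read code than pseudocode, here is a rough Python sketch of the whole k-armed bandit loop. The environment is an assumption on my part: `make_bandit` is a hypothetical stand-in for $\text{takeAction\_getReward}$ that draws each reward from a Gaussian centered on that machine's setpoint, and the setpoints themselves are made up.

```python
import random

def make_bandit(setpoints, noise=1.0, rng=random):
    """Hypothetical environment: Gaussian reward centered on each arm's setpoint."""
    def take_action_get_reward(a):
        return rng.gauss(setpoints[a], noise)
    return take_action_get_reward

def epsilon_greedy(Q, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))
    return max(range(len(Q)), key=lambda i: Q[i])

def run_bandit(take_action_get_reward, k, steps, epsilon=0.1, rng=random):
    Q = [0.0] * k  # estimated reward per arm
    N = [0] * k    # number of pulls per arm
    for _ in range(steps):
        a = epsilon_greedy(Q, epsilon, rng)
        R = take_action_get_reward(a)
        N[a] += 1
        Q[a] += (R - Q[a]) / N[a]  # running average of rewards for arm a
    return Q, N

rng = random.Random(42)
setpoints = [0.2, 1.5, -0.4, 0.9]  # made-up values, unknown to the agent
Q, N = run_bandit(make_bandit(setpoints, rng=rng), k=len(setpoints), steps=5000, rng=rng)
print("estimates:", [round(q, 2) for q in Q])
print("pulls:    ", N)
```

With enough steps, the estimates should drift towards the setpoints, and most of the pulls should pile up on whichever arm has the highest one.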

Rewards

With what we have already, we've basically defined our agent completely. However, outside of the agent, we do still need to define how our rewards will be given to the agent. What's important to realize about these rewards is that they aren't just helpful clues for the agent. The reward signal is, in fact, the sole thing that the agent will ever care about. Hook up the rewards to something you want it to do, and you'll be pleased with the results. Hook it up to something that it can find a loophole to exploit without accomplishing your task, and you will find yourself disappointed. I feel I've already overextended myself in terms of technical depth in this post, so I won't be giving any specific pointers on how to choose reward drops, but take comfort in the fact that this has hardly anything to do with difficult math, which is a nice contrast compared to the previous two sections.

Gridworld Proof-of-concept

So in learning this, I decided to actually code something up that implements the Reinforcement Learning algorithm. I created a simple simulated world in which an agent is able to control an "animal". The animal can move in 4 directions. The animal is given state information $s$ based on the tiles that are nearby it within a given radius. Most tiles are empty, but some have food. If the animal moves to a food tile, the food is consumed, and the agent receives a reward. At each step where the agent moves somewhere and doesn't get food, it receives a very small negative reward. As the animal moves around it runs out of energy, which is regained by eating food. If the animal runs out of energy completely, it dies. If the animal passes a certain threshold of high energy, it reproduces, losing half of its energy to the baby. The child then has a clone of the parent's agent placed into it (as well as a copy of the parent's quality function $Q$ ). As time goes on, food is spawned into the world.
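The energy rules above can be sketched in a few lines of Python. To be clear, this is not the code from my repo: the class, the function name, and every number in it (move cost, food energy, reproduction threshold) are hypothetical placeholders, just to show the bookkeeping.

```python
import copy
from dataclasses import dataclass, field

# All constants here are made-up placeholders
MOVE_COST = 1.0
FOOD_ENERGY = 10.0
REPRODUCE_AT = 30.0

@dataclass
class Animal:
    energy: float = 10.0
    Q: dict = field(default_factory=dict)  # quality function, keyed by (state, action)

def after_move(animal, ate_food):
    """Apply one move's energy rules; returns (alive, child_or_None)."""
    animal.energy -= MOVE_COST
    if ate_food:
        animal.energy += FOOD_ENERGY
    if animal.energy <= 0:
        return False, None  # out of energy: the animal dies
    if animal.energy >= REPRODUCE_AT:
        animal.energy /= 2  # half of the energy goes to the baby
        child = Animal(energy=animal.energy, Q=copy.deepcopy(animal.Q))
        return True, child
    return True, None

parent = Animal(energy=25.0, Q={("some_state", "up"): 1.0})
alive, child = after_move(parent, ate_food=True)
print(alive, parent.energy, child)
```

The deep copy matters here: the child starts with the parent's $Q$ values, but from then on the two learn from their own separate experiences.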

Since this example actually has state information, we need to keep track of that in our quality function as well. This is actually quite simple, as it turns out. We just go from using $Q(a)$ to using $Q(s, a)$ , where $s$ is the state we are in. This increases the problem space in a big way, but all the same methods are still perfectly workable, as long as we're okay waiting a while for it to learn by trying things.

Since in gridworld, one state leads into another, causing consequences down the road, we also had to introduce an estimate, $R_\Sigma$ , for how much reward we think we'll achieve long-term following an action. This estimate is based on our current values for the quality function, $Q$ , meaning that as we improve the accuracy of our quality function, we'll also improve the accuracy of $R_\Sigma$ . So as not to over-value the less immediate rewards, we introduce a discounting factor, $0 < \gamma < 1$ , so that we could define:

$R_\Sigma = R_t + \gamma\text{max}_a Q(s_{t+1}, a) \approx R_{t} + \gamma R_{t+1} + \gamma^2 R_{t+2} + \ldots = \displaystyle{\sum_{i=0}^{\infty} \gamma^i R_{t+i}}$

All this really does is basically say that "even though literally taking random actions will eventually result in getting infinite rewards (after infinite time), we would actually prefer sooner rewards over later rewards." Basically, every subsequent reward is less valuable to us than the one we're about to receive (since $0 < \gamma < 1$ ). Your choice of $\gamma$ is a matter of preference. Tweak it as needed.
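As a quick sanity check on the discounting idea, here is a small Python sketch that adds up a list of rewards with each one scaled by $\gamma^i$ . The function name and the reward lists are made up for illustration. It also shows why the total stays finite: if every reward were a constant $r$ , the sum is a geometric series that approaches $\frac{r}{1-\gamma}$ .

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * rewards[i], accumulated from the last reward backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

gamma = 0.9
# Three rewards of 1: this is 1 + 0.9 + 0.81
print(discounted_return([1.0, 1.0, 1.0], gamma))
# A long run of constant rewards approaches r / (1 - gamma)
print(discounted_return([1.0] * 1000, gamma), 1.0 / (1.0 - gamma))
```

A larger $\gamma$ makes the agent more patient, while a smaller one makes it care mostly about the next reward or two.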

For ease of syntax, we can define $S$ as the set of all possible states, and $A$ the set of all possible actions (in our case, $A = \text{\{up, down, left, right\}}$ ). Our algorithm for gridworld is:

$\begin{aligned} &\text{//initialize values}\newline &\text{for each }a\in A, s\in S: \newline &\quad Q(s, a) \leftarrow 0\newline &\quad N(s, a) \leftarrow 0\newline &\text{//start playing}\newline &\text{loop:}\newline &\quad s \leftarrow \text{getState}()\newline &\quad a \leftarrow \begin{cases} \text{argmax}_a Q(s, a)\text{, with } p = 1-\varepsilon \newline \text{random(A), with } p = \varepsilon \end{cases}\newline &\quad R \leftarrow \text{takeAction\_getReward}(a)\newline &\quad \text{//}s'\text{ is our state after taking action }a\newline &\quad s' \leftarrow \text{getState}()\newline &\quad \text{//}R_\Sigma \text{ is our expected long term rewards for taking action }a\newline &\quad R_\Sigma \leftarrow R + \gamma\text{max}_a Q(s', a)\newline &\quad \text{//increment }N\text{ for this action}\newline &\quad N(s, a) \leftarrow N(s, a) + 1\newline &\quad \text{//update the new running average of rewards for }a\newline &\quad Q(s,a) \leftarrow Q(s, a) + \frac{1}{N(s, a)}[R_\Sigma - Q(s, a)] \end{aligned}$

This new algorithm requires us to define a new function $\text{getState}()$ which returns the state in some format that you believe to be most appropriate for the agent to digest. This state will usually be a vector (a list of numbers). You'll notice that very little changed really. The overall structure of the algorithm remains the same for state or stateless agents. Of course, very few stateless problems need to be solved with Reinforcement Learning, and so we fix our gaze on state-based problems.
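Here is one way a $\text{getState}()$ function could look in Python. This is a sketch under my own assumptions, not necessarily how the repo does it: food is stored as a set of `(x, y)` coordinates, and the state is the square window of tiles around the animal, flattened row by row into a tuple of 0s and 1s.

```python
def get_state(food, pos, radius=1):
    """Flatten the (2*radius+1)^2 window around pos into a tuple: 1 = food, 0 = empty."""
    x, y = pos
    return tuple(
        1 if (x + dx, y + dy) in food else 0
        for dy in range(-radius, radius + 1)
        for dx in range(-radius, radius + 1)
    )

food = {(2, 1), (0, 0)}  # hypothetical food locations
print(get_state(food, pos=(1, 1)))  # (1, 0, 0, 0, 0, 1, 0, 0, 0)
```

Returning a tuple (rather than a list) means the state can be used directly as a dictionary key for $Q(s, a)$ . It also makes the size of the problem visible: with a radius of 1 there are 9 tiles, so there are at most $2^9 = 512$ distinct states, and that number grows very quickly with the radius.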

You'll also notice that since the actions available to us remain the same regardless of the state, we didn't need to specify $s$ when referring to $A$ . If that weren't the case, we would need to refer to our set of possible actions as $A_s$ , updating our algorithm accordingly.
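To tie the state-based algorithm together, here is a short Python sketch of the same update loop. Instead of the full gridworld, it uses a toy world I made up for this example: a corridor of cells 0 through 4 with food at cell 4, where eating the food sends the animal back to cell 0. The `step` function is a hypothetical stand-in for $\text{takeAction\_getReward}$ and $\text{getState}$ combined, and the reward values are placeholders.

```python
import random
from collections import defaultdict

ACTIONS = ("left", "right")
GOAL = 4  # hypothetical corridor: cells 0..4, food at cell 4

def step(s, a):
    """Toy environment: returns (next_state, reward)."""
    s2 = max(0, s - 1) if a == "left" else s + 1
    if s2 == GOAL:
        return 0, 1.0   # ate the food; start over at cell 0
    return s2, -0.01    # small negative reward for a step without food

def q_learning(steps, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(s, a)], starts at 0
    N = defaultdict(int)    # N[(s, a)], starts at 0
    s = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.choice(ACTIONS)                         # explore
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])   # exploit
        s2, R = step(s, a)
        # R_sigma: immediate reward plus discounted best guess from the next state
        r_sigma = R + gamma * max(Q[(s2, act)] for act in ACTIONS)
        N[(s, a)] += 1
        Q[(s, a)] += (r_sigma - Q[(s, a)]) / N[(s, a)]
        s = s2
    return Q

Q = q_learning(steps=20000)
for s in range(GOAL):
    print(s, {a: round(Q[(s, a)], 3) for a in ACTIONS})
```

Since the food is always to the right, I'd expect the "right" values to end up ahead of the "left" ones, most clearly in cell 3, which sits right next to the food.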

See it in action

So, let's get a look at how this thing works.

All the code for the project is available from the GitHub repo. If you'd like, you can download the code and then run it by pasting the following into your terminal:

git clone https://github.com/rseeber/gridWorld-rl
cd gridWorld-rl
python agent_Q.py

It's going to prompt you, asking how many training rounds you want it to run. Remember, these agents are acting entirely randomly before being trained, so pick a number large enough for them to properly learn their environment from experience. I've found that somewhere in the range of 10,000 to 50,000 rounds tends to be enough.

This "training" stage is no different from the part afterwards, except that it doesn't show what's going on to the user until after the initial training stage.

Analysis

There are lots of behaviors that the agents engage in which raise certain questions. Most often: "are they behaving this way because that's actually beneficial in this environment, or have they just not had enough time to fully learn that there are better ways to behave?" This is mostly a question when I see them doing inefficient things, like wasting movement going nowhere instead of heading towards food that they can see.

One possible explanation is that they've learned over time that going too aggressively for food results in explosions in population (since they cannot control when they reproduce — it occurs spontaneously once they've stored enough food energy), inevitably leading to overgrazing and a subsequent shortage of food for the parent.

Thus, the agent "decides"^[6] it would be more optimal to take a conservative strategy: wasting time, so as not to exceed the carrying capacity of the environment. It's interesting, since the behaviors are being determined both by Natural Selection and by the goals of the agent (which come from the reward function we set). So the agent is evolving to behave in a way that results in less of what you might call "suffering"^[7] for the agent.

To my own limited understanding of Biology, this seems to me to be fairly unique. Since heritable traits (which I shall call "genes") are almost always passed down through DNA, it means that your genes cannot be pointed in the direction of behaving in ways that reduce suffering. They merely point towards behaving in ways that keep you and your family/tribe/species alive. The only exception to this would be the extent to which culture is also a method for passing down heritable traits. Humans are very strongly influenced by culture, as are many other animals to what is likely a lesser extent. Individuals within a population can choose to neglect inheriting a cultural trait if they feel that doing so would reduce their own personal suffering (even if it would have no effect on survival of the individual or the group). This is not a possibility with DNA, but it is a possibility in cultural inheritance.

These RL agents are then made entirely out of genes that can be modified in such ways as to reduce the "suffering" (I squirm to use the word a second time; I wouldn't want to give the impression that these things are actually alive or aware of anything) of the agents. Individual agents can realize that certain behaviors (coded as genes) lead to fewer rewards than other options, despite the fact that neither option changes the survival rates for the group or individual. The agent adjusts its quality function following experience, and just like that, it updates its genome to prefer higher rewards. The agents are able to evolve via the reward function, not just through Natural Selection.

Conclusion

Well, maybe that got a little philosophical there, but I hope the significance of that fact is maybe noteworthy to some people. Humans are capable of defying our own hardwired habits, but that doesn't mean it's always easy! Yet for these agents, at least in this particular environment, their main mechanism of evolution seems to be based on behaving in ways that maximize their goals (their reward function), not just what results in higher biological fitness.

So, to recap what we covered:

Reinforcement Learning is an experience-based method for doing Machine Learning.
At least in simpler applications, the basic strategy is to label the "quality" of various actions, or state-action pairs. That is, we decide what the quality is for taking some action $a$ , in a given state $s$ .
In order to avoid sub-optimal (but still fairly okay) paths, we occasionally decide to "Explore" a random path, rather than "Exploiting" our previous knowledge. We Exploit most of the time, but we occasionally Explore, with some probability $\varepsilon$ , where $0 < \varepsilon < 1$ .

There's of course a lot more complexity in the theory than what I showed off. For instance, I only described the basic details of how Q-learning and TD-learning work (which are the methods I used for the gridworld algorithm). If any of this was confusing (I'm sure some of it was), you can totally feel free to email me. I love talking about this stuff, and I'm sure it would also help me refine my own understanding!

[1] I suppose there's nothing necessarily stopping a student from stealing the (original) work of an RL agent, and passing it off as their own. Though I'd argue that would be a lot less detrimental to society than what is currently happening: a student stealing the output of an LLM, which in turn was stolen from a number of different human authors. Though, I don't know that we'll ever create RL agents that can write essays anyway.
[2] I say organisms plural because humans are not, in fact, the only species to develop communication. We might be the most advanced at it, but many species communicate with one another. Additionally, many communication calls are cultural, varying not by species, but by region. Birds of the same species in one part of the US have different mating calls than those in another part of the US. This would seem to indicate a kind of dialect or primitive "language" among animals (if you'll allow a slight abuse of the term).
[3] Where this equation comes from might be a bit mysterious to some readers. You can actually prove that it's true by starting at the first equation and slowly rewriting it until you get to this one. If you'd like some more intuition about this version, you can think of the $[R_i - Q(a)]$ as the "error correction" between your predicted value $Q(a)$ and a datapoint $R_i$ , so that you're just incrementing your function by $\frac{1}{n+1}$ of that correction.
[4] Note that $a \leftarrow b$ is equivalent to a = b in popular modern programming languages like C or Python.
[5] By changing the value of $\alpha$ , we're actually able to achieve various results, such as introducing a recency bias to the agent by setting $\alpha$ to be a constant value. There are also various things you can do if you set $\alpha$ to be some function of $n$ other than $\alpha = \frac{1}{n+1}$ .
[6] Perhaps it's not the agent deciding. The way I'm thinking of it is in terms of the way the Evolutionary Biologist Richard Dawkins frames genetic evolution in The Selfish Gene. Here, the quality function would be our so-called "genetic code". Since this genome is passed down (perfectly identically) to children (but with the ability to mutate following the child's own unique experiences), I think it works in an analogous way. That is to say, maybe it's not the agent (the organism which carries genetic code) deciding to behave one way or another. Perhaps it is the genetic code deciding to code for inefficient paths, since those genes that code for excessive food consumption tend to result in the parent and children going hungry longer, and thus the gene either being overwritten, or else the organism starving to death, ending the entire bloodline.
[7] Not to say the agent is sentient or conscious in any meaningful way, but no word seems to exist for "experience a denial of what you want" that is meant to be used for non-sentient agents.