Research Proposal: Deep Reinforcement Learning for EV Charging Scheduling

This summer, under the supervision of Dr. Zohaib Akhtar, I am exploring a question at the intersection of machine learning and energy systems: can a reinforcement learning agent learn to schedule EV charging more adaptively than fixed rule-based strategies?

Project Summary

Unmanaged EV charging concentrates load during already-stressed grid periods, raising costs for charging operators and contributing to local grid congestion. Existing approaches, such as charging at full power on arrival or restricting charging to fixed off-peak windows, handle individual objectives reasonably well but cannot adapt to a constantly changing mix of vehicles with different deadlines, states of charge, and urgency.

This project investigates whether a deep reinforcement learning agent can learn a scheduling policy that reduces electricity cost and peak demand at a public EV charging station without compromising deadline satisfaction. The station is modelled with up to 8 simultaneous EVs. The agent operates at 30-minute timesteps and must allocate continuous power across active slots in response to live price signals and vehicle state. The research question is: can a SAC agent outperform rule-based heuristics across all three objectives (cost, peak demand, and deadline satisfaction) simultaneously?

Methodology

A custom Gymnasium environment simulates the charging station using stochastic arrivals parameterised from the ACN-Data dataset (Caltech, 2018–present). Electricity pricing follows the Octopus Agile half-hourly tariff, reflecting the cost structure a UK charging operator would actually face.

At each timestep the agent receives an observation vector of per-slot battery levels, time-to-departure, and current spot price, then outputs a continuous power allocation. The reward penalises electricity spend and applies a deadline-miss penalty for any vehicle that departs undercharged.
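As a rough illustration of the reward described above, a minimal sketch might look like the following. The function name `step_reward`, the `Departure` structure, the `miss_penalty` weight, and the choice to scale the penalty by the unmet fraction of each departing vehicle's request are all assumptions made for this example, not the project's actual reward design.

```python
from dataclasses import dataclass

STEP_HOURS = 0.5  # 30-minute timestep, as stated in the proposal


@dataclass
class Departure:
    """A vehicle leaving at this timestep (hypothetical structure)."""
    energy_delivered_kwh: float
    energy_requested_kwh: float


def step_reward(power_kw, price_per_kwh, departures, miss_penalty=10.0):
    """Negative electricity spend minus a penalty for undercharged departures.

    power_kw: power allocated to each slot during this timestep (kW).
    price_per_kwh: current electricity price for this half hour.
    departures: vehicles leaving now; each contributes its unmet fraction.
    miss_penalty: assumed weight trading off cost against missed deadlines.
    """
    energy_kwh = sum(power_kw) * STEP_HOURS
    cost = energy_kwh * price_per_kwh

    shortfall = 0.0
    for d in departures:
        if d.energy_requested_kwh > 0:
            unmet = max(0.0, d.energy_requested_kwh - d.energy_delivered_kwh)
            shortfall += unmet / d.energy_requested_kwh

    return -cost - miss_penalty * shortfall
```

With these assumed values, a step that draws 10 kW in total at a price of 0.20 per kWh costs 1.0, and a vehicle leaving with 15 of its requested 20 kWh adds a further penalty of 2.5. How large `miss_penalty` should be relative to typical costs is exactly the kind of reward-shaping choice the ablation is meant to stress-test.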

A SAC (Soft Actor-Critic) agent is trained for around 500K timesteps via Stable-Baselines3. SAC is chosen over PPO for its off-policy sample efficiency and suitability for continuous action spaces. After training, the agent is frozen and evaluated on 30 held-out test days across 3 random seeds, and compared against two baselines run under identical conditions: uncontrolled charging (full power on arrival) and time-of-use charging (charging restricted to fixed off-peak windows).
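The two baselines are simple enough to sketch directly. In the example below, the 7 kW per-slot charger rating, the off-peak window (steps 0 to 13, i.e. 00:00 to 07:00), and the function names are assumptions chosen for illustration; the proposal does not specify them.

```python
STEP_HOURS = 0.5     # 30-minute timestep, as stated in the proposal
STEPS_PER_DAY = 48   # half-hourly steps in one day
MAX_POWER_KW = 7.0   # assumed per-slot charger rating


def uncontrolled(remaining_kwh):
    """Full power for every slot that still needs energy.

    remaining_kwh: energy each slot still needs (0.0 for empty or full slots).
    Power is capped so a vehicle is not charged past its request in one step.
    """
    return [
        min(MAX_POWER_KW, r / STEP_HOURS) if r > 0 else 0.0
        for r in remaining_kwh
    ]


def time_of_use(remaining_kwh, step_index, off_peak=(0, 14)):
    """Charge at full power only inside a fixed off-peak window.

    off_peak: (start, end) half-hour step indices within the day, end exclusive.
    The default window of 00:00 to 07:00 is an assumed example.
    """
    start, end = off_peak
    if start <= step_index % STEPS_PER_DAY < end:
        return uncontrolled(remaining_kwh)
    return [0.0] * len(remaining_kwh)
```

Both functions take the same per-slot state and return a power allocation of the same shape as the agent's action, which is one way to keep the comparison under identical conditions.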

Analysis covers electricity cost, deadline satisfaction rate, peak power, and peak-to-average ratio. An ablation retrains the agent without the price signal to isolate how much of the cost reduction comes from price-responsive behaviour specifically. The agent's learned policy will also be visualised, for example as heatmaps of allocated power against price and time-to-departure.
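The four evaluation metrics can be computed from a day's logged allocations along these lines. This is a sketch only: the function name `evaluate_day` and the input layout (one list of per-slot powers per timestep, plus one boolean per completed session) are assumptions for the example.

```python
STEP_HOURS = 0.5  # 30-minute timestep, as stated in the proposal


def evaluate_day(power_kw_by_step, price_by_step, sessions_met):
    """Summarise one simulated day.

    power_kw_by_step: list of per-slot power lists, one per timestep (kW).
    price_by_step: electricity price per kWh for each timestep.
    sessions_met: one boolean per finished session, True if fully charged.
    """
    station_kw = [sum(slots) for slots in power_kw_by_step]

    # Energy per step is power times step length; cost weights it by price.
    cost = sum(kw * STEP_HOURS * price
               for kw, price in zip(station_kw, price_by_step))

    peak_kw = max(station_kw)
    mean_kw = sum(station_kw) / len(station_kw)
    peak_to_average = peak_kw / mean_kw if mean_kw > 0 else float("nan")

    satisfaction = (sum(sessions_met) / len(sessions_met)
                    if sessions_met else float("nan"))

    return {
        "cost": cost,
        "peak_kw": peak_kw,
        "peak_to_average": peak_to_average,
        "deadline_satisfaction": satisfaction,
    }
```

Running the same function over the agent and both baselines, on the same test days and seeds, gives directly comparable figures for each objective.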

Tools: Python, Gymnasium, Stable-Baselines3, NumPy, Pandas, Matplotlib, TensorBoard.

Expected Outcomes

The SAC agent should achieve a better trade-off than the rule-based heuristics: by deferring low-urgency vehicles to cheaper windows, it should reduce cost with little sacrifice in deadline satisfaction. The time-of-use baseline is the more interesting comparison. It will likely perform well on cost but struggle with short-stay sessions, where the fixed off-peak window doesn't align with a vehicle's departure time. The agent should handle these more gracefully by responding to individual urgency rather than a fixed schedule.

The ablation should confirm that the cost reduction comes from price-responsive behaviour. If removing the price signal has little effect, the reward shaping needs revisiting.

The most meaningful result won't be the headline cost figure but the policy visualisation. If the agent is learning sensible behaviour, it should allocate more power during cheap periods and throttle back during price spikes, with urgency overriding price as departure approaches. That would be the clearest evidence the agent is solving the problem as intended rather than just fitting the test days.

Potential Impact

Unmanaged EV charging concentrates load at peak hours, straining local distribution networks and pushing up electricity prices for everyone, including people who don't own an EV. Smarter scheduling is one part of the answer, and it doesn't require new hardware, just better software decisions about when each car actually needs to charge.

This project is a small-scale simulation, but companies like Octopus Energy and Pod Point are already building the aggregated charging platforms where this kind of learned scheduling becomes relevant at scale. In that sense, machine learning for EV charging is not just a technical optimisation problem, but part of how the energy transition can be made cheaper, fairer, and more reliable.
