Enhancing coagulation prediction in water treatment using a similarity score-based piecewise machine learning model

Authors: Jie Zhang, Noah Taylor, Wen Zhao, Ran Rui, Charlie He

Journal of Environmental Engineering

Water treatment plants typically rely on jar testing to determine optimal coagulant dosages, a time-intensive, manual process that doesn’t always respond quickly to changing water conditions. As utilities face increasing demands for efficiency and sustainability, new tools are needed to enhance decision-making.

A new study led by Carollo’s Jie Zhang, Noah Taylor, Wen Zhao, Ran Rui, and Charlie He is redefining how we approach coagulation in water treatment. Published in the Journal of Environmental Engineering, the research applies machine learning (ML) to historical plant data to offer a more precise and automated way to predict coagulant demand and improve effluent quality. This approach could change how water utilities set and adjust coagulant doses.

Machine Learning Offers a Data-Driven Alternative

To explore ML’s potential, Carollo’s team used three years of data from a surface water purification plant in Texas. The dataset included more than 3,000 samples, each with raw water quality measurements (such as turbidity, total organic carbon (TOC), and pH), chemical doses, and the resulting settled water characteristics.

Fourteen machine learning models were evaluated across two types of predictions. The first focused on predicting effluent quality using raw water quality parameters and coagulant dose as inputs. The second aimed to predict the required coagulant dose based on raw water quality and a specified target for effluent quality.
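As a rough illustration of the two prediction tasks, the sketch below arranges one plant record into each input/target layout. The field names, units, and sample values are hypothetical and not taken from the study, and using the observed settled turbidity as the effluent target for the dose task is an assumption about how historical records would be used for training.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical fields; the study lists turbidity, TOC, and pH among its inputs.
    raw_turbidity: float      # NTU (assumed unit)
    raw_toc: float            # mg/L (assumed unit)
    raw_ph: float
    coagulant_dose: float     # mg/L (assumed unit)
    settled_turbidity: float  # NTU (assumed unit)


def effluent_task(s):
    """Task 1: raw water quality + coagulant dose -> settled water quality."""
    features = [s.raw_turbidity, s.raw_toc, s.raw_ph, s.coagulant_dose]
    return features, s.settled_turbidity


def dose_task(s):
    """Task 2: raw water quality + effluent quality target -> coagulant dose."""
    # Assumption: in historical data, the observed settled turbidity
    # stands in for the operator's effluent target.
    features = [s.raw_turbidity, s.raw_toc, s.raw_ph, s.settled_turbidity]
    return features, s.coagulant_dose


# Made-up values, for illustration only.
sample = Sample(raw_turbidity=12.0, raw_toc=4.5, raw_ph=7.8,
                coagulant_dose=35.0, settled_turbidity=0.9)
x1, y1 = effluent_task(sample)
x2, y2 = dose_task(sample)
print("Task 1:", x1, "->", y1)
print("Task 2:", x2, "->", y2)
```

The same record feeds both tasks; only the column treated as the target changes.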

Why Input Similarity Matters in Model Accuracy

An important takeaway from the study was the impact of data splitting methods. Models evaluated on random train/test splits scored higher, but time-based splits, in which a model is trained on earlier data and tested on later data, produced less accurate predictions even though they more closely resemble real operational forecasting. This insight led to the introduction of a new metric: the input similarity score, which measures how closely incoming data matches the past conditions the model was trained on.
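The effect of the split method can be sketched with synthetic data. The article's exact definition of the input similarity score is not reproduced here, so the version below is an assumption: it standardizes each feature using the training data and maps the distance to the nearest training sample onto a score between 0 and 1. All function names and the drifting toy dataset are hypothetical.

```python
import math
import random


def time_based_split(rows, train_fraction=0.8):
    """Rows must be in chronological order; the newest rows are held out."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def random_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows, then hold out the tail."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_scaler(train):
    """Per-feature mean and standard deviation of the training inputs."""
    n = len(train)
    cols = list(zip(*train))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return means, stds


def similarity_score(x, train, means, stds):
    """Assumed definition: 1 / (1 + distance to the nearest training sample)."""
    def standardize(row):
        return [(v - m) / s for v, m, s in zip(row, means, stds)]
    zx = standardize(x)
    nearest = min(math.dist(zx, standardize(t)) for t in train)
    return 1.0 / (1.0 + nearest)


def mean_similarity(train, test):
    means, stds = fit_scaler(train)
    return sum(similarity_score(x, train, means, stds) for x in test) / len(test)


# Synthetic, illustrative data: two inputs that drift steadily over 100 days.
rows = [[float(day), 7.0 + 0.01 * day] for day in range(100)]

time_mean = mean_similarity(*time_based_split(rows))
random_mean = mean_similarity(*random_split(rows))
print(f"mean similarity, time-based split: {time_mean:.2f}")
print(f"mean similarity, random split:     {random_mean:.2f}")
```

With drifting data, the held-out samples in a time-based split sit farther from anything seen in training, so their scores are lower than in a random split, where test samples are interleaved with training samples.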

A Hybrid Model Designed for Real-World Conditions

To address the drop in accuracy on conditions unlike the training data, the team created a piecewise hybrid model using three ML algorithms: Random Forest, Linear Support Vector Regressor, and Multilayer Perceptron. The model dynamically selects the best-performing algorithm for each input based on its input similarity score, resulting in more reliable predictions across various operating conditions.
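The selection logic can be sketched as a lookup over score bands. This is a minimal sketch, not the study's implementation: the band thresholds, the assignment of each algorithm to a band, the stand-in models, and the toy score function are all hypothetical, since the summary does not state which algorithm serves which range.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[Sequence[float]], float]


class PiecewiseModel:
    """Routes each input to one model according to its similarity score."""

    def __init__(self, score_fn: Callable[[Sequence[float]], float],
                 bands: List[Tuple[float, str, Model]]):
        # Each band is (minimum_score, name, model); check highest bands first.
        self.score_fn = score_fn
        self.bands = sorted(bands, key=lambda b: b[0], reverse=True)

    def predict(self, x: Sequence[float]):
        score = self.score_fn(x)
        for min_score, name, model in self.bands:
            if score >= min_score:
                return model(x), name, score
        raise ValueError(f"no band covers score {score!r}")


def toy_score(x):
    # Stand-in for a real similarity score: 1.0 at a raw turbidity of 10,
    # falling toward 0 as the input moves away from it.
    return 1.0 / (1.0 + abs(x[0] - 10.0))


# Stand-in predictors; real ones would be trained regressors.
hybrid = PiecewiseModel(
    score_fn=toy_score,
    bands=[
        (0.8, "random_forest", lambda x: 30.0),            # hypothetical band
        (0.4, "mlp", lambda x: 32.0),                      # hypothetical band
        (0.0, "linear_svr", lambda x: 20.0 + 1.0 * x[0]),  # hypothetical band
    ],
)

for x in ([10.0], [11.0], [19.0]):
    dose, name, score = hybrid.predict(x)
    print(f"input={x} score={score:.2f} model={name} dose={dose}")
```

Keeping the routing separate from the individual models means the bands can be retuned on validation data without retraining any of the three algorithms.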

Bridging Machine Learning with Operations via Digital Twins

The final model was integrated into the plant’s digital twin using the ONNX (Open Neural Network Exchange) format, allowing real-time updates and intelligent recommendations. This reduces reliance on manual testing, improves process efficiency, and supports data-informed operations.

The Future of Machine Learning in Water Treatment

This research shows how ML can enhance core treatment processes like coagulation, offering utilities a path toward smarter, more adaptive operations. As ML tools become more accessible and integrated with plant systems, the potential for scalable impact is enormous.

Check out the full article for a deep dive into the model architecture, evaluation metrics, and real-world results.

Citations

Zhang, Jie, et al. “Enhancing Coagulation Prediction in Water Treatment Using a Similarity Score–Based Piecewise Machine Learning Model.” Journal of Environmental Engineering, vol. 151, no. 6, June 2025, https://doi.org/10.1061/joeedu.eeeng-7969. Accessed 24 Apr. 2025.