Dalmo Cirne, Workday, Inc., USA
Pierce Buckner-Wolfson, Wesleyan University, USA
Protecting data privacy is a critical responsibility for application developers. Without data, however, entire categories of products become impossible to build. This paper proposes an innovative method for training Machine Learning (ML) models using only embeddings (derived data). Embeddings represent the original data as multidimensional vectors and, as such, can be plotted and clustered in a high-dimensional space, enabling effective solutions to problems such as anomaly detection, search, sentiment analysis, recommendation, and graph prediction without requiring access to the raw data. Because it relies only on derived representations of the data and their clustering patterns, this approach preserves privacy while allowing responsible application development. Many machine learning algorithms, such as Neural Networks, Gradient Boosting, and K-Nearest Neighbors, are well suited to operating on embeddings and provide cost-effective, computationally efficient alternatives to Large Language Models (LLMs). The proposed approach balances data-driven innovation with compliance with strict privacy requirements, opening the way to the development of powerful applications.
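As an illustration of the general idea, the following minimal sketch trains a K-Nearest Neighbors classifier on embedding vectors alone, without ever touching the raw records they were derived from. It is not the authors' implementation: the three-dimensional vectors, the sentiment labels, and the function names `euclidean` and `knn_predict` are hypothetical and chosen only for illustration. In practice the embeddings would come from an upstream embedding model and would have far more dimensions.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_embeddings, train_labels, query, k=3):
    """Label a query embedding by majority vote among its k nearest neighbors."""
    ranked = sorted(
        zip(train_embeddings, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical embeddings (assumption): the model only ever sees these
# derived vectors and their labels, never the original text or records.
train_embeddings = [
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.10],
    [0.85, 0.15, 0.05],
    [0.10, 0.90, 0.20],
    [0.20, 0.80, 0.10],
    [0.15, 0.85, 0.15],
]
train_labels = ["positive"] * 3 + ["negative"] * 3

# Classify a new embedding by its position relative to the known clusters.
prediction = knn_predict(train_embeddings, train_labels, [0.82, 0.18, 0.05], k=3)
print(prediction)
```

The classifier depends only on distances between vectors, which is why the clustering structure of the embeddings is sufficient for prediction in this sketch.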