"Datawork is deeply human," asserts Adriana Alvarado, Staff Research Scientist at IBM, a statement that cuts to the often-overlooked core of artificial intelligence development. Her presentation, "LLM + Data: Building AI with Real & Synthetic Data," examined the relationship between Large Language Models (LLMs) and the data that underpins them. Alvarado underscored that while AI's capabilities continue to evolve rapidly, the quality and characteristics of the underlying data remain paramount, and that working with that data demands a human-centered approach that technical jargon often obscures.
Every AI model, from the simplest algorithm to the most sophisticated LLM, begins and ends with data. The choices made during data collection, curation, and preparation are therefore not merely technical steps but decisions that shape an AI system's performance, fairness, and utility. The rapid ascent of LLMs has only amplified this dependency: data is the engine behind chatbots, generative AI, and many other emerging technologies.
Alvarado uses the term "datawork" for the entire lifecycle of data, from initial collection and annotation to eventual deployment and iterative refinement. The concept highlights the continuous, often painstaking, day-to-day effort involved in producing, managing, and using data, an effort she describes as "deeply human." For founders and VCs investing in AI, understanding datawork means recognizing that the raw material of AI is not a static resource but a dynamic, human-shaped construct. Despite its value, datawork is frequently overlooked, undervalued, and sometimes treated as invisible within the broader AI development landscape. Yet it is precisely these human decisions, which often involve complex social, ethical, and technical considerations, that determine how AI systems function and perform in the real world. "Every single decision that they make has a downstream effect on model performance," Alvarado says of the practitioners who do this work, underscoring the direct link between human choices in data handling and the operational integrity of AI products and services.
The human element in datawork also introduces significant challenges, particularly around bias. When practitioners decide how to categorize or label data for a dataset, they are implicitly deciding "who gets to be represented and who doesn't." This selection process, shaped by the perspectives and biases of the data curators themselves, is rarely neutral. Consequently, Alvarado points out, "most of the datasets used to train AI systems currently do not represent the world equally." These datasets tend to over-represent certain regions, languages, or cultural perspectives while under-representing others, creating substantial gaps in how models respond to diverse questions or interact with varied user groups. For startup leaders and tech insiders, this means that even the most technically advanced LLM, if trained on skewed data, risks perpetuating existing societal inequalities, producing AI systems that are less equitable, less effective, and potentially harmful when deployed in sensitive applications or global markets. Addressing this requires not just technical fixes but a conscious, human-centered approach to data curation that prioritizes fairness and inclusivity from the outset.
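
For teams that want a first, rough look at this kind of skew, one simple check is to compare each group's share of a dataset against a target share. The sketch below is illustrative only and is not a method from Alvarado's talk: the function name `representation_gaps`, the language tags, and the target shares are all hypothetical, and choosing the target distribution is itself one of the human decisions the talk describes.

```python
from collections import Counter


def representation_gaps(labels, reference_shares):
    """Compare each group's share of a dataset against a reference share.

    Returns {group: dataset_share - reference_share}. Positive values mean
    the group is over-represented relative to the reference; negative
    values mean it is under-represented.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    groups = set(counts) | set(reference_shares)
    return {
        g: (counts.get(g, 0) / total if total else 0.0)
        - reference_shares.get(g, 0.0)
        for g in groups
    }


# Hypothetical toy data: one language tag per record.
sample_labels = ["en", "en", "en", "en", "en", "en", "es", "hi"]
# Hypothetical target shares picked by the curators, not real statistics.
target_shares = {"en": 0.5, "es": 0.25, "hi": 0.25}

gaps = representation_gaps(sample_labels, target_shares)
for group in sorted(gaps, key=gaps.get):
    print(f"{group}: {gaps[group]:+.3f}")
```

A check like this only surfaces imbalance along whichever attribute has already been labeled. It says nothing about groups that were never tagged, which is why it complements rather than replaces the curation judgment described above.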
