From LLM Agents to Scientific Knowledge Graphs

The current generation of LLM-based research agents, while adept at orchestration, has largely failed to capitalize on the structured nature of scientific knowledge. Existing approaches often distill papers into superficial elements such as abstracts and citation links, missing the granular details (entities, claims, evidence, mechanisms, and method lineages) that are crucial for robust scientific reasoning. This oversight is a significant bottleneck in advancing AI's capability for scientific discovery.

Visual TL;DR. LLM Agents Limited leads to Bottleneck in Discovery. Agents-K1 Pipeline addresses Bottleneck in Discovery. Agents-K1 Pipeline uses Multimodal Parser. Multimodal Parser creates Agent-Native KGs. Agent-Native KGs enable Deeper Scientific Reasoning. Deeper Scientific Reasoning leads to Advance Scientific Discovery.

LLM Agents Limited: current LLM agents focus on abstracts, missing granular scientific details
Bottleneck in Discovery: oversight limits AI's capability for robust scientific discovery and reasoning
Agents-K1 Pipeline: end-to-end pipeline transforms raw scientific documents into knowledge graphs
Multimodal Parser: captures entities, evidence, citations, and typed relations across full papers
Agent-Native KGs: structured scientific knowledge graphs designed for LLM research agents
Deeper Scientific Reasoning: enables more robust and granular scientific reasoning by LLM agents
Advance Scientific Discovery: unlocks new potential for AI-driven scientific breakthroughs and insights


Beyond Abstracts: A Multimodal Knowledge Extraction Pipeline

To address this gap, the researchers introduce Agents-K1, an end-to-end pipeline that transforms raw scientific documents into agent-native scientific knowledge graphs. Unlike prior methods, Agents-K1 employs a multimodal parser with a five-module schema that captures entities, multimodal evidence, citations, and typed inter-entity relations across the entire paper, not just its abstract. The pipeline is powered by a 4B-parameter information-extraction backbone trained using GRPO (Group Relative Policy Optimization) with a rule-based reward mechanism, which is intended to keep the extracted knowledge faithful to the source.
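
To make the training idea concrete, here is a minimal Python sketch of how a rule-based reward can feed GRPO's group-relative advantage. It is an illustration under stated assumptions, not the authors' implementation: the summary does not say which rules Agents-K1 uses, so the JSON keys, the entity-level F1 score, and the function names below are all hypothetical. Only the normalization step (each sampled output's reward minus the group mean, divided by the group standard deviation) reflects GRPO as it is generally described.

```python
import json
import statistics

# Hypothetical output keys for illustration; the real five-module schema is not given here.
REQUIRED_KEYS = {"entities", "relations"}


def rule_based_reward(output_text: str, gold_entities: set[str]) -> float:
    """Hypothetical rule: 0.0 for malformed output, otherwise entity-level F1."""
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict) or not REQUIRED_KEYS.issubset(parsed):
        return 0.0
    entities = parsed["entities"]
    if not isinstance(entities, list):
        return 0.0
    predicted = {e for e in entities if isinstance(e, str)}
    true_positives = len(predicted & gold_entities)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold_entities)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the other samples drawn for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; a simplifying choice for this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Three sampled outputs for one (made-up) input document.
gold = {"GRPO", "Scholar-KG"}
samples = [
    '{"entities": ["GRPO", "Scholar-KG"], "relations": []}',
    '{"entities": ["GRPO"], "relations": []}',
    "not json",
]
rewards = [rule_based_reward(s, gold) for s in samples]
advantages = group_relative_advantages(rewards)
```

Because the advantage is computed relative to the group, a well-formed but incomplete extraction is still ranked above a malformed one and below a complete one, without needing a separately trained reward or value model.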

Scholar-KG: Scaling Scientific Knowledge Representation

The practical output of this pipeline is Scholar-KG, a scientific knowledge graph built by processing 2.46 million scientific papers across six subject areas. A subset of one million papers is being released, with the full dataset accessible via SCP. The Agents-K1 pipeline is not limited to this corpus; it can be extended to general-domain corpora and used for schema-conformant data synthesis. According to the reported experiments, Agents-K1 achieves superior performance in scientific information extraction, knowledge graph construction, and multi-hop scientific reasoning, a notable advance in how AI can interact with and reason over scientific literature.
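
As a rough illustration of what multi-hop reasoning over typed relations can look like, the Python sketch below stores a few relations and finds the chains of edges that connect two entities. Everything in it is an assumption made for the example: the `Relation` and `TinyKG` names, the relation types, and the entity and paper identifiers are invented, and none of it reflects the actual Scholar-KG schema or access interface.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    head: str
    relation_type: str
    tail: str
    paper_id: str  # provenance: which paper the relation was extracted from


class TinyKG:
    """Toy in-memory graph of typed relations (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._outgoing: dict[str, list[Relation]] = defaultdict(list)

    def add(self, relation: Relation) -> None:
        self._outgoing[relation.head].append(relation)

    def multi_hop_paths(self, start: str, goal: str, max_hops: int = 3) -> list[list[Relation]]:
        """Breadth-first search returning every cycle-free path of at most max_hops edges."""
        paths: list[list[Relation]] = []
        queue = deque([(start, [])])
        while queue:
            node, path = queue.popleft()
            if node == goal and path:
                paths.append(path)
                continue
            if len(path) == max_hops:
                continue
            visited = {start} | {r.tail for r in path}
            for rel in self._outgoing.get(node, []):
                if rel.tail not in visited:
                    queue.append((rel.tail, path + [rel]))
        return paths


# Made-up entities and papers, purely to exercise the traversal.
kg = TinyKG()
kg.add(Relation("method_A", "improves_on", "method_B", "paper-1"))
kg.add(Relation("method_B", "evaluated_on", "dataset_X", "paper-2"))
kg.add(Relation("method_B", "improves_on", "method_A", "paper-3"))  # cycle, should be skipped

paths = kg.multi_hop_paths("method_A", "dataset_X")
```

Each returned path carries its relation types and source papers, so an agent can cite the evidence behind every hop rather than only the final answer.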
