AI Tools Extract Buried Experimental Data from Scientific Papers to Accelerate Materials Discovery

January 9th, 2026 3:00 AM
By: Newsworthy Staff

Researchers have developed AI-powered tools that automatically extract experimental data from scientific papers, accelerating the creation of materials property databases that could transform materials science by enabling data-driven discovery and prediction.

Materials scientists developing technologies from smartphones to automobiles face a hard prediction problem: slight differences in composition can yield entirely different material properties. Machine learning could supply a kind of computational intuition for that problem, but it requires large-scale experimental datasets, and most of that data remains buried within millions of published papers. A team led by Dr. Yukari Katsura at the National Institute for Materials Science has developed two artificial intelligence tools that accelerate the construction of materials property databases by extracting structured data from the scientific literature.

The research, published in Science and Technology of Advanced Materials: Methods, addresses a fundamental bottleneck in materials informatics. "Graphs in the millions of papers published to date contain valuable experimental data collected by past researchers, and much of it remains untapped," explained Katsura. Her Starrydata project, launched in 2015, previously relied on manual data collection supported by the Starrydata2 web system. The new tools leverage large language models like ChatGPT to automate extraction of information about figures, tables, and samples from paper PDFs across diverse materials fields.

The first tool, Starrydata Auto-Suggestion for Sample Information, is already integrated into the Starrydata2 system. When a user pastes text from a paper's abstract or methods section, the tool sends that text to OpenAI's GPT via the API and automatically displays candidate entries for pre-designed data fields specific to each materials domain. The second tool, Starrydata Auto-Summary GPT, deconstructs entire open-access paper PDFs and summarizes all descriptions of figures, tables, and samples as structured JSON data using ChatGPT's custom GPT feature. This output can be viewed as easy-to-read tables in a web browser, which dramatically speeds up data collectors' work of locating and entering target information.
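To make the "structured JSON viewed as tables" step concrete, here is a minimal Python sketch. It is an illustration only, not the Starrydata implementation: the JSON layout, the field names (`figures`, `tables`, `samples`, `x_axis`, and so on), the sample entries, and the helper functions `cell_text` and `render_table` are all assumptions made for this example. It makes no API calls; it starts from a summary that has already been produced.

```python
import json

# Hypothetical summary of one paper. The schema and the sample values are
# made up for illustration; they are not taken from Starrydata or any paper.
SUMMARY_JSON = """
{
  "figures": [
    {"id": "Fig. 1", "x_axis": "Temperature (K)",
     "y_axis": "Seebeck coefficient (uV/K)", "samples": ["S1", "S2"]}
  ],
  "tables": [
    {"id": "Table 1", "description": "Nominal compositions",
     "samples": ["S1", "S2"]}
  ],
  "samples": [
    {"id": "S1", "composition": "Bi2Te3", "synthesis": "spark plasma sintering"},
    {"id": "S2", "composition": "Bi2Te2.7Se0.3", "synthesis": "hot pressing"}
  ]
}
"""


def cell_text(value):
    """Convert one JSON value into the text shown in a table cell."""
    if isinstance(value, list):
        return ", ".join(str(v) for v in value)
    return "" if value is None else str(value)


def render_table(rows, columns):
    """Render a list of dicts as a plain-text table with aligned columns."""
    body = [[cell_text(row.get(col)) for col in columns] for row in rows]
    # Each column is as wide as its header or its longest cell.
    widths = [
        max([len(col)] + [len(r[i]) for r in body])
        for i, col in enumerate(columns)
    ]

    def fmt(cells):
        return " | ".join(c.ljust(w) for c, w in zip(cells, widths)).rstrip()

    lines = [fmt(columns), "-+-".join("-" * w for w in widths)]
    lines.extend(fmt(r) for r in body)
    return "\n".join(lines)


if __name__ == "__main__":
    summary = json.loads(SUMMARY_JSON)
    layout = [
        ("figures", ["id", "x_axis", "y_axis", "samples"]),
        ("tables", ["id", "description", "samples"]),
        ("samples", ["id", "composition", "synthesis"]),
    ]
    for section, columns in layout:
        print(section.upper())
        print(render_table(summary[section], columns))
        print()
```

Keeping figures, tables, and samples as separate lists that share sample identifiers is one plausible way to let a data collector see at a glance which figure reports which sample; the actual tool may organize its output differently.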

"A paper is a logical structure assembled to convey the author's claims, but by deconstructing it and returning it to the form of experimental data, other researchers can also use it for their own research," said Katsura. The approach specifically targets open-access papers due to publisher restrictions on AI use with PDFs. While LLMs cannot yet read data points from graph images—a task handled by data collectors using semi-automated tools—the text extraction capabilities represent significant progress toward automated data collection.

The implications extend beyond efficiency gains. Large-scale datasets built this way could give researchers a bird's-eye view of experimental data as a source of inspiration, and could let machine learning models predict properties from empirical trends. So far, Starrydata has built databases mainly for specific fields such as thermoelectric materials and magnets, and as an open dataset for new materials development it is beginning to be used by leading researchers worldwide. The team aims to establish paper data collection as a recognized form of research within the scientific community, moving toward a future where experimental data from all materials science fields can be shared digitally and viewed comprehensively.

This development matters because it addresses a critical gap in materials science infrastructure. Functional materials underpin modern technologies, but their development has been hampered by fragmented, inaccessible experimental data. By automating extraction of buried data from scientific literature, these tools could accelerate materials discovery cycles, reduce reliance on researcher intuition alone, and enable more systematic, data-driven approaches to materials design. The potential for cross-pollination between previously isolated research domains could unlock novel material combinations and properties, ultimately speeding innovation in everything from energy technologies to electronic devices.

Source Statement

This news article relied primarily on a press release distributed by NewMediaWire.