
Large language models accelerate construction of materials property databases.
TSUKUBA, Japan, Jan 8, 2026 - (ACN Newswire) - Technologies that underpin modern society, such as smartphones and automobiles, rely on a diverse range of functional materials. Materials scientists are therefore working to develop and improve new materials, but predicting material properties is no simple task. Data science is key to transforming this field, and new tools powered by artificial intelligence are expected to accelerate the exploration, collection, and management of materials property data worldwide.
Researchers and artificial intelligence work together to collect experimental materials science data from papers worldwide and build a database. (Copyright: Kenji Tashiro. Instagram: ripplemarkmaker. CC-BY-4.0)
The relationship between functional materials and their properties is complex. Even slight differences in composition or synthesis methods can affect electronic states and microstructures, often resulting in entirely different properties. For this reason, theoretical models alone cannot provide reliable predictions, and the intuition of researchers and engineers built on years of experience has played a significant role.
Machine learning is a technology that can learn empirical trends rather than relying on theory. By applying machine learning to experimental data in materials science, it may be possible to replicate such intuition computationally. Large language models (LLMs), such as ChatGPT, now support the daily lives of many people and are capable of flexible information extraction that takes background knowledge and context into account. This opens up the possibility of automating the process of converting complex information sources like scientific papers into structured data. If large-scale datasets of experimental data can be built through this approach, researchers are expected to gain inspiration from a bird's-eye view of the data and to use machine learning to predict properties from empirical trends.
A team led by Dr. Yukari Katsura, a Senior Researcher at the National Institute for Materials Science (NIMS), has focused on this potential and developed two new tools to accelerate the construction of Starrydata, a materials property database built from data collected from scientific papers. This work was recently published in the journal Science and Technology of Advanced Materials: Methods.
"Graphs in the millions of papers published to date contain valuable experimental data collected by past researchers, and much of it remains untapped," says Dr. Katsura. In the Starrydata project, which she launched in 2015, data collection from papers was performed manually and supported by the independently developed Starrydata2 web system, successfully amassing an unprecedented volume of experimental data. The new tools are designed to further streamline this data collection process. "We found that by specifying a data structure and giving instructions to an LLM, we can accurately and comprehensively extract information about figures, tables, and samples from the text of paper PDFs across a wide range of fields."
Dr. Katsura added, "Many publishers prohibit the use of artificial intelligence on paper PDFs, so we are currently developing the system to target open-access papers."
The first tool, Starrydata Auto-Suggestion for Sample Information, is a function that reads the text of a paper and suggests candidate entries for data fields pre-designed for each materials domain; it is already integrated into the Starrydata2 web system. When a user pastes text from a paper's abstract or experimental methods section, it is sent to OpenAI's GPT via API, and candidate entries in English are automatically displayed below each input field.
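The flow described above (pre-designed fields, pasted text, one request to an LLM, candidates shown per field) can be sketched roughly as follows. This is a minimal illustration, not the Starrydata2 implementation: the field names, the prompt wording, and the `ask_llm` callable are all assumptions, and the stand-in reply contains illustrative values only. The LLM call is passed in as a function so the sketch runs without any particular API client.

```python
import json
from typing import Callable, Dict, List

# Hypothetical field definitions for one materials domain.
# These names are assumptions for illustration, not Starrydata2's actual fields.
THERMOELECTRIC_FIELDS: Dict[str, str] = {
    "composition": "Chemical composition of the sample",
    "synthesis_method": "How the sample was synthesized",
    "sample_form": "Form of the sample, e.g. bulk, thin film, single crystal",
}


def build_prompt(paper_text: str, fields: Dict[str, str]) -> str:
    """Combine the field definitions and the pasted text into one instruction."""
    field_lines = "\n".join(f"- {name}: {desc}" for name, desc in fields.items())
    return (
        "Read the excerpt from a materials science paper and suggest candidate "
        "entries in English for each field below.\n"
        "Reply with one JSON object whose keys are the field names and whose "
        "values are lists of strings. Use an empty list when nothing applies.\n\n"
        f"Fields:\n{field_lines}\n\nExcerpt:\n{paper_text}"
    )


def suggest_entries(
    paper_text: str,
    fields: Dict[str, str],
    ask_llm: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Return candidate entries per field, tolerating malformed LLM replies."""
    raw = ask_llm(build_prompt(paper_text, fields))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}

    suggestions: Dict[str, List[str]] = {}
    for name in fields:  # keys the schema does not define are ignored
        value = parsed.get(name, [])
        if isinstance(value, str):  # accept a bare string as a single candidate
            value = [value]
        if not isinstance(value, list):
            value = []
        suggestions[name] = [str(v) for v in value]
    return suggestions


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a fixed, illustrative reply."""
    return json.dumps(
        {
            "composition": ["Bi2Te3"],
            "synthesis_method": "spark plasma sintering",
            "unrelated_key": ["ignored"],
        }
    )


if __name__ == "__main__":
    result = suggest_entries("pasted abstract text", THERMOELECTRIC_FIELDS, fake_llm)
    for field, candidates in result.items():
        print(field, "->", candidates)
```

Keeping the suggestions as candidates per field, rather than writing them straight into the record, mirrors the description above: the user sees them below each input field and decides what to enter.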
The second tool, Starrydata Auto-Summary GPT, deconstructs an entire open-access paper PDF uploaded by the user and automatically summarizes all descriptions of figures, tables, and samples appearing in the paper as structured data in JSON format. The JSON output is generated using ChatGPT's custom GPT feature, and the resulting data can be viewed as an easy-to-read table in a web browser. Although this data is not currently incorporated directly into the Starrydata database, it dramatically accelerates the work of data collectors by helping them quickly locate target data and enter information. Note that reading data points from graph images is difficult for LLMs, so this task is performed by data collectors using an independently developed semi-automated tool.
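To make the idea of a structured JSON summary concrete, the sketch below shows one possible shape for such a summary and flattens it into a plain-text table. The key names (`figures`, `tables`, `samples`, `id`, `description`) and the example entries are assumptions chosen for illustration; the actual schema produced by Starrydata Auto-Summary GPT is not given in this release.

```python
import json
from typing import Dict, List

# Illustrative summary only: the structure and values are assumptions,
# not output from Starrydata Auto-Summary GPT.
EXAMPLE_SUMMARY = """
{
  "figures": [
    {"id": "Fig. 1",
     "description": "Temperature dependence of the Seebeck coefficient",
     "samples": ["Sample A", "Sample B"]}
  ],
  "tables": [
    {"id": "Table 1",
     "description": "Nominal compositions and densities",
     "samples": ["Sample A", "Sample B"]}
  ],
  "samples": [
    {"id": "Sample A", "description": "Undoped reference"},
    {"id": "Sample B", "description": "Doped sample"}
  ]
}
"""

HEADERS = ["kind", "id", "description", "samples"]


def summary_to_rows(summary_json: str) -> List[Dict[str, str]]:
    """Flatten the figures/tables/samples sections into uniform rows."""
    summary = json.loads(summary_json)
    rows: List[Dict[str, str]] = []
    for kind in ("figures", "tables", "samples"):
        for item in summary.get(kind, []):
            rows.append(
                {
                    "kind": kind,
                    "id": str(item.get("id", "")),
                    "description": str(item.get("description", "")),
                    # Sample entries have no "samples" list, so this is empty for them.
                    "samples": ", ".join(item.get("samples", [])),
                }
            )
    return rows


def render_table(rows: List[Dict[str, str]]) -> str:
    """Render rows as a fixed-width text table with a header and separator."""
    widths = {h: max([len(h)] + [len(r[h]) for r in rows]) for h in HEADERS}
    lines = [" | ".join(h.ljust(widths[h]) for h in HEADERS)]
    lines.append("-+-".join("-" * widths[h] for h in HEADERS))
    for r in rows:
        lines.append(" | ".join(r[h].ljust(widths[h]) for h in HEADERS))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_table(summary_to_rows(EXAMPLE_SUMMARY)))
```

A flat table like this is the kind of view that lets a data collector scan which figure or table holds the target data for a given sample before digitizing it.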
"A paper is a logical structure assembled to convey the author's claims, but by deconstructing it and returning it to the form of experimental data, other researchers can also use it for their own research," says Dr. Katsura. "In this way, we are aiming for a future where experimental data from all materials science fields can be shared in digital format and viewed from a bird's-eye perspective."
At present, Starrydata databases have been built only for certain materials science fields, such as magnets and thermoelectric materials, which convert between heat and electricity. However, as an open dataset that can be used for new materials development, it is beginning to be utilized primarily by leading researchers around the world. The team is advancing their research with the aim of raising broader awareness of the potential of such large-scale experimental data and establishing paper data collection as a recognized form of research within the scientific community.
Further information
Yukari Katsura
Senior Researcher, National Institute for Materials Science (NIMS)
KATSURA.Yukari@nims.go.jp
(Yukari Katsura is also an associate professor at University of Tsukuba and guest researcher at RIKEN)
Paper: https://doi.org/10.1080/27660400.2025.2590811
About Science and Technology of Advanced Materials: Methods (STAM-M)
STAM Methods is an open access sister journal of Science and Technology of Advanced Materials (STAM), and focuses on emergent methods and tools for improving and/or accelerating materials developments, such as methodology, apparatus, instrumentation, modeling, high-throughput data collection, materials/process informatics, databases, and programming. https://www.tandfonline.com/STAM-M
Dr Kazuya Saito
STAM Methods Publishing Director
SAITO.Kazuya@nims.go.jp
Press release distributed by Asia Research News for Science and Technology of Advanced Materials.
Source: Science and Technology of Advanced Materials
Copyright 2026 ACN Newswire. All rights reserved.
© 2026 JCN Newswire

