IBM-NASA INDUS Large Language Models

NASA and IBM have released INDUS, a suite of large language models for Earth science and related scientific domains. The INDUS encoders were trained on a corpus of 60 billion tokens encompassing astrophysics, planetary science, Earth science, heliophysics, biological and physical sciences data. According to IBM and NASA, the models are freely available on Hugging Face, and benchmark datasets will also be released.

Under the hood, from an Earth science (geoscience) perspective, it looks to have been trained on two main open-access sources in this discipline: metadata from NASA's own Earthdata (the Common Metadata Repository, CMR) and publications from the American Geophysical Union (AGU).

It appears a RoBERTa base has been used and fine-tuned with cross-disciplinary literature and question-and-answer pairs; the researchers have focused on scientific cross-disciplinary content for the embeddings. These models could therefore provide better results for organisations using Named Entity Recognition (NER) and Retrieval Augmented Generation (RAG) techniques on scientific content in these areas, as well as embeddings for scientific discovery, as sketched below. It will be interesting to learn more, and it is an area for further research and testing.
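As a rough illustration of how such an encoder might slot into a RAG or scientific-discovery pipeline, the sketch below generates sentence embeddings with the standard Hugging Face transformers API. The model ID is a placeholder (substitute the actual INDUS encoder ID from Hugging Face), and the mean-pooling step is a common convention rather than something specified in the paper.

```python
# Minimal sketch: embedding scientific text with a Hugging Face encoder.
# MODEL_ID is a placeholder, not the real INDUS model identifier.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "path/to/indus-encoder"  # hypothetical; replace with the published INDUS model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

texts = [
    "MODIS-derived sea surface temperature anomalies in the North Atlantic",
    "Heliophysics observations of coronal mass ejections",
]

# Tokenise and run the encoder, then mean-pool the token embeddings into
# one vector per text; these vectors can be indexed for RAG-style retrieval.
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_size), e.g. 768 for a RoBERTa-base-sized encoder
```

The same embeddings could feed a vector store for retrieval or a downstream NER fine-tuning step; the point is simply that a domain-trained encoder drops into existing Hugging Face tooling with no special handling.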

Paper here: https://arxiv.org/pdf/2405.10725
