
A couple of weeks ago, Lin et al. (2023) unveiled GeoGalactica, a 30B-parameter geoscience fine-tuned version of Meta AI's open-source Galactica Large Language Model (LLM).
This was part of the Deep-time Digital Earth (DDE) initiative funded by NSF China.
They scraped over 6 million geoscience articles from the Internet (65B tokens) for further pre-training, then fine-tuned the model with 1 million geoscience question-answer pairs.
On the 'GeoBench' benchmark, GeoGalactica reportedly outperformed ChatGPT on geoscience questions, as well as previous Llama-based geoscience LLMs such as K2. For open-ended tasks evaluated by senior geoscientists, GeoGalactica was reported to be competitive with other models, though ChatGPT remained top in some tasks.