Spatialising word vectors

In areas of sparse data, patterns in text may be a helpful geoscience screening tool. One technique is to build a text embedding model that lets you compare the vectors of target geological concepts with the vectors of location names.

Disambiguation here is vitally important. The prototype example shown compares the vector for ‘Monzonite’ with the vectors of over half a million geographical locations. The lighter the colour, the greater the similarity. A cutoff of 0.2 is applied to the cosine similarity to screen out likely noise.
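As a minimal sketch of that screening step, assuming the concept and location vectors are already available as plain lists of floats: the toy vectors and the names `cosine_similarity`, `screen_locations` and `CUTOFF` are illustrative and not taken from the prototype, and treating the cutoff as inclusive is also an assumption.

```python
import math

CUTOFF = 0.2  # cosine-similarity threshold quoted in the post


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as having no similarity
    return dot / (norm_a * norm_b)


def screen_locations(concept_vec, location_vecs, cutoff=CUTOFF):
    """Return (name, similarity) pairs at or above the cutoff, highest first."""
    scored = [
        (name, cosine_similarity(concept_vec, vec))
        for name, vec in location_vecs.items()
    ]
    kept = [(name, score) for name, score in scored if score >= cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors, made up purely for illustration.
monzonite = [1.0, 0.0, 0.0]
locations = {
    "Location A": [1.0, 1.0, 0.0],  # partly aligned with the concept
    "Location B": [0.0, 1.0, 0.0],  # orthogonal, so screened out
    "Location C": [2.0, 0.0, 0.0],  # same direction, different length
}
hits = screen_locations(monzonite, locations)
```

Because cosine similarity ignores vector length, "Location C" scores as a perfect match despite having a different magnitude. With half a million locations a vectorised library would likely be a better fit than pure Python loops.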

Monzonite is a coarse-grained igneous plutonic rock. It can host hydrothermal ore deposits such as porphyry copper. As we are dealing with vectors, it’s possible for a location to show similarity without mentioning the concept in question, because it may share many of the same associations. This could be useful for discovery.
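One way to surface those indirect matches is to split the screened locations by whether their source text actually names the concept. The sketch below is hypothetical: `split_by_mention`, the scores and the text snippets are all invented for illustration, and a plain substring check stands in for whatever matching the real data would need.

```python
def split_by_mention(hits, texts, term):
    """Partition screened locations by whether their source text names the term."""
    mentioned, unmentioned = [], []
    for name, score in hits:
        text = texts.get(name, "")
        if term.lower() in text.lower():
            mentioned.append((name, score))
        else:
            unmentioned.append((name, score))
    return mentioned, unmentioned


# Made-up similarity scores and text snippets, for illustration only.
hits = [("Location A", 0.41), ("Location B", 0.27)]
texts = {
    "Location A": "Monzonite intrusion with quartz veining.",
    "Location B": "Porphyry stock cut by altered dykes.",
}
explicit, implicit = split_by_mention(hits, texts, "monzonite")
```

Here `implicit` holds the locations that pass the cutoff through shared associations alone, which are the ones most likely to need a sceptical second look.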

There will obviously be biases from a number of perspectives. For example, because the data comes from the USGS, one sees what are likely artificial truncations of data points at the borders. Other data artefacts and biases are discussed in my recent posts, and as always we need to be sceptical where we see something that does not fit with our understanding (whilst keeping an open mind). More work to do here, especially on disambiguation, but thought I’d share.

Again, I don’t have a specific research question in mind, just exploring the data and possibilities.

#geology #geoscience #earthscience #artificialintelligence #ai #datascience #machinelearning #unstructureddata
