Author: phcleverley

TEXT ANALYTICS MEETS GEOSCIENCE

I presented some text analytics work at a recent GeoScienceWorld (GSW) meeting in New Mexico, USA. GSW is a not-for-profit collaboration of geological societies, associations and institutes that disseminates geoscience information. First, some information on the trip, then the analytics!

FIELD TRIP

The geological field trip was to the Santa Fe and Abiquiu areas, approximately 7,000 ft above sea level. To the west, across the Rio Grande rift basin, are the Jemez Mountains (a supervolcano) and the town of Los Alamos (home of the Manhattan Project). To the north are the Colorado Plateau and Ghost Ranch, where over 100 articulated skeletons of the Triassic theropod dinosaur Coelophysis (the state fossil of New Mexico) have been found. These animals would have stood about one metre tall at the hips and reached up to three metres long.

1 Coelophysis

The red cliffs at Chimney Rock contain Triassic deposits overlain unconformably by cross-bedded Jurassic desert sandstones, topped with white limestone and gypsum in places. The beautiful scenery of Chimney Rock can be seen in the photo below:

2 Chimney Rock

The view from the top of Chimney Rock, in the photo I took below, is even more breathtaking.

 3 Top Chimney Rock

ANALYTICS

There has been a continuing shift from pure Information Retrieval (IR) systems – a search box and ten blue links – towards the search for patterns, through what are increasingly called ‘insight engines’ within the cognitive computing paradigm. After all, big data is about small patterns.

All of the work below represents approximately five days’ effort and shows what is possible in a short space of time using some of the techniques available today. I wrote scripts in Python and used open-source utilities; these included some new techniques not published before. For the analytics content, I used the Society of Economic Geologists (SEG) text corpus (1905-2017) as an example, focused on mining, mainly of heavy metals. This consists of over 6,800 articles, 4.3 million lines of text and 35 million words. Several examples of analytics techniques are shown below, increasing in sophistication.

  1. Statistical Word Co-occurrence

Counting the frequency of individual terms and of their adjacent associations produces n-grams. The image below is a graph of nodes (unigrams) and edges (bigram associations) automatically generated from the SEG text corpus of journals.

Click here to view a video showing how the text graph can be explored

Terms with high authority (many links) can clearly be seen, along with rarer terms that have few associative links. This is an easy, visual way to explore text, and it can be linked to queries against the documents or contexts in which those words or associations occur.

4 Graph
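As a minimal sketch, a graph like the one above could be built along the following lines, assuming the SEG articles have been exported to a single plain-text file (seg_corpus.txt is a hypothetical path) and using the open-source networkx library; only the most frequent bigrams are kept so the graph stays readable.

```python
# Minimal sketch: build a unigram/bigram co-occurrence graph from a plain-text corpus.
# "seg_corpus.txt" is a hypothetical export of the articles, one corpus in one file.
import re
from collections import Counter
from itertools import islice

import networkx as nx

with open("seg_corpus.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z]+", f.read().lower())

unigrams = Counter(tokens)                                  # node weights
bigrams = Counter(zip(tokens, islice(tokens, 1, None)))     # adjacent-pair (edge) weights

# Keep only the most frequent associations so the graph stays readable.
G = nx.Graph()
for (w1, w2), count in bigrams.most_common(500):
    G.add_edge(w1, w2, weight=count)

# Export for a graph viewer such as Gephi; node size can be driven by unigram frequency.
nx.write_gexf(G, "seg_cooccurrence.gexf")
```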

These displays can be complemented by word clouds, with the most frequent associations stripped away to reveal the ‘more interesting’. Previous research I performed with geoscientists indicated that the most frequent associations were ‘relevant but not interesting’, so stripping them away may be desirable, with every word hyperlinked so scientists can drill down into the articles and sentences in which the associations are mentioned. The example below is for the search query ‘precambrian’.

5 Word Cloud

Just as the Google n-gram viewer allows someone to explore word usage through time, similar approaches can be taken with journals. The image below shows the trends of some common words in the SEG text corpus over the past twenty years. The y-axis is normalized relative word frequency (compared to total number of words used in that journal in that year).

6 Word Frequency

For example, from the image above it is plain to see that the popularity (frequency of occurrence) of the terms ‘gold’ and ‘hydrothermal’ has increased over the past twenty years, whilst the terms ‘manganese’ and ‘metamorphic’ have decreased. The increase in popularity of ‘gold’ has been theorized as possibly related to the gold price, which has also been plotted on this chart!
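A minimal sketch of the normalisation described above, assuming a hypothetical seg_articles.csv with ‘year’ and ‘text’ columns rather than the actual corpus format; each term count is divided by the total number of words published in that year.

```python
# Minimal sketch: normalised relative word frequency per year, Google n-gram style.
# "seg_articles.csv" (columns: year, text) is an assumed, illustrative input format.
import re
from collections import Counter

import pandas as pd

TERMS = ["gold", "hydrothermal", "manganese", "metamorphic"]

df = pd.read_csv("seg_articles.csv")
rows = []
for year, group in df.groupby("year"):
    tokens = re.findall(r"[a-z]+", " ".join(group["text"]).lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    rows.append({"year": year, **{t: counts[t] / total for t in TERMS}})

trend = pd.DataFrame(rows).set_index("year")
trend.plot()  # requires matplotlib; the y-axis is relative frequency per year
```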

  2. Simple Entity Extraction

Extracting entities from text and associating them with a spatial context and geological time period has been of increasing interest to both academia and practice. The NSF EarthCube GeoDeepDive cyberinfrastructure is one such example, with fascinating findings driven by patterns in text, such as those relating stromatolite distribution to sea water chemistry.

The example below shows the frequency of mentions (histogram in green) of geological periods (including their constituent sub-divisions) in SEG journal articles. A knowledge representation (taxonomy) has been applied to the text in order to surface a pattern. This would appear to support the proposition that, over more than a hundred years, the focus for mining geologists has been the Precambrian and Tertiary (Paleogene and Neogene) periods, denoted by ‘PC’ and ‘T’ respectively on the y-axis. The Silurian period appears to have been of least interest.

7 Entity Extraction

Plotting the worldwide distribution of copper ore by geological age (orange line) as a form of ‘control’ supports the theory that patterns in journal text may surface ‘real’ trends and phenomena of interest.
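A minimal sketch of the taxonomy-driven counting, with a deliberately tiny, illustrative lookup table in which sub-divisions roll up to their parent period; the real taxonomy would of course be far more complete.

```python
# Minimal sketch: count mentions of geological periods using a small taxonomy lookup.
# The taxonomy below is illustrative only; sub-divisions roll up to their parent period.
import re
from collections import Counter

TAXONOMY = {
    "precambrian": "Precambrian", "archean": "Precambrian", "proterozoic": "Precambrian",
    "cambrian": "Cambrian", "silurian": "Silurian", "devonian": "Devonian",
    "tertiary": "Tertiary", "paleogene": "Tertiary", "neogene": "Tertiary", "miocene": "Tertiary",
}
pattern = re.compile(r"\b(" + "|".join(TAXONOMY) + r")\b")

def period_mentions(text: str) -> Counter:
    """Return a histogram of geological-period mentions in one article."""
    return Counter(TAXONOMY[m] for m in pattern.findall(text.lower()))

print(period_mentions("The Neogene cover overlies Archean basement."))
# Counter({'Tertiary': 1, 'Precambrian': 1})
```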

  3. Numerical Data Extraction

Another relatively common technique is to extract numerical integer and float data. The chart below shows the results of automatically extracting integer and float data in association with the mnemonic ‘ppm’ (parts per million), plotted where they can be associated with a chemical element. The ppm data is on the y-axis (logarithmic), with mentions on the x-axis (1,709 were found in total). This could be turned into a hyper-linkable user interface, taking the user to the sentence/paragraph in question for each data point. This type of extraction is quite trivial, yet it is potentially under-used by organizations, even though much of these data are not stored in structured databases.

8 PPM extraction
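A minimal sketch of the ‘ppm’ extraction, using a truncated element list and a naive pairing of value and element within the same sentence (nearest-mention logic and unit handling would refine it).

```python
# Minimal sketch: extract numeric values quoted in "ppm" and pair them with a chemical
# element mentioned in the same sentence. Element list truncated; pairing is naive.
import re

ELEMENTS = ["gold", "silver", "copper", "arsenic", "lead", "zinc", "uranium"]
PPM = re.compile(r"(\d+(?:\.\d+)?)\s*ppm", re.IGNORECASE)

def ppm_values(sentence):
    """Yield (element, value) pairs, e.g. ('gold', 3.5), from one sentence."""
    lowered = sentence.lower()
    elements = [e for e in ELEMENTS if e in lowered]
    for value in PPM.findall(sentence):
        for element in elements:
            yield element, float(value)

print(list(ppm_values("Samples averaged 3.5 ppm gold.")))  # [('gold', 3.5)]
```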

Another common technique is entity-entity matrices, showing how commonly two entities occur together in the same sentence or equivalent semantic text unit. The example below shows lithologies and minerals for the SEG corpus.

9 EAM

The associations are clustered using least squares to group similar lithology and mineral associations. You may just be able to pick out ‘Diamond’ on the middle right and its strongest associations with Breccia and Conglomerate. These displays may reveal surprising associations worthy of further exploration and are used extensively in biomedical research for tasks such as gene discovery.
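A minimal sketch of building such an entity-entity matrix from sentence-level co-occurrence, using small illustrative lookup lists; the clustering of rows and columns seen in the figure could then be added with, for example, scipy’s hierarchical clustering.

```python
# Minimal sketch: lithology x mineral sentence co-occurrence matrix.
# Lookup lists are illustrative; row/column clustering (as in the figure) could be added
# with scipy.cluster.hierarchy or seaborn.clustermap.
import numpy as np

LITHOLOGIES = ["breccia", "conglomerate", "sandstone", "shale", "granite"]
MINERALS = ["diamond", "pyrite", "chalcopyrite", "galena", "quartz"]

def cooccurrence_matrix(sentences):
    """Count how often each lithology and each mineral appear in the same sentence."""
    matrix = np.zeros((len(LITHOLOGIES), len(MINERALS)), dtype=int)
    for sentence in sentences:
        s = sentence.lower()
        for i, lith in enumerate(LITHOLOGIES):
            for j, mineral in enumerate(MINERALS):
                if lith in s and mineral in s:
                    matrix[i, j] += 1
    return matrix
```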

  4. Geo-Sentiment

Looking at individual geological formation names as they appear in text, it may be possible to derive the ‘sentiment’ and ‘subjectivity’ associated with a formation. Using Part of Speech (POS) tagging, nouns that occur before the word ‘Formation’ or ‘Fm’, for example, can be extracted.

The cross-plot below shows some Formation names that appear around the search query term ‘leaching’. Polarity is on the x-axis, denoting whether the Formation is perceived in a negative (-1) or a positive (+1) light. This is achieved by analysing the words (using Bayesian statistical algorithms) that co-occur with the geological Formation mentions in text. Simplifying, terms such as ‘good’, ‘surprising’ and ‘abundance’ are deemed ‘positive’, whereas terms such as ‘poor’, ‘error’ and ‘problem’ are deemed ‘negative’. On the y-axis is subjectivity, from objective (0) to subjective (1); terms such as ‘strongly suggest’ and ‘by far’ are indicative of subjective views. Standard sentiment algorithms cannot be used with accuracy on geoscience content, as everyday terms such as ‘old’, ‘fault’ and ‘thick’, which can denote negative views in social media, are not negative in a geoscience sense!

10 sentiment

From these data, the Citronelle Formation appears in a negative light, which may stimulate a scientist to investigate the sentences (context) and may lead to a learning event. Conversely, the high ‘subjectivity’ of the Popovich Formation may also trigger curiosity to understand the context, which may lead to a re-interpretation.
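A minimal sketch of the formation-level scoring, using TextBlob’s default polarity/subjectivity analyzer as a stand-in for the Bayesian approach described above; as noted, an off-the-shelf lexicon will mis-read geoscience terms such as ‘fault’ or ‘old’, so a domain-adjusted lexicon would be needed in practice.

```python
# Minimal sketch: polarity/subjectivity for sentences that mention a named Formation.
# TextBlob's default analyzer stands in for the approach described in the text; a
# geoscience-adjusted lexicon would be needed for credible results.
import re
from collections import defaultdict

from textblob import TextBlob

FORMATION = re.compile(r"\b([A-Z][a-z]+)\s+(?:Formation|Fm)\b")

def formation_sentiment(sentences):
    """Return {formation_name: [(polarity, subjectivity), ...]} for averaging/plotting."""
    scores = defaultdict(list)
    for sentence in sentences:
        blob = TextBlob(sentence)
        for name in FORMATION.findall(sentence):
            scores[name].append((blob.sentiment.polarity, blob.sentiment.subjectivity))
    return scores
```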

  5. Automatic Geo-coding

Geo-referencing journal articles is not a new technique. However, in many cases it is the entire journal article (or just the images within the article) that is referenced; in essence, a summary of ‘aboutness’. The map below shows ‘mentions’ of the search query term concept ‘precambrian’ in the full text (body text) of all articles in the SEG corpus where they can be automatically geo-located. Compared to techniques that use only the keywords and/or abstract of the journal article (the ‘information container’), there is a 200% increase in geo-locations. Clicking on the locations to show the ‘mentions’ – the sentences or paragraphs in which the query term concept occurs – may yield insights that simply geo-locating entire journal articles cannot.

11 Geocoding

Instead of colour-coding the frequency of occurrence, bubble plots can be used, where the size of the bubble relates to the frequency of mention. This can also be combined with data external to the text corpus. An example is shown below, integrating the surface geology of the United States using US Geological Survey (USGS) WMS GIS spatial data.

12 Geocoding USGS

More granular geo-coding is simply a case of adding in more specific lookup lists for latitude and longitude of any entity.

13 Geocoding Mine
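A minimal sketch of gazetteer-based geo-coding of query mentions; the lookup table here is a tiny illustrative dictionary, whereas a real run would load full place-name and mine-name lists with latitude/longitude columns.

```python
# Minimal sketch: geo-locate mentions of a query concept using a gazetteer lookup list.
# The gazetteer is a tiny illustrative dict; coordinates are approximate examples only.
import re
from collections import Counter

GAZETTEER = {
    "nevada": (38.80, -116.42),
    "arizona": (34.05, -111.09),
    "carlin": (40.71, -116.10),   # example mining district
}
QUERY = "precambrian"

def geolocated_mentions(sentences):
    """Count query mentions per (lat, lon) wherever a gazetteer entry occurs in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        s = sentence.lower()
        if QUERY in s:
            for place, latlon in GAZETTEER.items():
                if re.search(rf"\b{place}\b", s):
                    counts[latlon] += 1
    return counts  # bubble size on the map can be proportional to these counts
```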

  6. Topics

Another common way to ‘summarize’ the essence of what ‘lies beneath’ in text relies on a range of techniques, from complex word co-occurrence patterns and Principal Component Analysis (PCA) to eigenvalues and eigenvectors. The image below shows the clusters of topics for the search query ‘leaching’ in the SEG text corpus. These techniques can be applied at any level of granularity: an abstract, a single article, a whole corpus, or as a delta between journals or corpora. Topic modelling is typically applied longitudinally (through time) to surface changes in the intent behind text.

14 Topic Models
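A minimal sketch of topic modelling using gensim’s LDA implementation, assuming ‘docs’ is a list of token lists (for example, the paragraphs mentioning ‘leaching’); the number of topics and filtering thresholds are illustrative.

```python
# Minimal sketch: topic clusters for documents matching a query, using gensim LDA.
# "docs" is assumed to be a list of token lists (e.g. paragraphs mentioning "leaching").
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def topic_model(docs, num_topics=6):
    dictionary = Dictionary(docs)
    dictionary.filter_extremes(no_below=5, no_above=0.5)  # drop very rare and very common words
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus, id2word=dictionary, num_topics=num_topics, passes=10, random_state=0)
    return lda.print_topics(num_words=8)
```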

 

  7. Mathematical Word Vectors and Hypothesis Testing

The word co-occurrence patterns of any entity can be converted into a mathematical vector, and the similarity of these vectors compared with one another. From literature reviews, these techniques have been applied only sparsely (and in places not at all) within the geoscience discipline, compared to simple entity extraction and association.

The cross-plot below shows geological periods in the SEG corpus plotted by their similarity to the word vector of ‘volcanics’ on the x-axis and ‘limestone’ on the y-axis. In the bottom right, the Precambrian Archean period (2.5-4 billion years ago, largely before complex life on Earth) is very similar to ‘volcanics’ and not to ‘limestone’, which is what you would expect. Conversely, the Mississippian (top middle) is very similar to ‘limestone’ and not to ‘volcanics’, which is also what you would expect, as sea level was very high with warm shallow seas. So again, this supports the theory that word vectors from text can surface real-world patterns that make sense. Perhaps they can also reveal what we don’t yet know.

15 word vectors
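A minimal sketch of the cross-plot data using gensim’s Word2Vec, assuming ‘sentences’ is a list of token lists from the corpus and that the target words are in the vocabulary; the resulting (x, y) pairs can then be scattered as in the figure.

```python
# Minimal sketch: train word vectors on the corpus and cross-plot geological periods by their
# similarity to "volcanics" (x) and "limestone" (y). "sentences" is an assumed list of token lists;
# "volcanics" and "limestone" are assumed to be in the vocabulary.
from gensim.models import Word2Vec

PERIODS = ["archean", "cambrian", "silurian", "devonian", "mississippian", "jurassic"]

model = Word2Vec(sentences, vector_size=200, window=8, min_count=10, workers=4)

points = {
    p: (model.wv.similarity(p, "volcanics"), model.wv.similarity(p, "limestone"))
    for p in PERIODS
    if p in model.wv
}
print(points)  # feed these (x, y) pairs into a scatter plot
```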

A variation of this technique, which it is believed may never have been tried before in the geosciences, is combining data from a database with word vector information. In the cross-plot below, US states (e.g. Florida, Wyoming, Oregon) are plotted by their annual rainfall on the y-axis (from the National Oceanic and Atmospheric Administration (NOAA) database) and their similarity to the word vector ‘Arsenic’ in the SEG corpus on the x-axis. A weak correlation (R² = 0.26) is found, implying more similarity to the word vector ‘Arsenic’ with decreasing rainfall. Simplistically, this could be due to more arid environments (less rainfall) leading to higher pH conditions, with arsenic more likely to mobilise from the underlying geology into groundwater and aquifers.

16 Word Vectors Arsenic
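A minimal sketch of combining the word-vector similarities with measured data; the rainfall figures below are illustrative placeholders rather than NOAA values, and ‘model’ is the Word2Vec model trained in the previous sketch, with state names tokenised as single tokens.

```python
# Minimal sketch: combine word-vector similarity with measured data from a database.
# Rainfall numbers are illustrative placeholders, not NOAA data; "model" is the Word2Vec
# model from the previous sketch, with state names assumed present as single tokens.
from scipy.stats import linregress

rainfall = {"florida": 54.5, "wyoming": 12.9, "oregon": 27.4}   # placeholder values

states = [s for s in rainfall if s in model.wv]
x = [model.wv.similarity(s, "arsenic") for s in states]
y = [rainfall[s] for s in states]

fit = linregress(x, y)
print(f"R^2 = {fit.rvalue ** 2:.2f}")
```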

This could point to the potential value of combining word vector similarities from text with traditional measured data stored in structured databases. The whole may be greater than the sum of the parts.

A final example integrates data from the US National Cancer Database, the Alzheimer’s Association and, again, text vectors from the SEG corpus. The average cancer mortality rate per US state (per 100,000 people) is plotted on the x-axis, and the average Alzheimer’s mortality rate per US state on the y-axis. The similarity of each US state’s word vector to the heavy metal ‘Cadmium’ word vector is shown by the colour and size of the marker: the more similar, the larger the circle. Those with above-average similarity in the sample are coloured orange; those below average are coloured blue. There is no statistically significant correlation, and even if there were, correlation is of course not causation. There are many demographic and socio-economic factors at play in a complex system. However, these techniques may be useful in surfacing patterns that warrant further investigation or hypothesis testing.

17 Word Vectors Cadmium

 

  8. Automated Discovery

The final example compares the linkage between the word vector of every concept in the corpus and the word vector of every other concept, and their similarity to the word vectors of a hypothesized theme. In the example below, the theme is elements typically associated with geogenic (natural) contamination in groundwater (e.g. aluminium, iron, copper, mercury, lead).

A new, simple ratio has been developed (Cleverley 2017), combining linear regression with a scaling factor representing the individual similarity of the concept(s) to the theme, to surface the potentially ‘unusual’ associations which may warrant further discovery. In the run below, over 150 million word vector combinations were tested by an automated algorithm; this took four hours on a standard laptop.

 18 Automated discovery

19 Equation

For example, ‘Argon dating’ and ‘Feldspar Chlorite’ as individual concepts do not have high similarity to the theme. However, as an association they have a disproportionately higher correlation than one would expect, which may warrant further exploration to identify a causal mechanism.
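As a minimal sketch, the pairwise scan might look like the following; the scoring ratio itself is described in Cleverley (2017) and is not reproduced here, so plain cosine similarity of each concept pair to a mean ‘theme’ vector is used as a placeholder, with concepts assumed to be present in the model vocabulary.

```python
# Minimal sketch of the pairwise scan only; the ratio in Cleverley (2017) is NOT reproduced here.
# Each concept pair is scored by the cosine similarity of its combined vector to a mean theme
# vector, as a placeholder scoring function. Concepts are assumed to be in the vocabulary.
from itertools import combinations

import numpy as np

THEME = ["aluminium", "iron", "copper", "mercury", "lead"]

def theme_vector(model, terms):
    return np.mean([model.wv[t] for t in terms if t in model.wv], axis=0)

def scan(model, concepts, top_n=20):
    theme = theme_vector(model, THEME)
    theme /= np.linalg.norm(theme)
    results = []
    for a, b in combinations(concepts, 2):
        pair = model.wv[a] + model.wv[b]
        score = float(pair @ theme / np.linalg.norm(pair))
        results.append((score, a, b))
    return sorted(results, reverse=True)[:top_n]
```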

Just as Swanson (1988) manually identified (inferred) a link between magnesium deficiency and migraines that was not present in any single article, but emerged because separate articles shared similar concepts, these automated techniques could highlight new associations. This could lead to new knowledge and, ultimately, new scientific discoveries that are hidden amongst our text in plain sight.

Knowledge is socially constructed, and different text corpora will likely lead to different word vectors for the same concepts, depending on the sub-discipline and nature of the text. These differences may also surface clues to new phenomena of interest.

Based on literature reviews, the use of word vector similarities of entities with external data is potentially under-utilized in the geosciences. Future work will most likely expand the research to apply to much larger quantities of journals and further develop automated approaches. Questions, comments and ideas are always welcome, feel free to contact me on the email below.

Paul Cleverley PhD

Researcher

Robert Gordon University

Email: p.h.cleverley@rgu.ac.uk

Blog: www.paulhcleverley.com

A PDF of this article is available by clicking here.

References

20 - References

 

Are Search Algorithms Neutral?


Enterprise search and discovery algorithms are often perceived as objective and neutral, helping us overcome our own biases, even if they don’t always produce what we want or need. The cognitive computing narrative is one where machines read vast amounts of text to compensate for human cognitive bias and potential organizational dogma. The mantra is not to produce the ‘right answer’, but the ‘best available’. But can search algorithms be truly objective and unbiased themselves?

Search Engine Bias

Various phenomena that involve the manipulation of search engine results are typically referred to as search engine bias. Bias, however, is not easy to define and can be hard to detect. What is the difference between bias and a point of view? Take the incompatible statements: “It is a truism that every author is biased in favour of the claim he is making” and “Bias and prejudice are forms of error”.

There is avoidable bias (such as promoting a narrow partisan view when a broader non-partisan view ought to be taken), technical bias (such as that related to sampling) and unavoidable bias (such as in news reporting). This is not to criticise news reporting, but to guard against any view that reporting can be absolutely neutral. It is proposed that many aspects of search engine ranking are an unavoidable bias; the danger (just as with news reporting) would be to view it as a neutral rendering of data. It may be better to talk in terms of predispositions.

Search Engine Optimization

Search ranking involves automated and human interventions according to some design parameter choices (sometimes weightings are called ‘bias values’). Some content will be promoted and other content marginalized. Search Engine Optimization (SEO) is an iterative process to maintain or improve search result quality, which may see some content rise and other content fall as a result of changes. Some scholars have sought to measure the bias of web search engines by their deviation from a relative ‘norm’ of their peers. In previous articles and research papers, I have discussed the positive elements of using search algorithms designed specifically to stimulate the unexpected, insightful and valuable: nudging search engines into the role of creative assistant, rather than just a time saver. This article looks at the predispositions (bias) that may be inherent in search algorithms.
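To make the idea of ranking ‘design parameters’ concrete, here is a minimal sketch of Okapi BM25 term scoring (the open ranking standard mentioned in the conclusion); k1 and b are exactly the kind of tunable weightings, or ‘bias values’, referred to above.

```python
# Minimal sketch of Okapi BM25 scoring for a single query term. k1 and b are the tunable
# "design parameters"; changing them promotes some documents and marginalizes others.
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
```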

How we come to know things

Internet search engines are ubiquitous; they have become an epistemology, ‘how we come to know things’, which raises ethical issues. This has prompted further scrutiny to understand to what extent search algorithms and human interventions are truly unbiased. Indeed, some people argue that a search algorithm can never be neutral. Behind every algorithm is effectively a person, organization or society which created it and which is likely to display biases of some form, so any rendering by search engines is value-laden, not value-free.

Knowledge Representations

Algorithms themselves often incorporate query rules and Knowledge Organization Systems (KOS) such as taxonomies or ontologies. These KOS are ‘one version’ of reality and, whilst they can enhance information discovery, these schemas may also reinforce dogma and potentially blind us to new discoveries. Whilst some indicate that new cognitive computing techniques allow us to evaluate without bias, it may be a falsehood to say automated systems lack any bias.

Social Voting

Another aspect utilized in algorithms is explicit social voting (within sub-cultures and societies), creating a form of ‘standardization’. The more an item is viewed (clicked through) in search results in the context of a specific search query, the more popular it is perceived to be. Some information may therefore be ‘censored’ through its obscurity, where relevance is determined not by usefulness but by popularity, which may reinforce existing power structures and stereotypes. Items at the top of any search results list may exhibit ‘presentation bias’. Once some items get a high search rank (85-95% of people never click on page 2 of search engine results), a self-fulfilling prophecy may come into effect (the Matthew Effect): the rich get richer and the poor get poorer.

Personalization

Some algorithms also make use of user context (such as location and previous searches), a form of ‘personalization’. Some scholars feel that personalization has mitigated, or will mitigate, search engine ranking bias by producing tailored results. At the same time, individually tailored results unique to each person may place the searcher in an over-personalised filter bubble.

Technical Sample Bias

In addition to these ‘standardized’ and ‘personalized’ aspects of algorithms, there is technical bias related to the sample in the search index corpus. If the text within the search engine corpus is itself skewed, then you have a classic case of sampling bias. This may explain why the ‘Bing Predicts’ big data algorithm that followed the United Kingdom’s referendum on the European Union (EU) predicted a 55% vote to remain on June 23rd 2016. Social media trends may not reflect everyone’s opinions; the corpus may be prejudiced. Like any model, they can be true until they are not. The significant failures of the Google Flu Trends algorithm are another example, with some stating that ‘algorithm accountability’ may emerge as one of the biggest problems of our time.

Human Judges

In addition to automated rules and signals, search algorithms also undergo constant evaluation and tweaking by people in the background, with ratings generated by human judges assessing how ‘good’ results are. It is therefore unlikely that search results are completely untampered with.

Power to Influence Elections

Taking a more sinister turn, studies have shown that manipulation of search result ranking in Google could potentially affect people’s attitudes towards health risks, without people being aware they were being fed biased information. Some scholars provide evidence that manipulation of search engine algorithms could even influence democracy in national elections. Evidence appears to exist for search engines biasing results both towards the left and the right during elections, although (arguably) big data may make it easier to find evidence to support any particular point of view you wish to take.

Bias in Enterprise Search

Recent research involving three separate enterprise search technologies/deployments points to algorithmic bias also existing behind an organization’s firewall, within enterprise search and discovery technology. For example, enterprise search technology from at least one software vendor had default ‘factory-shipped’ search ranking configuration parameters that gave preference (ranking boosts) to its own document formats above those of its competitors.

Other examples in the enterprise include a bias in some ‘factory-shipped’ enterprise search algorithms towards their country of origin. For example, in one search engine that automatically geo-references search results to display on a map, any document containing the phrase ‘west coast’ was assumed to be about California. In another deployment that had indexed third-party information, algorithms were designed to favour small information providers rather than large ones, simply for performance reasons; a case perhaps of an enterprise search algorithm making arbitrary ‘editorial’ choices.

It is commonplace in enterprise search deployments for engineers, with the best intentions, to override automatically generated organic search results using promoted results (often termed ‘best bets’), and to tweak results through user-defined ‘gold standard’ test sets and search log mining, hunting for better search result quality. Some search engine practitioners state that engineers will have no idea what relevant results are, so involving users/customers to rate results is essential. Some organizations that have performed these types of search evaluation and tuning with test sets of documents have commented during enterprise search conferences that what one expert user feels is the optimal set of results for a search term can often differ significantly from what another expert in the enterprise feels.

Filtering of results is also commonplace within enterprise search deployments and SharePoint search, to remove or hide results deemed undesirable, inappropriate or not useful, using negative filters of ‘dirty words’ – for example, not showing results where the word ‘conference’ is mentioned. It would be an interesting question (dilemma?) if management in an organization ever asked their enterprise search team (using the latest machine learning techniques) to ‘hide’ search results for any content felt to portray the company in a bad light – such as comments made by staff on internal enterprise social blogs about its HR policies. Some may feel this is acceptable information governance practice; others may feel it is unethical.

Conclusion

For a variety of reasons (such as complexity and trade secrets) it may never be possible to fully understand what enterprise search algorithms are doing and the intent behind them, although some open standards exist (such as Okapi BM25). Due to this opacity, a significant amount of trust is placed in the hands of those who design and deploy search algorithms. Adopting a position of unconditional faith in algorithms may pose many risks. Increasing awareness of what biases already exist (through accident or design), or could exist in the future, might be a prudent step to take.

As we are all predisposed to certain views, it seems likely that search engines will be as well.

Paul H. Cleverley

Researcher, Robert Gordon University

The “4H” Model for inferring information and knowledge culture from search technology artefacts

It is still a work in progress; however, I have blended more elements of the ‘modality model for search’ into some of my recent thinking on how search technology artefacts could be used to infer aspects of information and knowledge culture.

information culture

A focus on using search to check various aspects of information compliance is termed ‘HOLD TO ACCOUNT’ and will likely lead to a preponderance of dashboard metrics and reports on the information asset. A focus on social connections (between people and their information), termed ‘HARNESS’, will likely yield a personalized approach (like popular social media sites) using search-driven algorithms to show people what is going on in their network. This may lead to unexpected, insightful and valuable connections.

A focus on using what is known to exist, termed ‘HARVEST’, is likely to lead to the deployment of a corporate ‘Google-like’ general-purpose search engine. The focus is on Information Retrieval (IR). This is likely to imply aspects of a Knowledge Management (KM) culture, as it relates to exploiting information rather than managing information, which is arguably the focus of an Information Management (IM) culture. More extreme forms of harvesting may see domain-specific search applications deployed, tuned for very specific work tasks and goals.

A focus on ‘what might exist’ or ‘what could be’ is termed ‘HUNT and HYPOTHESIZE’. This is likely to lead to a focus on rich, visual, exploratory search interfaces across various media and analytics. The focus is on the search for patterns rather than just retrieving information. This is also likely to lead to unexpected, insightful and valuable information encounters.

Machine learning techniques can be present in all parts of the model in some form and will likely be necessary within all quadrants, as information volumes are too large for people to practically read. However, the sense-making of staff will be crucial, as is information literacy in general. Noticing what is useful and valuable and generating new theories is never ‘in the data’. Enterprise search & discovery capability is likely to be a system of which technology is just one part.

Most organizations will contain the 4Hs to various degrees; however, the presence or absence of certain technology artefacts or features within search applications may be at odds with the overall organizational culture.

There are a few other angles I am considering; it’s a work in progress!

Some references that shaped the thinking:

ARNOLD, S.E., 2014a. Redefining Search: Enterprise Search and Big Data. Information Today, June 2014, pp. 22-23.
CHOO, C.W., 2013. Information culture and organizational effectiveness. International Journal of Information Management, 33, pp. 775-779.
CURRY, A. and MOORE, C., 2003. Assessing information culture – an exploratory model. International Journal of Information Management, 23, pp. 91-110.
DAVIES, A., FIDLER, D. and GORBIS, M., 2011. Future Work Skills 2020. [online]. University of Phoenix Research Institute. Available from: http://www.iftf.org/uploads/media/SR-1382A_UPRI_future_work_skills_sm.pdf
EASTWOOD, G., 2005. Enterprise Search tools move from luxury item to business essential as data builds up. [online]. Computerworld. Available from: http://www.computerweekly.com/feature/Enterprise-search-tools-move-from-luxury-item-to-business-essential-as-data-builds-up [accessed January 2016].
GINMAN, M., 1987. Information culture and business performance. International Association of Technological University Libraries (IATUL) Quarterly, 2(2), pp. 93-106.
GRANT, S. and SCHYMIK, G., 2014. Using Work System Theory to Explain Enterprise Search Dissatisfaction. Proceedings of the Information Systems Educators Conference (ISECON). 6-9 November 2014: Baltimore, Maryland, USA, pp. 1-11.
GREFENSTETTE, G. and WILBER, L., 2011. Search-Based Applications: At the Confluence of Search and Database Technologies. In: MARCHIONINI, G., Ed. Synthesis Lectures on Information Concepts, Retrieval, and Services. USA: Morgan & Claypool Publishers.
HEILBRONER, R.L., 1967. Do Machines Make History? Technology and Culture, 8(3), pp. 335-345.
HILLIS, K., PETIT, M. and JARRETT, K., 2013. Google and the Culture of Search. UK: Routledge.
HOFSTEDE, G. et al., 1990. Measuring Organizational Cultures: A Qualitative and Quantitative Study across Twenty Cases. Administrative Science Quarterly, 35(2), pp. 286-316.
JACKSON, S., 2011. Organizational culture and information systems adoption: A three-perspective approach. Information and Organization, 21, pp. 57-83.
LEIDNER, D.E. and KAYWORTH, T., 2006. A review of culture in information systems research: Towards a Theory of Information Technology Culture Conflict. Management Information Systems (MIS) Quarterly, 30(2), pp. 357-399.
MARTIN, J., 2002. Organizational culture: Mapping the terrain. Thousand Oaks, CA, USA: Sage Publications.
MOLNAR, A., 2015. The 5 C’s of Enterprise Search. [online]. Available from: http://www.searchexplained.com/the-five-cs-of-enterprise-search/
PETTIGREW, A. M., 1979. On Studying Organizational Cultures. Administrative Science Quarterly, 24(4), pp. 570-581.
POSTMAN, N., 1993. Technopoly: The surrender of culture to technology. New York, USA: Vintage Books.
SCHEIN, E.H., 2004. Organizational Culture and Leadership. 3rd ed. San Francisco, USA: Jossey-Bass, pp. 3-23.
TURNER, F., 2006. From Counterculture to Cyberculture: Stewart Brand, the Whole Earth Network and the rise of Digital Utopianism. Chicago, USA: University of Chicago Press.
VAN DER SPEK, R. and SPIJKERVET, A., 1997. Knowledge management: Dealing intelligently with knowledge. In: LIEBOWITZ, J. and WILCOX, L.C., Eds. Knowledge management and its integrative elements: Boca Raton, USA: CRC Press, pp. 31-59.
WATKINS, M., 2013. What is Organizational Culture? And Why Should we Care? [online]. Harvard Business Review (HBR). Available from: https://hbr.org/2013/05/what-is-organizational-culture

Enterprise Search & Discovery engines: How we might come to ‘know’..

Delighted to be invited to give a seminar to staff and students yesterday (22nd March) at the Edinburgh Napier University School of Computing and Centre for Social Informatics. We discussed how enterprises may wish to experiment with their search engines and user interfaces in an attempt to deliberately stimulate unexpected encounters which may lead to new knowledge creation. These opportunities may otherwise remain hidden using the ‘classic’ search box, ten blue links and some metadata-driven refiners (faceted search) ordered by popularity or frequency of occurrence.

Napier.png

In the atrium at the university is a statue of John Napier, after which the university is named. He is best known for discovering logarithms.

PhD Success!

Delighted to share the news that after a successful defence of my thesis this week I have been awarded my PhD. A BIG thank you to everyone who has contributed, supported or followed my progress these past 4 years, it is very much appreciated. I am very excited at the collaborations and projects that are presenting themselves with a number of organizations and universities, so look forward to continuing the research on how enterprises can transform their capabilities to find & discover information. I will continue to blog post when I can.

PhD1.jpg

Organizational Information Culture and Technology Artefacts

Organizational information culture can be defined as the behavioural norms and values shown by employees towards information. Researching enterprise search technology artefacts that represent aspects of the culture in which they are deployed suggests that some organizations may not score highly in supporting an information culture that enables innovation and creativity. In addition to mapping information cultures through observation, surveys and interviews, there may be opportunities to infer aspects of culture through abductive reasoning based on the nature of technology artefacts (or the absence of them). It’s a work in progress; I’ve adapted Choo’s (2013) and Cameron and Quinn’s (2011) typologies as applied to enterprise search.

Technology artefacts

I have kept Choo’s dimensions on the y-axis and the four typology descriptions, adding my own text in red to describe the relationships to technology artefacts, with the x-axis relating to an enterprise search and discovery capability continuum (information ‘containers’ versus entities/concepts) building on the ‘modality model’. The presence or absence of technologies (or aspects of technology features) may imply certain information cultures (that have been inscribed into the technology).

Can we improve corporate search?

I had an article published in Digital Energy Journal this month. Link here

Most of us expect a search engine to be a tool which delivers us results such as documents, web pages, people profiles, lessons learnt and best practices when we type something into a box. But can it do more? Oil and gas data scientist Paul Cleverley is doing a PhD to try to find out.

Development of ‘enterprise search’ technology has fairly well stagnated in companies, said Paul Cleverley, speaking at the Digital Energy Journal Aberdeen conference in May, ‘Subsurface computing and competitive advantage’.

It is usually seen as a utility, not something which affects the bottom line of the company. ‘Quite often the user interface is a pretty bland search box,’ he said. This may follow the theory-in-use dominant culture of Google which drives our expectations, but are some latent needs going unmet?

Considering the benefit to the company of making it easier for people to find what they are looking for and what may be valuable, perhaps it is worth investing in a better search engine architecture and design, Mr Cleverley believes.

Company information as a whole may be under-exploited and under-explored. Where the ‘whole is greater than the sum of the parts’, using the right approach, company information may be able to surface an answer or association that is not present in any one single document. Using a common metaphor, as well as finding ‘needles in haystacks’, smashing together information haystacks and finding ‘new needles’ could be a gamechanger.

This isn’t just an oil and gas problem. Scientists and engineers are generally interested in similar concepts in any industry, he said. Some of the concepts from the research have been shared with NASA and incorporated into their communication and designs.

Better searching tools can help geoscientists be more objective – weighing up a range of different possibilities, rather than sorting for information which fits their hypothesis.

Many of us may have heard executives saying that having a geologist who knows the basin inside out is a valuable asset – but such a person can also be a liability, if he or she is not willing to engage with an alternative point of view about how it works.

Paul Cleverley is an information scientist. He is in his 4th year of a PhD at Robert Gordon University, Aberdeen…