Preview of text of a whitepaper, results presented by RioTinto at the Prospectors and Developers Association of Canada 2012 International Convention, Trade Show & Investors Exchange in Toronto, Canada
"Enabling an Exploration Data Access Strategy with Spatial Discovery"
Jess Kozman, QBASE; David Hedge, Conducive Pty Ltd
“...Spatial attributes are pervasive in energy geotechnical data, information and knowledge elements, and users expect enterprise search solutions to be map-enabled...”
Search solutions, similar to those already deployed at oil and gas companies, are now being piloted in both the mineral extraction and carbon sequestration segments of the resources industry. A recent pilot project in the exploration division of a major mining company within Australia has demonstrated the significant value gained from the effective integration of Enterprise Search technology, Natural Language Processing (NLP) for geographic positioning (geo-tagging) and Portal delivery. Branded internally as Spatial Discovery, the pilot project is part of a larger strategy to discover globally, access regionally, and manage locally the data, information and knowledge elements utilized in the mineral exploration division, namely: geochemical and geophysical data, documents (both internal and external, including those stored in structured and unstructured data repositories), GIS map data, and geo-referenced image mosaics. The initial stage involved validating a technology for spatial searches to enable streamlined, intelligent access to a collection of scanned documents by secured users, through scheduled automated crawls for geo-tagging, and following corporate security guidelines. This stage also included administrator training, which covered defining procedures for managing the document collections, maintaining the hardware appliance used for generating the spatial index, customizing the User Search Interface, and developing and implementing support roles and responsibilities. Functionality testing was run on a subset of documents representative of the enterprise collections that would need to be addressed by the Exploration Data Access (EDA) solution.
The next stages will focus on broadening the user base, with a goal of having access and use by all corporate geoscientists. This will be accomplished by defining, prioritizing and publicizing the spatial indexing of additional document collections, developing a methodology for managing and enhancing a custom gazetteer with geographic place names specific to the Australian mineral industry, and integrating with existing GIS map layers such as land rights. A proof of concept user interface will be rolled out to a selected User Reference group for input and feedback. The ongoing stages will be supported by utilizing a recently delivered testing and development hardware appliance, implementing connectors to existing electronic document management systems (EDMS) as well as portal delivery systems and SQL data stores, and completing a feature-rich enhancement of the User Interface. This stage will also align the Spatial Discovery project with the larger Exploration Data Access (EDA) initiative and provide a proof of concept for enterprise search strategies based on best practices from other resource industries.
An essential part of this stage is the creation of a Customized Gazetteer to work with the NLP engine and geo-tagging software, which identifies geographic place names in text from multiple formats of unstructured documents and categorizes the index by location types such as country, region, populated place, mines, Unique Well Identifiers (UWI), camps, or concession identifiers. The index also allows sorting of search results by relevance based on natural language context, and Geo-Confidence, or the relative certainty that a text string represents a particular place on a map.
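The role of the gazetteer described above can be illustrated with a minimal Python sketch. It is not the pilot's implementation: the entry structure, the function names (tag_places, group_by_feature_type), the feature type labels and the approximate coordinates are all assumptions made for illustration, and a plain whole-word match stands in for the NLP engine.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GazetteerEntry:
    name: str
    feature_type: str
    lat: float
    lon: float


# Hypothetical entries; coordinates are approximate and for illustration only.
GAZETTEER = [
    GazetteerEntry("Darwin", "populated place", -12.46, 130.84),
    GazetteerEntry("Kalgoorlie", "populated place", -30.75, 121.47),
    GazetteerEntry("Mount Isa", "mine", -20.73, 139.49),
]


def tag_places(text, gazetteer=GAZETTEER):
    """Return (entry, character offset) pairs for whole-word gazetteer matches."""
    hits = []
    for entry in gazetteer:
        pattern = r"\b" + re.escape(entry.name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((entry, match.start()))
    # Order hits by where they appear in the document text.
    return sorted(hits, key=lambda hit: hit[1])


def group_by_feature_type(hits):
    """Categorize matched place names by location type."""
    index = {}
    for entry, _offset in hits:
        index.setdefault(entry.feature_type, []).append(entry.name)
    return index
```

Calling tag_places on a sentence that mentions Mount Isa and Darwin yields two hits in document order, which group_by_feature_type then splits into a "mine" list and a "populated place" list.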
Future improvements to the system will include increasing the confidence in geo-tagging to correctly identify ambiguous text strings such as “WA” in locations and street addresses from context. This will correctly give documents referencing Asia Pacific regions a higher probability of “WA” referring to “Western Australia” instead of the default assignment to “Washington”, the state in the United States. The natural language processing engine can be trained using a GeoData Model (GDM) to understand such distinctions from the context of the document, and can utilize international naming standards such as the ISO 3166-2 codes for political subdivisions such as states, provinces, and territories. The capabilities of the natural language processing engine to use grammatical and proximity context become more important for the correct map location of documents when a populated place such as “Belmont, WA” appears frequently in company documents because of the location of a data center in Western Australia, for example, but could be confused with the city of Belmont, Washington, in the United States without contextual clues.
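The “WA” case can be sketched as a simple scoring of context cues. This is only an illustration under stated assumptions: the cue lists, the default assignment table, the disambiguate function and the fallback confidence of 0.5 are hypothetical, and a trained GeoData Model would use far richer grammatical and proximity context.

```python
import re

# Hypothetical context cues for each candidate reading of an abbreviation.
CANDIDATES = {
    "WA": {
        "Western Australia": ["australia", "perth", "pilbara", "asia pacific"],
        "Washington": ["seattle", "spokane", "united states"],
    }
}

# Reading used when the document offers no contextual clue.
DEFAULTS = {"WA": "Washington"}


def count_cues(cues, lowered_text):
    """Count how many cue phrases occur as whole words in the text."""
    return sum(
        1 for cue in cues
        if re.search(r"\b" + re.escape(cue) + r"\b", lowered_text)
    )


def disambiguate(abbreviation, text):
    """Return (place, confidence) for an ambiguous abbreviation."""
    lowered = text.lower()
    scores = {
        place: count_cues(cues, lowered)
        for place, cues in CANDIDATES[abbreviation].items()
    }
    total = sum(scores.values())
    if total == 0:
        # No context at all: fall back to the default with an arbitrary low confidence.
        return DEFAULTS[abbreviation], 0.5
    best = max(scores, key=scores.get)
    return best, scores[best] / total
```

With this sketch, a document that mentions Perth alongside “Belmont, WA” resolves to Western Australia, while the same string with no other context keeps the default assignment to Washington.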
The NLP engine is made more robust by an understanding of relative text strings such as “30 km NW of Darwin” and support for foreign language grammar and special characters such as those in French and Spanish. The current NLP engine also has the ability to locate and index date text strings in documents so that documents can be located temporally as well as spatially. Next stages of the deployment will include improvements to the current Basic User Interface such as automatic refresh of map views and document counts based on selection option context, support for the creation of “electronic data room” collections in EDMS deployments, URL mapping at directory levels above a selected document, and the capture of backup configurations to preserve snapshots of the index for version control of dynamic document collections such as websites and news feeds. The proof of concept User Interface already includes some innovative uses of user interface controls, such as user-selectable opacities for map layers, the ability to “lock” map refreshes during repeated pans, and utilities for determining the geometric centers (centroids) of polygonal features. Further results of the pilot show that there is the potential to replace the connectors currently in use, enabling an enterprise keyword search engine (EKSE) to perform internal content crawls and ingest additional document types and to pass managed properties to the geo-tagger to enhance the search experience. The performance of remote crawling versus having search appliances physically located in data centers is also being evaluated against the constraints of limiting the content crawled from individual documents.
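As an illustration of how a relative text string such as “30 km NW of Darwin” might be turned into a map position, the sketch below parses the phrase with a regular expression and applies a flat-earth offset from the named place. The pattern, the function names and the constant of roughly 111.32 km per degree of latitude are assumptions for this example; the approximation is only reasonable for short distances away from the poles and is not the engine's actual method.

```python
import math
import re

# Compass bearings in degrees clockwise from north.
BEARINGS = {"N": 0, "NE": 45, "E": 90, "SE": 135,
            "S": 180, "SW": 225, "W": 270, "NW": 315}

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude

# Matches phrases like "30 km NW of Darwin" (capitalized place names only).
RELATIVE_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*km\s+(NE|NW|SE|SW|N|E|S|W)\s+of\s+"
    r"([A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"
)


def parse_relative(text):
    """Return (distance_km, direction, place) or None if no phrase is found."""
    match = RELATIVE_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2), match.group(3)


def offset(lat, lon, distance_km, direction):
    """Shift a point by a distance along a compass bearing (flat-earth approximation)."""
    bearing = math.radians(BEARINGS[direction])
    delta_lat = distance_km * math.cos(bearing) / KM_PER_DEGREE_LAT
    delta_lon = distance_km * math.sin(bearing) / (
        KM_PER_DEGREE_LAT * math.cos(math.radians(lat))
    )
    return lat + delta_lat, lon + delta_lon
```

The parsed place name would then be looked up in the gazetteer, and its coordinates passed to offset together with the distance and direction.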
The pilot project is designed to validate the ability of the geo-tagging tool to share an index with enterprise keyword search engines, and to use Application Programming Interfaces (APIs) to provide the results of document ingestion and SQL-based structured data searches to both portal delivery systems and map-based “mash-ups” of search results.
The goals of the successful proof of concept stage were: to demonstrate that the geo-tagger could ingest text provided by the keyword search ingestion pipe, without having to duplicate the crawl of source documents; to use metadata from keyword search for document categorization such as product type, related people, or related companies; and to provide a metadata list of place names, confidence and feature types back to the search engine. The resulting demonstrated functionality moves towards providing “Enterprise Search with Maps”. The completed EDA project is sponsored by the head of exploration and will remove the current “prejudice of place” from global search results for approximately 250 geotechnical personnel for legacy data and information, in some cases dating back to 1960. The solution supports a corporate shift from regional activity, focused on projects and prospects with a 24 to 36 month timeline, to global access that will no longer be biased toward locations with first world infrastructure, and will eliminate the need for exploration personnel to take physical copies of large datasets into areas with high geopolitical risk. The corporate Infrastructure Services and Technology (IS&T) group is the main solution provider in the project with the ongoing responsibility for capacity, networking and security standards management. The deployed solution will have to support search across global to prospect scales, and roles including senior management, geoscience, administrative, data and information managers, research and business development. The focus is on a single window for data discovery that is fast and consistent, with components and roles for connected search and discover solutions. The entire solution will be compatible with the architecture used for the broader context of a discovery user interface and data layer for mineral exploration.
Further work identified during the Proof of Concept included developing strategies for documents already ingested prior to establishing the keyword search pipe, merging licensing models for the keyword and spatial search engines, and adding full Boolean search capability to the spatial keyword functions. In the current implementation, the user is supplied with a larger search result from the keyword search, while the spatial search returns only those documents with spatial content that allows them to be placed on a map. Conversely, the keyword results will receive place name metadata for searching, but will be limited in map capabilities. Identified benefits from the Proof of Concept were that separate collections of documents did not need to be built for the spatial search engine, the single crawler reduced load on the file repository, and additional connector framework development was not required. The next stage will validate a security model managing document security tokens in the ingestion pipe.
The baseline architecture was also validated during the Proof of Concept phase. In this architecture, the enterprise keyword search engine (EKSE) passes text from crawled documents individually to the enterprise spatial search engine (ESSE). The ESSE then extracts metadata and processes text using the Natural Language Processing (NLP) engine looking for geographic references. The ESSE passes back managed properties for locations, rating of confidence in location, and feature type (such as mining area, populated place, or hydrographical feature), while the GeoData Model (GDM) and Custom Gazetteer provide a database of place names, coordinates and features. The system is combined with an existing ESSE component licensed on production for 1 million geo documents, to be used with the geo-tagger stream processor. Geo-confidence results are being analyzed to evaluate the impact of misread characters from digital copies of documents produced through optical character recognition (OCR), and ambiguous character strings such as “tx” being an abbreviation for “transmission” in field notes for electromagnetic surveys as well as a potential spatial location (U.S. postal abbreviation for Texas).
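The hand-off in this architecture can be pictured with a small Python sketch in which a stand-in for the EKSE crawls each document once and passes its text to a stand-in for the ESSE, which returns managed properties. Everything here is hypothetical: the class and function names, the two-entry gazetteer with approximate coordinates, and the toy confidence formula are assumptions chosen to show the shape of the exchange, not the behavior of either product.

```python
from dataclasses import dataclass, field


@dataclass
class CrawledDocument:
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)
    managed_properties: list = field(default_factory=list)


# Hypothetical custom gazetteer: name -> (feature type, latitude, longitude).
CUSTOM_GAZETTEER = {
    "Darwin": ("populated place", -12.46, 130.84),
    "Pilbara": ("region", -21.0, 119.0),
}


def spatial_engine_tag(text):
    """Stand-in for the ESSE: one managed property per recognized place name."""
    properties = []
    for name, (feature_type, lat, lon) in CUSTOM_GAZETTEER.items():
        mentions = text.count(name)
        if mentions:
            properties.append({
                "place": name,
                "feature_type": feature_type,
                "lat": lat,
                "lon": lon,
                # Toy confidence: repeated mentions raise certainty, capped at 1.0.
                "confidence": min(1.0, 0.5 + 0.25 * (mentions - 1)),
            })
    return properties


def keyword_engine_ingest(documents):
    """Stand-in for the EKSE: crawl once, pass each document's text to the tagger."""
    for doc in documents:
        doc.managed_properties = spatial_engine_tag(doc.text)
    return documents
```

The point of the design is that the source documents are crawled only once; the spatial engine works from the text it is handed and returns only metadata.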
Recent technology partnerships have included providers of Web Map Services (WMS) that incorporate the idea of large amounts of static or base layer data (land boundaries, Geo-Referenced images and grids) overlain by dynamic operational data such as geophysical and geochemical interpretations. Other development strategies may include launching search in context from analytic applications, conforming to public OGC standards, using the “shopping cart” concept of commercial GeoPortals, and arranging spatial metadata and taxonomies along the lines of ISO content categories.
The pilot project team identified several achievements from the Proof of Concept phase. Documents ingested by the keyword search engine that had place name references were successfully located on the user map view. Categories passed from the keyword search, such as source or company names, could be searched in the spatial search engine as document metadata. Also, feature types and place names with location confidences were provided, appearing on the spatial search page as managed properties. In the deployment phase, the system will be enhanced with security implemented by passing the access control lists associated with each document through the ingestion pipeline and processing them for replicated security in the spatial search engine. Improved presentation of returned managed properties will allow them to be managed for use as a refined list. Search categories can be selectable from an enhanced user interface to allow, for example, selection of a product type for search refinement. This will complement the current Boolean search parameters available in the map view.
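One way to picture replicated security is to carry each document's access control list alongside its spatial metadata and trim results at query time. The sketch below assumes a simple group-based list; the field names, group names, document identifiers and the trim_by_acl function are hypothetical and do not describe the security model that the deployment phase will validate.

```python
def trim_by_acl(results, user_groups):
    """Keep only results whose access control list shares a group with the user."""
    groups = set(user_groups)
    return [doc for doc in results if groups & set(doc["acl"])]


# Hypothetical search results with access control lists carried through ingestion.
results = [
    {"doc_id": "report-001", "place": "Darwin", "acl": ["exploration-au", "managers"]},
    {"doc_id": "report-002", "place": "Pilbara", "acl": ["legal"]},
]
```

A user in the hypothetical "exploration-au" group would see only report-001 on the map, and a user with no matching group would see nothing.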
The enhanced User Interface also presents the density of search results, the directory location of located documents, and the file type of the document. The map view also allows a more Australasia-centric map experience by removing the arbitrary “seam” at the International Date Line (Longitude 180 degrees) so the region can be centered on a map.
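Removing the seam at 180 degrees amounts to re-expressing longitudes in a 360-degree window centered on the region of interest rather than on the prime meridian. A minimal sketch follows, assuming a hypothetical map center of 135 degrees East; the function name and the default center are illustrative only.

```python
def recenter_longitude(lon, center=135.0):
    """Map a longitude into the 360-degree window centered on `center`."""
    return (lon - center + 180.0) % 360.0 - 180.0 + center
```

With this center, a point at 178 degrees West is drawn at 182 degrees, next to points at 175 degrees East, instead of on the opposite edge of the map.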
The concept of “Enterprise Search with Maps” will be driven as part of the architecture of the Exploration Data Access project, and the level of integration may be impacted by decisions about future versions of the corporate portal. Next steps include evaluating the relative costs and benefits of the enterprise licenses and how they are consumed and checked out during the crawl and display processes, the potential use of licenses for each active geo-tagged document versus the use of managed properties, direct indexing of spatial databases and geotechnical repositories using the keyword search engine, and security implementation. A third party application is also being used to scan and categorize documents discovered with GeoTagging in order to extract and protect potentially sensitive information.
The finalized solution will provide a holistic search interface that allows geotechnical users to answer essential questions about both the structured and unstructured data in their enterprise, making access to mission-critical data more efficient and reducing the risk of geotechnical decisions.