Title : Terminology spectrum analysis of natural-language chemical documents: Application on catalysis
This study presents a methodology for automatically extracting a complete set of ‘term-like phrases’ and building a terminology spectrum from a collection of natural-language PDF documents in the field of chemistry and catalysis. A ‘term-like phrase’ is one or more consecutive words and/or alphanumeric strings, with unchanged spelling, that convey a specific scientific meaning. A terminology spectrum for a natural-language document is an indexed list of tagged entities, including recognized general scientific concepts, terms linked to existing thesauri, names of chemical substances and reactions, and term-like phrases. The retrieval routine is based on n-gram textual analysis with sequential execution of various ‘accept and reject’ rules that take morphological and structural information into account.
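The n-gram step with sequential reject rules can be illustrated with a minimal Python sketch. Everything named here (the stop list `STOPWORDS`, the two rules, the tokenizer pattern, and the limit `max_n`) is a hypothetical stand-in chosen for illustration; the actual rule set of the study also uses morphological and structural information that this sketch does not model, and it ignores sentence boundaries.

```python
import re

# Hypothetical stop list; the rules actually used in the study are not listed here.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "on", "for", "to", "with", "by"}

def tokenize(text):
    # Keep words and alphanumeric strings such as "ZSM-5" or "Pt/Al2O3" intact.
    return re.findall(r"[A-Za-z0-9][A-Za-z0-9/\-]*", text)

def ngrams(tokens, max_n=3):
    # Yield every run of 1..max_n consecutive tokens.
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def reject_boundary_stopword(gram):
    # Illustrative rule: a phrase may not start or end with a stop word.
    return gram[0].lower() in STOPWORDS or gram[-1].lower() in STOPWORDS

def reject_pure_number(gram):
    # Illustrative rule: a phrase made only of digits carries no terminology.
    return all(tok.isdigit() for tok in gram)

# Rules are applied in order; the first rule that fires rejects the n-gram.
REJECT_RULES = [reject_boundary_stopword, reject_pure_number]

def candidate_phrases(text, max_n=3):
    kept = []
    for gram in ngrams(tokenize(text), max_n):
        if any(rule(gram) for rule in REJECT_RULES):
            continue
        kept.append(" ".join(gram))
    return kept

phrases = candidate_phrases("Oxidation of methane on the Pt/Al2O3 catalyst")
```

With this toy rule set, "Oxidation of methane" and "Pt/Al2O3 catalyst" survive, while fragments such as "the Pt/Al2O3" are rejected because they start with a stop word.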
Such terminology spectra can be used to perform various types of textual analysis across document collections. A terminology spectrum may be employed for text information retrieval and for reference database development. For example, it may be used to develop thesauri; to analyze research trends in a subject field by registering changes in terminology; to derive inference rules for understanding particular text content; to find similar documents by comparing their terminology spectra within an appropriate vector space; and to develop methods for automatically mapping a document to a reference database field.
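As an illustration of comparing terminology spectra in a vector space, the sketch below treats each spectrum as a sparse vector of phrase counts and uses cosine similarity. The choice of cosine similarity and the function name `cosine_similarity` are assumptions made for this example; the text above only calls for "an appropriate vector space".

```python
import math
from collections import Counter

def cosine_similarity(spec_a, spec_b):
    # Each spectrum is a mapping from phrase to its count in one document.
    shared = spec_a.keys() & spec_b.keys()
    dot = sum(spec_a[t] * spec_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty spectrum is treated as similar to nothing
    return dot / (norm_a * norm_b)

doc1 = Counter(["zeolite", "zeolite", "methane oxidation"])
doc2 = Counter(["zeolite", "methane oxidation", "methane oxidation"])
doc3 = Counter(["graphene"])

sim_related = cosine_similarity(doc1, doc2)
sim_unrelated = cosine_similarity(doc1, doc3)
```

Documents that share no phrases score 0, and documents with proportional phrase counts score 1, regardless of document length.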
For instance, if a set of documents contains texts from different time periods, analyzing the textual and absolute frequencies of occurrence makes it possible to follow the “life cycle” of each term-like phrase quantitatively (increasing usage, decreasing usage, and so on). This makes it possible to identify research trends and new concepts in a subject field by registering changes in terminology usage in the most rapidly developing areas of research. Moreover, similar dynamics of change over time for different terms often indicate an associative link between them (e.g., between a new process and a newly developed catalyst or methodology). Indicator words or phrases such as “for the first time”, “unique”, and “distinctive feature” may additionally be used to detect in texts, for example, new recipes or catalyst compositions for the process under study.
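A minimal sketch of such a "life cycle" analysis follows. It assumes that the absolute frequency is the total number of occurrences of a phrase in a period and that the textual frequency is the number of texts in that period containing the phrase at least once; these definitions, the function names, and the toy data are assumptions for illustration only.

```python
from collections import defaultdict

def frequencies_by_period(docs):
    # docs: iterable of (period, list_of_phrases) pairs, one pair per document.
    absolute = defaultdict(lambda: defaultdict(int))  # phrase -> period -> occurrences
    textual = defaultdict(lambda: defaultdict(int))   # phrase -> period -> documents
    for period, phrases in docs:
        for phrase in phrases:
            absolute[phrase][period] += 1
        for phrase in set(phrases):
            textual[phrase][period] += 1
    return absolute, textual

def trend(series, periods):
    # Crude label comparing only the first and last period of the series.
    first = series.get(periods[0], 0)
    last = series.get(periods[-1], 0)
    if last > first:
        return "increasing"
    if last < first:
        return "decreasing"
    return "stable"

# Toy data, invented for illustration only.
docs = [
    (2005, ["zeolite", "zeolite"]),
    (2005, ["zeolite"]),
    (2015, ["zeolite", "graphene"]),
    (2015, ["graphene", "graphene"]),
]
absolute, textual = frequencies_by_period(docs)
zeolite_trend = trend(absolute["zeolite"], [2005, 2015])
graphene_trend = trend(absolute["graphene"], [2005, 2015])
```

Phrases whose series rise or fall together across periods would then be candidates for the associative links mentioned above; a real analysis would compare whole series rather than only their endpoints.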
The retrieval process is assessed quantitatively by precision (P), recall (R) and the F1-measure, using text abstracts from several catalysis conferences.
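For reference, the three measures can be computed as in the sketch below, assuming the retrieved phrases are compared as a set against a manually prepared set of relevant phrases. The function name and the example sets are hypothetical and do not reproduce the evaluation data of the study.

```python
def precision_recall_f1(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # P: share of retrieved phrases that are relevant.
    p = true_positives / len(retrieved) if retrieved else 0.0
    # R: share of relevant phrases that were retrieved.
    r = true_positives / len(relevant) if relevant else 0.0
    # F1: harmonic mean of P and R.
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

# Hypothetical example: four retrieved phrases, three relevant ones, two in common.
p, r, f1 = precision_recall_f1(
    {"zeolite", "methane oxidation", "the catalyst", "was used"},
    {"zeolite", "methane oxidation", "Pt/Al2O3"},
)
```

In this example P = 2/4, R = 2/3, and F1 = 4/7.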