Issue 4, 2022

Extraction of chemical structures from literature and patent documents using open access chemistry toolkits: a case study with PFAS

Abstract

The extraction of chemical information from documents is a demanding task in cheminformatics because chemistry is represented in a wide variety of text- and image-based forms. The present work describes the extraction of chemical compounds with unique chemical structures from two open access full-text document repositories, CORE (COnnecting REpositories) and Google Patents. The importance of structure normalization is demonstrated using three open access cheminformatics toolkits: the Chemistry Development Kit (CDK), RDKit and OpenChemLib (OCL). Each toolkit was used for structure parsing, normalization and subsequent substructure searching, with SMILES as the structure representation and International Chemical Identifiers (InChIs) for comparison. Per- and polyfluoroalkyl substances (PFAS) were chosen as the case study for the substructure search because of their high environmental relevance, their presence in both the literature and patent corpora, and the current lack of community consensus on their definition. Three different structural definitions of PFAS were chosen to highlight the implications of each definition from a cheminformatics perspective. Because CDK, RDKit and OCL implement different criteria and methods for SMILES parsing and normalization, each toolkit parsed a different number of compounds; the parsed compounds were then evaluated against the three PFAS definitions. A comparison of these toolkits and definitions is provided, along with a discussion of the implications for PFAS screening and text mining efforts in cheminformatics. Finally, the extracted PFAS (∼1.7 M PFAS from patents and ∼27 K from CORE) were compared against several existing PFAS lists and are provided in multiple formats for further community research efforts.
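The toolkit comparison described above can be pictured with a minimal Python sketch. It is not the authors' pipeline: it assumes each toolkit has already been run and has produced, for every input SMILES, either a normalized identifier (for example an InChI) or nothing when parsing failed. The function names, toolkit labels and identifier strings are hypothetical placeholders, not data from the paper.

```python
from typing import Dict, Optional, Set

# Hypothetical per-toolkit result: input SMILES -> normalized identifier
# (e.g. an InChI string), or None when the toolkit could not parse the input.
ToolkitResult = Dict[str, Optional[str]]


def unique_structures(result: ToolkitResult) -> Set[str]:
    """Return the distinct identifiers among the successfully parsed inputs."""
    return {ident for ident in result.values() if ident is not None}


def compare_toolkits(results: Dict[str, ToolkitResult]) -> Dict[str, object]:
    """Summarize parsed counts, unique-structure counts and the shared set."""
    uniques = {name: unique_structures(res) for name, res in results.items()}
    parsed = {
        name: sum(ident is not None for ident in res.values())
        for name, res in results.items()
    }
    # Structures that every toolkit parsed and normalized to the same identifier.
    shared = set.intersection(*uniques.values()) if uniques else set()
    return {
        "parsed": parsed,
        "unique": {name: len(u) for name, u in uniques.items()},
        "shared_by_all": shared,
    }


if __name__ == "__main__":
    # Placeholder identifiers ("ID-...") stand in for real InChIs.
    demo = {
        "toolkit_a": {
            "OCC": "ID-ETHANOL",
            "CCO": "ID-ETHANOL",  # same compound written as a different SMILES
            "FC(F)(F)C(F)(F)F": "ID-C2F6",
            "C1CC": None,  # unclosed ring: parsing fails
        },
        "toolkit_b": {
            "OCC": "ID-ETHANOL",
            "CCO": "ID-ETHANOL",
            "FC(F)(F)C(F)(F)F": None,  # assumed parse failure, for illustration
            "C1CC": None,
        },
    }
    print(compare_toolkits(demo))
```

Keying on a normalized identifier rather than on the raw SMILES is what lets two different SMILES strings for the same compound count as one structure, which is why the number of parsed inputs and the number of unique structures can differ for each toolkit.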

Article information

Article type: Paper
Submitted: 19 Mar 2022
Accepted: 31 May 2022
First published: 31 May 2022
This article is Open Access
Creative Commons BY license

Digital Discovery, 2022, 1, 490-501

S. J. Barnabas, T. Böhme, S. K. Boyer, M. Irmer, C. Ruttkies, I. Wetherbee, T. Kondić, E. L. Schymanski and L. Weber, Digital Discovery, 2022, 1, 490 DOI: 10.1039/D2DD00019A

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.
