Chemistry-Enriched Patent Curation

Matthias Negri, PhD, Scientific Information Center, Boehringer Ingelheim Pharma GmbH & Co. KG. Chemistry-Enriched Patent Curation: semi-automatic analysis and elaboration of patents. ChemAxon UGM 2015, Budapest, 20 May 2015. Árpád Figyelmesi, ChemAxon.

Transcript of Chemistry-Enriched Patent Curation

Page 1: Chemistry-Enriched Patent Curation

Matthias Negri, PhD, Scientific Information Center

Boehringer Ingelheim Pharma GmbH & Co. KG

Chemistry-Enriched Patent Curation: semi-automatic analysis and elaboration of patents

ChemAxon UGM 2015, Budapest, 20 May 2015

Árpád Figyelmesi, ChemAxon

Page 2: Chemistry-Enriched Patent Curation

Content

1. Chemistry in patents

2. Why do we need a patent curation workflow?

3. Semi-automatic Patent Curation Workflow - Overview

4. Linked tools/technologies

5. ChemCurator (ChemCC)

6. Semi-automatic Patent Curation Workflow – Step by Step

7. Lessons learned, weak-points, limitations

8. Outlook

Negri Matthias, ChemAxon UGM 2015

Page 3: Chemistry-Enriched Patent Curation

Chemistry in patents

Chemistry appears in diverse forms in patents:

1. TEXT - IUPAC names, common names, etc.

2. IMAGES - embedded within or attached to the document

3. ATTACHMENTS (MOL/CDX)

4. TABLES

– as ONE-image file (tables with chemistry and bioactivity data)

– as chemistry-only image files embedded within table tags

5. Markush Structures/Formulas with R-groups

---------------------------------------------------------------------------------------

Currently NO commercial solution covers all these cases

Most of the cases are considered in the patent curation workflow

(Markush/R-group Formulas recognized and stored separately)


Page 4: Chemistry-Enriched Patent Curation

Why do we need a patent curation workflow?

Motivations:

1. Linked chemistry-retrieval from patents (+ chemistry as images)

2. IUPAC-enriched XML patent files as NEW source for text-mining

3. extraction of bioactivity data/targets/diseases/… in relation to chemistry

4. Similarity/Substructure frequency in compound sets of patents

5. …


Page 5: Chemistry-Enriched Patent Curation

Semi-automatic Patent Curation Workflow

Overview – current state

2 parallel branches from the same INPUT:

KNIME branch: SLOWER & memory-intensive, BUT higher quality, more control & IUPAC-enriched XML

ChemCC branch: FASTER, BUT LESS informative/flexible - ChemCC as the (near-)future perspective

I2E API in KNIME – batch indexing, text-mining and (relational) data retrieval


Page 6: Chemistry-Enriched Patent Curation

Linked tools/technologies

1. KNIME/XPATH

2. ChemAxon ChemCurator (ChemCC)

3. Other ChemAxon tools in KNIME nodes (document2structure/d2s, Naming, Molconverter, Structure checker, Standardizer, …)

4. Text/data-mining – Linguamatics I2E (+I2E Chemistry)

5. Optical Structure Recognition – Keymodule CLiDE Batch



Page 8: Chemistry-Enriched Patent Curation

Computer-aided chemical data extraction

English, Chinese and Japanese N2S

Markush Editor

Structure Checker

Hit visualization

Third party OSR technologies

ChemCurator (ChemCC)

Árpád Figyelmesi, ChemAxon UGM 2015

Page 9: Chemistry-Enriched Patent Curation

ChemCurator (ChemCC)

Name to Structure

Support for many nomenclatures (common, drug names, …)

IUPAC names

Custom dictionaries

English (2008)

Chinese (2013)

Japanese (2014)


Page 10: Chemistry-Enriched Patent Curation

Compound Extraction View

Compound list

Project explorer

Annotated document

Selected structures

ChemCurator (ChemCC)


Page 11: Chemistry-Enriched Patent Curation

Markush Extraction View

Markush editor

Example structures

Annotated document

Project explorer

Selected structures

Structure checker

ChemCurator (ChemCC)


Page 12: Chemistry-Enriched Patent Curation

General Document Curation

Extract Markush Structures from patents

Extract specific structures

Journal articles

Company reports

Patent examples

Structure extraction wizards

Exclude fragments, chemical elements, etc.

ChemCurator (ChemCC)


Page 13: Chemistry-Enriched Patent Curation

ChemCurator (ChemCC)

Integration & Information Sharing

Other ChemAxon products:

Direct IJC schema connection

Project sharing function

Accessible from Plexus, IJC, etc.

Third party tools:

Standard file formats

Export functions

Easily processable projects



Page 15: Chemistry-Enriched Patent Curation

Semi-automatic Patent Curation Workflow a) input sources and b) bibliographic data

a) Input sources

files with a list of patent IDs

XML collection

b) Retrieval of bibliographic information and attachment data

family ID, patent references, expiration date, etc.

Attachment files MOL/CDX (US-patents only), TIF files

….


Page 16: Chemistry-Enriched Patent Curation

Semi-automatic Patent Curation Workflow c) chemistry retrieval/extraction/filtering

1. ChemCurator branch

data retrieval (XML, attachments) from IFI Claims Direct BI-server

ChemCurator project creation/sharing/annotation → HTML output

Chemistry extraction with name2structure/document2structure → SDF output

Generation of pre-annotated patent set stored as ChemCC projects

Faster, but lower quality within the chemistry extraction process


Page 17: Chemistry-Enriched Patent Curation

2. KNIME branch

- OCR-error CLEAN-UP in KNIME → improved chemistry recognition

- MOL/CDX/TIF - standardizer, structure checker → filter out formulas, solvents, R-groups

Higher quality and more control in the chemistry extraction process

Semi-automatic Patent Curation Workflow c) chemistry retrieval/extraction/filtering
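The OCR clean-up step can be pictured as a small substitution pass over chemical-name strings before they are handed to name-to-structure conversion. The sketch below is illustrative only: the substitution table `OCR_FIXES` and the function `clean_ocr_name` are assumptions made for this example, not the rules used in the actual KNIME workflow.

```python
import re

# Hypothetical OCR confusions in chemical names (illustrative, not exhaustive).
OCR_FIXES = {
    r"rnethyl": "methyl",                  # "rn" misread for "m"
    r"phenyI": "phenyl",                   # capital I misread for lowercase l
    r"(?<=\d)\s*-\s*(?=[a-z])": "-",       # stray spaces around locant hyphens
}

def clean_ocr_name(name: str) -> str:
    """Apply each substitution in turn, then collapse repeated whitespace."""
    for pattern, replacement in OCR_FIXES.items():
        name = re.sub(pattern, replacement, name)
    return re.sub(r"\s+", " ", name).strip()

print(clean_ocr_name("2 - rnethyl-1-phenyIpropan-1-one"))
```

A fixed table like this only catches known confusions; names that still fail conversion afterwards would need manual review.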


Page 18: Chemistry-Enriched Patent Curation

2. KNIME branch

MOL → IUPAC

CDX → IUPAC

TIFF (via CLiDE) → IUPAC

Semi-automatic Patent Curation Workflow c) chemistry retrieval/extraction/filtering


Page 19: Chemistry-Enriched Patent Curation

Semi-automatic Patent Curation Workflow c) chemistry retrieval/extraction/filtering

2. KNIME - Chemistry “Normalization”

Merging and comparison of the converted chemistry output of MOL/CDX/TIF – 2 “quality” checks:

- IUPAC name

- string length (the output order of chemicals can differ in multi-molecule image/multi-MOL files)

OCR correction (“dictionary”-based)

(within KNIME) set up a relation between each TIFF/attachment file

1. to (one or more) IUPAC name(s)

2. to a position/section in the text/document

Workflow: Merge IUPAC → Clean-Up IUPAC → if NO IUPAC, the IMG-name is set → “Normalize” IUPAC names
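The merge step above can be sketched as a comparison of the names obtained from each source for one image. This is a minimal sketch under stated assumptions: the function `normalize_name`, the status labels, and the "most frequent name wins" rule are hypothetical; only the fallback to the image name when no IUPAC name exists comes from the slide.

```python
from collections import Counter
from typing import Optional

def normalize_name(image_id: str, names: dict[str, Optional[str]]) -> tuple[str, str]:
    """Pick one label for an image from its per-source IUPAC names.

    `names` maps a source ("MOL", "CDX", "TIF") to the converted name,
    or None when conversion failed. Returns (label, status).
    """
    found = [n.strip().lower() for n in names.values() if n and n.strip()]
    if not found:
        # No IUPAC name from any source: fall back to the image name.
        return image_id, "no-iupac"
    counts = Counter(found)
    best, _ = counts.most_common(1)[0]
    # One distinct name means all available sources agree.
    status = "consistent" if len(counts) == 1 else "conflict"
    return best, status

print(normalize_name("img-0001.tif", {"MOL": "Benzene", "CDX": "benzene", "TIF": None}))
```

Records flagged as "conflict" are the ones a curator would inspect by hand.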

Page 20: Chemistry-Enriched Patent Curation

Semi-automatic Patent Curation Workflow d) TIF/attachment replacement with IUPAC names

Chemistry present as text is recognized and extracted either via

- Text-mining (I2E Chemistry – d2s is working behind the scenes) or

- Within KNIME/ChemCC using annotate/molconvert

Replacement: <chemistry> vs IUPAC

IUPAC-enriched XML
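The replacement step can be sketched with Python's standard XML library. The element layout assumed here (a `chemistry` element wrapping an `img` element with a `file` attribute) and the function `enrich_with_iupac` are assumptions for illustration; real patent XML may use different tag and attribute names.

```python
import xml.etree.ElementTree as ET

def enrich_with_iupac(xml_text: str, names_by_file: dict[str, str]) -> str:
    """Replace the image inside each <chemistry> element by its IUPAC name.

    Elements whose image has no known name are left untouched.
    """
    root = ET.fromstring(xml_text)
    # Materialize the list first so the tree is not modified while iterating.
    for chem in list(root.iter("chemistry")):
        img = chem.find("img")
        if img is None:
            continue
        name = names_by_file.get(img.get("file", ""))
        if name is None:
            continue  # keep the image reference when no name is available
        chem.remove(img)
        chem.text = name
    return ET.tostring(root, encoding="unicode")

sample = '<p>Compound <chemistry id="c1"><img file="a.tif"/></chemistry> was made.</p>'
print(enrich_with_iupac(sample, {"a.tif": "benzene"}))
```

Keeping the wrapping element means the position of the structure in the text is preserved for later text-mining.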


Page 21: Chemistry-Enriched Patent Curation

OCR-errors in chemical names

Semi-automatic Patent Curation Workflow d) TIF/attachment replacement with IUPAC names

TIF

CDX

MOL

This text-chunk is replaced by the IUPAC name


Page 22: Chemistry-Enriched Patent Curation

XPATH/XML parsing and extraction of:

Tables

Rows - XML tags & strings

Entries - XML tags & strings

Semi-automatic Patent Curation Workflow e) Bioactivity/tabular data extraction with KNIME/XPATH
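The table → row → entry traversal can be sketched outside KNIME with Python's standard library, which supports a limited XPath subset. The tag names `table`, `row` and `entry` are assumed here for illustration, and `extract_tables` is a hypothetical helper, not a node from the workflow.

```python
import xml.etree.ElementTree as ET

def extract_tables(xml_text: str) -> list[list[list[str]]]:
    """Return every table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter("table"):
        rows = []
        for row in table.iter("row"):
            # itertext() also collects text nested in inline tags inside a cell;
            # split/join normalizes internal whitespace.
            cells = [" ".join("".join(e.itertext()).split())
                     for e in row.findall("entry")]
            rows.append(cells)
        tables.append(rows)
    return tables

sample = ("<doc><table><tgroup><tbody>"
          "<row><entry>Ex. 1</entry><entry><b>12</b> nM</entry></row>"
          "<row><entry>Ex. 2</entry><entry>340 nM</entry></row>"
          "</tbody></tgroup></table></doc>")
print(extract_tables(sample))
```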


Page 23: Chemistry-Enriched Patent Curation

IUPAC-enriched XML as source for I2E API/textmining

indexing

pre-defined queries

results retrieval

saved as SDF files (KNIME)

Semi-automatic Patent Curation Workflow f) Text-/datamining with Linguamatics I2E via KNIME

Text-mining retrieved (chemistry-related) information

Example No.

Bioactivity data from tables

Claims, regions where chemistry appears in patents

Genes, diseases


Page 24: Chemistry-Enriched Patent Curation

Semi-automatic Patent Curation Workflow f) Bioactivity Data using I2E multi-queries – 2 steps

Source: (IUPAC-enriched) XML

1. Example No. – IUPAC

2. Example No. – Bioactivity data

[Figure: table and image from the XML; for comparison, the same chemistry in the PDF. Labels: IUPAC, Bioactivity, Example No.]
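The two query steps share the example number as a key, so their results can be joined afterwards. The sketch below uses plain Python lists as stand-ins for the two result sets; it does not call the I2E API, and the function `join_on_example` and the field names are hypothetical.

```python
def join_on_example(names: list[tuple[str, str]],
                    activities: list[tuple[str, str]]) -> list[dict[str, str]]:
    """Join (example_no, iupac) hits with (example_no, bioactivity) hits."""
    iupac_by_example = dict(names)
    joined = []
    for example_no, activity in activities:
        iupac = iupac_by_example.get(example_no)
        if iupac is None:
            continue  # activity row without a resolved structure
        joined.append({"example": example_no,
                       "iupac": iupac,
                       "bioactivity": activity})
    return joined

step1 = [("1", "benzene"), ("2", "toluene")]
step2 = [("1", "IC50 = 12 nM"), ("3", "IC50 = 5 nM")]
print(join_on_example(step1, step2))
```

Activity rows whose example number has no structure are dropped here; in practice they might instead be kept for manual follow-up.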

Page 25: Chemistry-Enriched Patent Curation

Semi-automatic Patent Curation Workflow g) Visualize data-/textmining results in ChemCC

SDF file imported into ChemCC project + automatic mapping to existing chemistry


Page 26: Chemistry-Enriched Patent Curation

Lessons learned, weak-points, limitations

1. Advantages KNIME Full-Mode (MOL/CDX/TIF) vs ChemCC branch

chemistry check/normalization – 3 input sources → improved quality

improved chemistry recall - ALL images (incl. tables and drawings)

More filtering options in KNIME workflow vs ChemCurator only

IUPAC-enriched XML as new source for I2E

Advantages ChemCC vs KNIME Full-Mode (MOL/CDX/TIF)

faster

Image processing with CLiDE is already integrated with naming


Page 27: Chemistry-Enriched Patent Curation

Lessons learned, weak-points, limitations

2. No full automation of the workflow due to lack of homogeneity in patent data (US vs WO, EP, etc.)

Missing attachment files

No tables present in XML

Error rate in chemistry recognition (OPSIN vs n2s/d2s)

NEEDS: different workflows/branches, patent-files clean-up (OCR)

3. Time- and computational-resource-consuming process


Page 28: Chemistry-Enriched Patent Curation

Outlook

1. KNIME Workflow

Add new data fields to chemicals: BI-internal codes, genes, targets, etc.

Usage of ChemCC html output as source for textmining

Ontology mapping

Expand workflow by including other sources (internal PDF, literature full-text)

Use KNIME to connect to BI-internal workflows, DBs, etc.

chemistry-linked information in a patent DB → improved (semantic) search


Page 29: Chemistry-Enriched Patent Curation

Outlook

2. ChemCurator

Improved n2s

New command-line functions

Complex-phrase requests from IFI server

Improved SDF import

Preprocessing wizards


Page 30: Chemistry-Enriched Patent Curation

Thank you!
