Bio <-> Chem

Updates: the unmapped code names have been appended to the end of this post for crowdsourcing, a subsequent post has added the AZ/MRC compound list, and we now have a paper out on the combined results.

**************************************************

Interest in the NCATS set of 58 compounds for academic repurposing has recently been invigorated by two slide presentations at the Philadelphia ACS meeting, blog posts and some personal e-mail contacts. It turns out I was neither alone in pointing out the problems associated with project tendering for blinded clinical candidates nor in expending some effort in trying to map the names to structures (the other groups are mentioned in the slides from Antony Williams and the blog from Sean Ekins). Suitably inspired, I managed to track down three more name > strucs, as shown in the image hits below.

While OSRA did well on some previous images it only picked up a ring or two for these three new cases so I actually had to sketch them. The "orphan" provenance of a Taiwanese chemical supplier for the JNJ39393406 structure is interesting and somewhat unusual (did they pick it from SciFinder perhaps?). I could find no corroboration for the SMILES output from the sketcher (C1=CC(=CC2C1OC(O2)(F)F)NC3=N[N](C(=N3)C4=CC=NC=C4)CCC(N(C)C)=O) because this had no exact matches or high similarities in PubChem or SureChemOpen. However, the information supplied by Janssen to NCATS specifies the compound as a "positive allosteric modulator at the nicotinic α7 receptor" and the closest match in PubChem is CID 24850110 (below).

Not only does this look like a plausible analog of the vendor structure but it also has a SureChemOpen exact match to US-20090253691-A1 from Janssen, where the abstract states: "invention particularly relates to positive allosteric modulators of nicotinic acetylcholine receptors". Lo and behold, browsing the PDF revealed the vendor structure as compound 33 on page 40 (below).

This is listed with a pEC50 of 6.2, which is mid-potency within the range covered in the large SAR table on page 74. The interesting corollary here is that SureChemOpen has not yet completed their image extraction back-fill so this is likely to be dropped in eventually (see patent mining section below). So there we have it ... possibly. If anyone from Janssen is prepared to corroborate the identity of JNJ39393406 I would be pleased to acknowledge this in an update. Note that it is arguably more important for them to do this if the vendor structure-to-code-name assignment is wrong, rather than right! (see Live-chemical-structure-blogging). Below I have included my revised identification list (now with three more structures than the previous post) as brief provenance descriptions with PubChem CID links.

JNJ-39393406 sketched from vendor entry, analog is CID 24850110	C1=CC(=CC2C1OC(O2)(F)F)NC3=N[N](C(=N3)C4=CC=NC=C4)CCC(N(C)C)=O
CP-945598 otenabant	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10052040
LY500307 (Erteberel?)	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10286159
AZD0530 saracatinib	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10302451
JNJ-18038683 Chemicalize supp.data from PMID:22570363	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11151899
PF-04136309 = PF-4136309	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11192346
AZD1981 (TTD mis-map to CID 5311037) OSRA via PMID: 21944852	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11292191
CE-326597 OSRA via PMID: 21493064	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11541667
AVE8134 PMID: 22212431 > OSRA	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11625114
CE-210666 OPSIN http://issx.confex.com/issx/15na/webprogram/Paper11299.html	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11697831
PH-670187 deramciclane, EGIS-3886	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=119590
SB223412 talnetant	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=133090
SAR115740 , sketched from image in PMID: 19063991	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=15984196
AZD1656 OSRA from http://www.citeulike.org/user/cdsouthan/article/10861475	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16039797
PF-03654746 chemicalized from PMID: 21928839	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16119086
ABT-089	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=178052
HMR1766, ataciguat	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=213037
PF-05416266 (senicapoc, ICA-17043)	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=216327
AZD7325 Chemicalize from PMID: 22122233	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=23581869
GSK1004723 (TTD mis-map to famitodene CID 5702160 ?)	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=24803482
PF-04191834 OPSIN from PMID: 20378715	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=24986635
GSK835726 (MeSH + TTD, TFA salt is CID 16219413)	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=5311268
PF-00913086 prinaberel ERB-041	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=5326893
PF-03463275 = PF3463275 PMID: 20186106 > MeSH > OPSIN	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=56657376
AVE5530 canosimibe	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=56841608
CP-448187 elzasonan (Cl salt is CID 6506051)	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=6914152
AZD0328	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=9794392
GW274150	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=9797017
SSR149744C celivarone	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=9807128
AZD3355 lesogaberan	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=9833984
BMS-562086 pexacerfont	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=9884366
ZD4054 zibotentan	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=9910224
AZD2171 cediranib	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=9933475
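
For anyone who wants to work with the list above programmatically, here is a minimal sketch of pulling the CID out of each line. It assumes each entry is a provenance label, a tab, then a PubChem summary URL ending in `cid=<number>`. The function name `parse_mapping_line` is my own invention, and lines that carry something else after the tab (such as the sketched JNJ-39393406 SMILES) simply come back as None.

```python
import re

# Matches the trailing "cid=<digits>" of a PubChem summary URL.
CID_URL = re.compile(r"[?&]cid=(\d+)\s*$")


def parse_mapping_line(line):
    """Split one 'label<TAB>PubChem URL' line into (label, cid).

    Returns None when the part after the last tab is not a CID link.
    """
    label, sep, target = line.rpartition("\t")
    match = CID_URL.search(target)
    if not sep or match is None:
        return None
    return label.strip(), int(match.group(1))


# Three lines copied from the list above.
SAMPLE_LINES = [
    "CP-945598 otenabant\thttp://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10052040",
    "AZD0530 saracatinib\thttp://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10302451",
    "JNJ-39393406 sketched from vendor entry, analog is CID 24850110\t"
    "C1=CC(=CC2C1OC(O2)(F)F)NC3=N[N](C(=N3)C4=CC=NC=C4)CCC(N(C)C)=O",
]

# Keep only the lines that resolve to a CID.
mapped = [pair for pair in map(parse_mapping_line, SAMPLE_LINES) if pair is not None]
```

The free-text provenance is kept intact as the label, since that is where the interesting part of each mapping lives.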

To consolidate this update I searched each of the CIDs against SureChemOpen. This was done with the canonical SMILES string, starting with exact matches but, if these were negative, backing off to a similarity search. The links presented below are generally the oldest and presumed first publication. Note that the three older compounds with INNs have been named as prior art and in mixtures in 100s of patents. It should be possible to use date cutting in SureChemOpen to find the earliest filings as IUPACs or image-extracted structures (but I can't be bothered just now). The other thing I have not done is check each publication to see if the presumed assignee, target, SAR data etc. tally with those in the NCATS PDFs, but the ones I glanced at seemed to fit (anyone interested in details can contact me). Note that 30 out of 33 patent whacks is not bad going and indicates, at least for this set, most structures have been exemplified and successfully extracted rather than being specified only in a Markush nest. The results are listed below.
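
The exact-then-similarity routine described above can be sketched as a small function. This is only an illustration of the back-off logic: `exact_search` and `similarity_search` are hypothetical callables standing in for whatever structure search you have to hand (they are not a SureChemOpen API), and the default threshold steps are my assumption.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# (kind of match, similarity threshold used or lowest tried, document IDs)
SearchResult = Tuple[str, Optional[float], List[str]]


def search_with_backoff(
    smiles: str,
    exact_search: Callable[[str], List[str]],
    similarity_search: Callable[[str, float], List[str]],
    thresholds: Sequence[float] = (0.95, 0.9, 0.85, 0.8),
) -> SearchResult:
    """Try an exact match first, then similarity at decreasing thresholds."""
    hits = exact_search(smiles)
    if hits:
        return ("exact", None, hits)
    for threshold in thresholds:
        hits = similarity_search(smiles, threshold)
        if hits:
            return ("similarity", threshold, hits)
    # Nothing found: report the lowest threshold tried ("no hits down to 0.8").
    lowest = thresholds[-1] if thresholds else None
    return ("none", lowest, [])


if __name__ == "__main__":
    # Toy stand-ins for a real patent structure search (made-up data).
    toy_index = {"CCO": ["DOC-A"]}
    exact = lambda s: toy_index.get(s, [])
    similar = lambda s, t: ["DOC-B"] if s == "CCN" and t <= 0.85 else []
    for query in ("CCO", "CCN", "CCC"):
        print(query, search_with_backoff(query, exact, similar))
```

Returning the threshold alongside the hits keeps a record of how far the search had to back off, which is the same information noted for the misses in the table.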

CID		Patent match
10052040		https://open.surechem.com/en/document/EP-1890767-A2/
10286159		https://open.surechem.com/en/document/EP-1626974-B1/
10302451		484 patent hits
11151899		https://open.surechem.com/en/document/WO-2005040169-A2/
11192346		https://open.surechem.com/en/document/WO-2005060665-A2/
11292191		https://open.surechem.com/en/document/WO-2004106302-A1/
11541667		https://open.surechem.com/en/document/US-20110130365-A1/
11625114		https://open.surechem.com/en/document/US-6635655-B1/
11697831		https://open.surechem.com/en/document/WO-2005090300-A1/
119590		https://open.surechem.com/en/document/EP-0694299-A1/
133090		1650 patent hits
15984196		https://open.surechem.com/en/document/US-20080125459-A1/
16039797		https://open.surechem.com/en/document/WO-2007007041-A1/
16119086		https://open.surechem.com/en/document/WO-2007049123-A1/
178052		https://open.surechem.com/en/document/WO-2000042044-A1/
213037		https://open.surechem.com/en/document/WO-2000002851-A1/
216327		https://open.surechem.com/en/document/WO-2000050026-A1/
23581869		https://open.surechem.com/en/document/US-20070142382-A1/
24803482		https://open.surechem.com/en/document/WO-2007122156-A1/
24986635		https://open.surechem.com/en/document/US-20080125474-A1/
5311268		https://open.surechem.com/en/document/WO-2002096934-A1/
5326893		no hits down to 0.8
56657376		https://open.surechem.com/en/document/US-20060229455-A1/
56841608		https://open.surechem.com/en/document/WO-2002050027-A1/
6914152		https://open.surechem.com/en/document/EP-1451166-A1/
9794392		https://open.surechem.com/en/document/WO-1999003859-A1/
9797017		no hits down to 0.75
9807128		https://open.surechem.com/en/document/US-20030225100-A1/
9833984		https://open.surechem.com/en/document/WO-2001042252-A1/
9884366		https://open.surechem.com/en/document/WO-2002072202-A1/
9910224		https://open.surechem.com/en/document/WO-1996040681-A1/
9933475		450 patent hits including deuterateds

The utility of patent mapping (n.b. there are additional open patent links via PubChem sources for some of these entries) in the context of in silico and/or in vitro investigations on these compounds is at least threefold. Firstly, some may include substantially larger SAR data sets (e.g. IC50 tables) than were eventually included in journal articles. Secondly, they may include other unpublished biological and/or ADMET data. Thirdly, analogs that are very useful (essential even ?) for a range of comparative investigations, will not only have their synthesis routes described, but also, one might assume, in cases where the NCATS proposals have been approved, that the companies concerned could donate them.

We can compare the current small-molecule efforts as outlined in the Collaborations-to-get-the-ncats-library-of-industry-provided-reagents post, where it is reported that Chris Lipinski found 36 (via SciFinder, Thomson Reuters Integrity and web searches) and Tudor Oprea et al., 41 (via IBM US Patents database, Google and publications). This leaves me trailing in third place with 33 structures but note that no commercial databases were used and some relevant publications were not on the Göteborg University Library subscription list. I did receive some useful comments on my original post including the Google images trick.

There are a lot of interesting corollaries to all of this but I shall just introduce some brief ones here (they also depend on intersecting the three sets to determine concordance). The first is that it would be useful to know what the sources were for the three or more mappings that I "missed" but that were presumably explicitly curated in SciFinder and/or Thomson. The reason is that these products, comprehensive as they are, cannot (I presume) disclose proprietary mappings even via company CDAs because their content is licensed to many users (~ 0.3 million globally?). Thus any code-name-to-struc they capture has to have a public primary source (including subscription publications) even if this is just a meeting poster or slide image that never got Google crawled. The only possible exception I can think of is where CAS may be in possession of a code-name-to-struc as a necessary prerequisite for an INN and/or USAN application, but presumably it cannot disclose to users until the WHO PDF has appeared. The second corollary is code-name-to-struc occurrence in patents. This is unlikely to be in first-filings because the identity of the eventual clinical candidate (that they may not have selected or given a development code to at filing time anyway) is exactly what applicants generally try to obfuscate but also exemplify and claim as an IUPAC. Code names can thus only be back-mapped to structures in the early filings (as in the list above). I have come across code names with their associated IUPACs in patents but these tend to be associated with later filings of formulations or combinations and not the first disclosure of a code-name-to-struc.

Last but not least, here are the sharing bits:

1) The links above should be live (but you will need the free SureChemOpen sign-up for the patents)
2) The complete Excel sheet is available for download at http://figshare.com/articles/NCats_Compounds_with_identifications/92850
3) You can now "View my collection, "32 NCATS CIDs" from NCBI". If you open these up there is a lot of information in the constitutive filters on the right hand side, including 15 active in assays and 15 available from vendors. Note also you can save this to your own MyNCBI, perform a range of analyses with the PubChem toolbox and download the structures as a set of SD files or any other format.
4) As a test, I have submitted one new synonym to PubChem in the form of AZD1656 in SID 136946384. I may do more but I am awaiting imminent enhancements to their submission system and I would also prefer to eventually do this collaboratively, so the mapping provenances can be independently corroborated (perhaps even by the companies concerned?) before they become enshrined in the PubChem synonym compilations.

Addendum 25 Aug. Those small-molecule codes I have been unable to map or remain equivocal are pasted below (but note other parties may have dug some of them out). If anyone can resolve any of these from declarable sources (but not necessarily be personally held to their provenance, unless they were the project leader or portfolio manager!) they are most welcome to post such new information (e.g. even just a pointer to an image) and thereby be attributed for extending the mappings. Ideally they could add a comment to this blog post but any open channel would do.

ABT-639
LY2828360
SSR150106
AZD2423
JNJ-39269646
PF-05190457
BMS-820132
ABT-288
PF-04995274 publication links are PubMed -ve
LY2590443
SD-7300/SC-81490 referenced in PMID 20726512 but points to SC-78080/SD-2590
AZD1236 possibly in PMID: 21624491 but TTD-only mapping to CID 56603698
BMS-830216
AZD5904 (TTD-only mapping to CID 177992 )
SAR103168
CP-601927, CP-601,927 possibly in PMID: 21594972
SD-6010 (SC-84250) assuming SC-842 possibly in PMID: 17672879
AZD7268
AVE0847
PF-05019702 (PRA-27) = WAY-257027
AZD9056 possibly in PMID: 21440623
LY2245461
SSR97225

A number of things came into conjunction recently that inspired me to embark on my first WikiGenes edit, including the fact it was now being included as a cross-reference in key resources such as Wikipedia, HGNC and iHOP (maybe UniProt next ?). I have no inclination to check and/or contribute to all the protein records I have had some engagement with, but the choice for my first effort was LACTB . The reasons include confusion around the name that came up in a recent post (Our-human-beta-lactamase-is-not), I had had some discussions with the HGNC team some years ago about "symbol hijack" (see below) and also tracked some new and interesting literature connections, including the gene becoming a weeny-bit famous by being named as a validated novel obesity locus in a 2008 Nature abstract (PMID 18344982).

I was half way through the edits when I realized it would have been instructive to archive what the “bot” had managed to populate the proto-entry with, but you can get an inkling of this from the mouse Lactb entry. It’s a bit of a dog's breakfast but, encouraged by the Wiki rallying declaration “Homo sapiens can do much better” (not to mention being a member of the International Society for Biocuration), I set to work on the human entry as best I could to make it useful but concise. A snapshot of the result is below.

I encountered a surprising variety of false-positive and false-negative problems, just a few examples of which I shall relay here.

Primary biochemical function. So what does this protein actually do? In this case we don’t really know. The closest data support we can get is filament formation in mitochondria. However, there are also false-positives – things the protein probably is “not” but that database annotation and/or cross-reference says it “is”.

1) LACTB is probably not a mitochondrial ribosome component. This statement should not be taken to indicate the reported identification of a compendium of such proteins was not a good paper (PMID 11551941). However, low-abundance protein complex isolation and MS tryptic peptide profiling will inevitably include a sprinkling of co-purification artefacts. As we know, it is always perilous to pronounce absolutes in biochemistry, but, in addition to the simple fact it’s just too darn big, the balance of positive evidence (e.g. the fibrils and complex 1 association), along with the absence of independent corroborative data, would not support a conclusion that LACTB is a constitutive mitochondrial ribosomal subunit. Notwithstanding the eventual revision by HGNC (ID 16468) after I had pointed this out (LACTB > MRPL56 > LACTB), the officially deprecated synonym, MRPL56, seems likely to persist. With nearly 8K hits for "mitochondrial ribosomal protein L56" including the NCBI RefSeq and Entrez Gene entries, this could be forever.

2) The Achilles heel of homology-based automated annotation is exposed when implicit transitivity of biochemical function, even where high similarity scores unequivocally establish common ancestry, can be abrogated by sequence drift. For example, at least 12% to 15% of human proteases are “dead” in the sense of probably being catalytically incapacitated (or at least, not having significant protease activity in vivo) by relatively minor sequence changes. Mammalian LACTB falls into this category, even though there is no traceable publication where a negative experimental result supports the inference (maybe no-one has even bothered to test it ?). The Entrez Gene curation (despite erroneously promoting MRPL56 to primary function!) actually picks this up viz “The encoded protein has some sequence similarity to prokaryotic beta-lactamases but most of the residues that are responsible for the beta-lactamase activity are not conserved between the two proteins”. Notwithstanding, the Swiss-Prot record explicitly includes "peptidase S12" in the sequence family comment line; in the GO keywords we find “Molecular function - hydrolase activity, inferred from electronic annotation. Source: UniProtKB-KW”; and the MEROPS cross-reference implicitly classifies it as an active peptidase but explicitly assigns "putative" to the mouse sequence.

Cross-References, good, bad, equivocal, persistent, and missing. The fidelity of cross-references in database entries has always been important for scientists inspecting, reading and cogitating, while navigating between records and the literature. However, it is now also becoming a crucial determinant of the eventual linked big data quality and mining results. We can look at a few examples from this protein including differences between the 12 PMIDs curatorially aligned to the Swiss-Prot entry and 16 to the RefSeq. Five were unique to RefSeq unless you toggled in the four extra "computationally mapped citations" on the Swiss-Prot side (I'm pleased to say our original paper, PMID 11707067, was on both sides).

1) As mentioned, the HGNC and Swiss-Prot curators (but it's an interesting question if either defers to the other) have now pitched against MRPL56 as primary function. However, the cross-reference from which this was inferred stays in. But by what criteria ?

2) I find historical papers on high-throughput secreted protein discovery wryly amusing, partly because it was such a rolling bandwagon with everyone hoping to find the next EPO and also because I jumped on said wagon with a proposal to sequence proteins from blood plasma fractions. By any stretch of the imagination LACTB is not likely to be a secreted protein. Notwithstanding, it got picked up in the (slightly) famous Genentech "secreted protein discovery initiative" (PMID 12975309) that is forever enshrined as a cross-reference for more than a few retrospective false-positives. My guess is that the mitochondrial targeting sequence scored above the signal peptide prediction cut-off used in the study.

3) The "beta-lactamase reporter assay" (PMID: 17517902) is a curious false-positive because as a GeneRIF it should have been a considered third-party curation submission. It looks like someone just got their lactamases mixed up because LACTB cannot provide an enzymic read-out if it's dead.

4) There is an example of a "missing" but significant false-negative for both UniProt and RefSeq that happens to be PubMed-negative. This was "A mitochondrial protein compendium elucidates complex I disease biology" (PMID 18614015). I can't remember my own provenance for picking this up, other than the fact that one of the senior authors is an ex-colleague of mine, so I'll give the credit to the Mouse Genome Database for their Lactb entry. But, you might say, this was a mouse paper and therefore should not be transitively, cross-species assigned ? OK, but in this case the mouse mitochondrial experiments are plausibly extrapolatable so I have included these in my WikiGenes entry. In this aspect, even the major databases can be inconsistent, since the human MRPL56 symbol was assigned transitively from bovine data.

So where does this leave us ? Related to these Wiki curation efforts I encountered at least four universally confounding issues for protein annotation head-on. Firstly, why can't curators be empowered to overrule the annotation bots ? They could remove hydrolase or peptidase from the Swiss-Prot entry and GO tree for this protein but, if anyone does prove it turns over substrates, they can always put them back in. Secondly, once they enter the global pipelines by any route there appears to be no mechanism to purge false-positive cross-references that are (in most cases) easy for an expert to recognise. These become perpetually enshrined across all databases. Thirdly, it is becoming increasingly difficult to discriminate between primary, independent and de-novo expert curated cross-referencing and cases where this is secondary (i.e. simply being shuffled between databases) or even automatic (i.e. the output is parsed but never eyeballed by an expert). In the grand round this confusion can become dangerously circular. Fourthly, the challenge of species annotation transitivity (typically human < > rat < > mouse) seems close to impossible to resolve, even if only from the understandable predominance of mixed data (e.g. human phenotype associations tested via mouse KO). I would argue for making human protein annotation a bit more apex for this transitivity but fully acknowledge the caveats (i.e. in this case mouse mitochondria are a lot like human ones but not identical).

By the way, if anyone finds my WikiGenes entry useful I'd be pleased to know. I might then even be inspired to fix the Wikipedia one.

Well, would you Adam ’n Eve it ? Hot on the heels of the SureChem announcement of their pending big PubChem submission for 4Q, 12 million searchable structures in SureChemOpen, and 0.28 million fresh links from chemicalize.org (more on that in a subsequent post), the deposition from SCRIPDB that opened up over the last week now connects patent-extracted structures to just a tad under 4 million CIDs. The big-three patent sources in PubChem now stand as shown in the Venn below (the lead-like filter is simply ROF plus a 350-800 Mw cut).

The large union and low intersects imply that each of the sources has substantially unique content (with respect to the other two) on the basis of CIDs. There are some partial explanations for these differences and the usual technical caveats associated with exact matching rules. While a detailed analysis would be needed to corroborate these we can consider some aspects here (I've done my best in this short space but if any details are incorrect or in the process of changing I would be pleased to amend them accordingly).

The Thomson (Reuters) Pharma source (TRP) is odd-man out in this triumvirate for six good reasons and one less-good one.

1) extraction is done manually by experts
2) includes a proportion from the literature,
3) patents from Derwent World Patent Index are selected as pharmaceutically relevant and the extraction focus is on somewhere around 10 to 100 key examples per-patent,
4) document selection is non-redundant in the sense that typically only one family member is selected for extraction,
5) the covered patent authorities with time span of capture is extensive and updated to the extent of ~ 5000 new CIDs being deposited each week (giving this one of the highest de-novo growth rates of any PubChem source)
6) image-only examples can be extracted manually along with IUPAC-only
7) the less-good feature is that non-subscribers (to the TRP web application) cannot pick up document identifiers or links from the PubChem SIDs and therefore neither from most of the 3 million CIDs that are not in the other two sources (but popping the structure into SureChemOpen gives you a good chance of a patent hit and it's worth checking for a ChEMBL and/or chemicalize.org intersect as well).

For comparing with the other two sources we can start with the (IBM + SCRIPDB) points in common

1) extraction is automated
2) processing pipelines are kind-code and patent classification agnostic (i.e. they extract equivalent patents and all non-pharmaceutical patents with chemical content)
3) the primary source is largely USPTO,
4) they both extract CWUs
5) patent document IDs and links are provided in SIDs and CIDs (although the SCRIPDB document links are still being bedded in just now)
6) both collections are frozen to pre-defined document dates,
7) no chemistry selection within document sections is possible, at least from the CID side (although this abstract suggests document location indexing may become available for the IBM set at a different open portal)
8) both depositions are sub-sets of super-sets of ~ 11 million (explicit for SCRIPDB and more of a guess for IBM)

Differences include the following;

1) SCRIPDB has an open web portal for searching structures and documents directly, including metadata associated with synthesis and other features, whereas the IBM data is a sub-set of a subscription product
2) the IBM time slice ends at 2000, SCRIPDB covers 2001 through 2010,
3) SCRIPDB has focused on the processing of Complex Work Units (CWUs) for the extraction of structures including those from images (see PMID 22067445 for details).
4) The IBM filtration factor (extracted structures > CIDs) is unknown but the SCRIPDB filtration, plus the PubChem standardization rules, was stringent in the sense of being ~ 11:4 (million)
5) For reasons I can't work out just yet (probably related to document mapping) SCRIPDB has an SID:CID ratio of ~ 6:4 whereas the other two sources are close to 1:1 because they are compound centric (one structure > many documents)
6) A proportion of IBM structures have also been mapped to PubMed IDs. Most of these are in addition to patents but I came across one or two that are PubMed-only

Before discussing differences in terms of content there is a quick piece of data slicing we can use to assist the interpretation. After source and filter selection I performed some Boolean operations on the search history as shown below.

Those interested in the details can read them off (or contact me) but the essence is as follows. Each source is sliced first by unique content in the whole of PubChem (i.e. single-source) and a second slice by single-structures (i.e. one covalent unit). Mixtures in the unique component of each source can thus be read off as 10% for IBM, 9% for TRP and 5% for SCRIPDB. However, the TRP uniqueness measurement of ~10% is confounded by circularity because Discovery Gate was matching the common DWPI content 1-for-1 between 2007 and 2011 (but has since ceased depositing). Notwithstanding, SCRIPDB is ~ 70% unique and IBM ~ 50%.
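
For readers who prefer code to search-history screenshots, the same kind of Boolean slicing can be sketched with plain Python sets. The CID numbers here are made up purely for illustration, and the function names are mine; the real slices were done inside PubChem, not like this.

```python
def venn_regions(a, b, c):
    """Return the seven disjoint regions of a three-set Venn diagram."""
    return {
        "only_a": a - b - c,
        "only_b": b - a - c,
        "only_c": c - a - b,
        "a_and_b_only": (a & b) - c,
        "a_and_c_only": (a & c) - b,
        "b_and_c_only": (b & c) - a,
        "all_three": a & b & c,
    }


def unique_fraction(source, everything_else):
    """Fraction of a source's CIDs found nowhere in the comparison set."""
    if not source:
        return 0.0
    return len(source - everything_else) / len(source)


# Toy CID sets standing in for the three patent sources (not real data).
trp = {1, 2, 3, 4}
ibm = {3, 4, 5}
scripdb = {4, 6, 7, 8}

regions = venn_regions(trp, ibm, scripdb)
sizes = {name: len(cids) for name, cids in regions.items()}

# The seven regions partition the union, so their sizes add up to it.
union_size = len(trp | ibm | scripdb)

# Uniqueness of one source relative to the other two.
scripdb_unique = unique_fraction(scripdb, trp | ibm)
```

Note that uniqueness within the three-way Venn is not the same as uniqueness in the whole of PubChem; for the latter, `everything_else` would have to be the union of every other depositor.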

While it forms a useful check, the salts and mixtures aspect would seem insufficient to explain the big differences. There are a number of blog-worthy puzzles arising from these and other metrics that could be contemplated (or numerology as an ex-AZ colleague calls it). I will just start with a couple but let me first make clear that this should not be interpreted as arcane or invidious source comparison. It is about attempting to understand the selectivity of different extraction pipelines and what this means for their relative value and utility (i.e. which horses for which courses).

The top puzzle for me is the low overlap between TRP and the others (23%). While some of this could be accounted for by the literature extraction content, the most parsimonious hypothesis would be that both automated pipelines are missing a substantial part (up to a million or more ?) of the pharmaceutical example structures selected by expert curators. This would need corroboration against parallel extracted and manually benchmarked corpora but, as is often the case in patent informatics, we continually come up against basic questions for which there is no declared data. One of these is: what proportion of examples from pharmaceutical patents are image-only structures without a corresponding IUPAC exemplification? If this were high it would be a partial explanation for Thomson-unique content but we might have expected SCRIPDB to have captured some of the same images via the CWUs.

Puzzle number two is the unique content of SCRIPDB that, in absolute terms of ~2.8 million, is currently not only one of the highest of any source in PubChem but is also responsible for the CID total going north of 35.5 million. Some portion of this (not in TRP) should be from non-medicinal chemistry patents. Here again the pharma:non-pharma (but including academic filings) extractable chemistry ratio from the patent corpus is unknown. Some commercial databases should be able to do a C07D-vs-the-rest structure ratio but I have never come across such results. While maybe a good chunk is non-pharmaceutical (but not cleanly even on this code) the similar lead-like properties would argue against this.

Number three is less of a puzzle (because it can be examined via further slicing and dicing) but still slightly surprising. The patent union of 9.2 million intersects with the literature-extracted ChEMBL 760,889 set at 45% (342,604). The fact that most of this (293,393) comes from TRP could be the combination of a) Thomson's literature extraction, b) a lot of prior structures appearing in formulation and combination patents and c) the obvious patent-then-publish contribution. The good news here is that (via IBM or SCRIPDB SID links or a SureChemOpen pop) many more structures from papers can now be back-mapped to patent numbers from CIDs. This is useful because the citation of patents in medicinal chemistry papers is generally thin (another statistic that no-one knows?). Note also that these absolute CID intersects are big underestimates because ChEMBL at-source has an additional 452,350 structures brought in from confirmatory BioAssays.
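
As a quick sanity check on the numbers just quoted, the percentages can be recomputed from the three counts given in the text. The variable names are my own; nothing here goes beyond the arithmetic.

```python
# Counts as quoted in the paragraph above.
chembl_cids = 760_889          # literature-extracted ChEMBL set in PubChem
patent_intersect = 342_604     # ChEMBL CIDs also in the patent union
from_trp = 293_393             # portion of that intersect coming from TRP

# Share of the ChEMBL set that intersects with the patent union.
intersect_pct = 100 * patent_intersect / chembl_cids

# Share of the intersect that is accounted for by TRP.
trp_share_pct = 100 * from_trp / patent_intersect

print(f"ChEMBL in patent union: {intersect_pct:.1f}%")
print(f"of which from TRP:      {trp_share_pct:.1f}%")
```

So "most of this" is fair: roughly six in every seven of the intersecting structures come via TRP.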

Just for the record, the low overlap between IBM and SCRIPDB is not unexpected because the document set is nominally non-overlapping by one year. However, the fact that patent family publication dates can span many years, together with what could be termed prior-art and common chemistry, might account for the 0.2 million in-common.

If anyone has data-supported insights related to these puzzles and/or can suggest further tests that could be done with the PubChem toolbox (I can think of a few already), comments are welcome. It should also be noted that, regardless of any head-scratching this numerology might induce, low overlap per se is good for the complementary utility of these sources in PubChem as a whole. I will also take the opportunity to make an open guess. When the SureChemOpen set arrives in PubChem I predict it might straddle these three fairly evenly in terms of compounds-in-common but will also have significant unique content.

My last slice for this post takes a look at some of the more exotic non-pharma stuff that comes through the automated systems. From "IBM and SCRIPDB but not TRP" we get 135,459 CIDs. Rank these by Mw and we get the visually striking mix below.

Opening up one of the more spidery-looking entries a few pages in gives us CID 21305366 below.

This CID opens out to the 5-SID set below.

From SID 143079746 via SCRIPDB we get to "Thermally hardenable polymer binding agent in the form of a powder" as US6716922 from BASF Aktiengesellschaft. Sure enough, down in the description pages we can find what looks convincingly like the CID.

There are at least two quirks related to this entry. Firstly, it looks like an image extraction. Secondly, the ChemSpider entry (that may have originated in a pre-2007 SureChem extraction) is deprecated but independently supported by the other two sources.

Update: two comments below the post.

Since I have been helping out with database searches for the OSDD antimalarial work in Sydney I was interested to see that recent press releases concerning MMV390048 had been picked up at In the Pipeline. It took about a minute (OSRA did the image > struc off the bat in this case, no editing needed, probably due to the nice thick lines) to connect this compound to CID 53311393.

This is a Thomson Pharma-unique structure and thus probably a patent extraction (just for the Google bots and possibly a little extra to this blog I'll pop in the InChIKey RTJQABCNNLMCJF-UHFFFAOYSA-N). It then took about another minute to paste the SMILES over to SureChemOpen and find the single exact whack to WO-2011086531-A2 published on 2011-07-21 as "New Anti-malarial Agents" with Medicines For Malaria Venture as assignees. The next step was a bit more tricky because, although the structure could be toggled through 4 instances in the document, there were no example numbers in the text. It turns out these numbers are in the images but using the useful images-in-line PDF mark-up at the WIPO site for WO2011086531 means we can at least map things together. One route is via the Mw from SureChemOpen > image > SAR table (as below).
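For anyone wanting to script the InChIKey > CID step rather than click through, here is a minimal sketch. It only builds the lookup URL and makes no network call. The assumptions are mine, not part of the original workflow: the helper names `is_inchikey` and `cid_lookup_url` are made up for illustration, and the URL follows the PubChem PUG REST pattern for InChIKey input as I understand it, so check it against the current PubChem documentation before relying on it.

```python
import re
from urllib.parse import quote

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

# Assumed base address of the PubChem PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def is_inchikey(text: str) -> bool:
    """Return True if the text has the layout of an InChIKey."""
    return INCHIKEY_PATTERN.fullmatch(text.strip()) is not None


def cid_lookup_url(inchikey: str) -> str:
    """Build (but do not fetch) a URL asking PubChem for the CIDs of an InChIKey."""
    key = inchikey.strip()
    if not is_inchikey(key):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return f"{PUG_REST_BASE}/compound/inchikey/{quote(key)}/cids/JSON"


if __name__ == "__main__":
    # InChIKey quoted in the post for the MMV390048 structure.
    print(cid_lookup_url("RTJQABCNNLMCJF-UHFFFAOYSA-N"))
```

Pasting the printed URL into a browser should return a small JSON document listing the matching CID(s), if the endpoint behaves as assumed.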

There are over 80 EC50 values in table 5 and it looks like SureChem have extracted all the examples as IUPACs (there should be an imminent image back-fill loading as well). Any interested parties could therefore populate the SAR table with structures some of which have higher in vitro potencies than the lead. While on first glance, the figures from the eventual paper (PMID 22390538) look better (K1 = 25.0 nM, NF54 = 28.0 nM) note that the units are different. It would have been useful to make the MMV390048 > example 15 > structure mapping in the paper explicit but at least there is a clear (and presumably correct) mapping via Google images. One piece of context here is that at least the main mechanism of action for this compound is probably via inhibition of a plasmodium protein kinase (although this is second hand because I have not accessed the paper yet). The second piece of context is that this lead was first made and tested in September 2010.

The first trick I tried was to cluster the PubChem CID "similar structure" neighborhood as shown below, but in 3D not 2D because this gives a larger cluster.

I have highlighted three CIDs in this cluster, the middle one being MMV390048. What was unexpected, for me at any rate, was that, in 3D conformer space, this is neatly sandwiched between two structures from the GSK 2010 antimalarial screening hits. The one that looks closest in conformation, CID 44525560, has a confirmatory XC50 for P. falciparum 3D7 of 190 nM. This was deposited in Jan 2010 and the result must have been generated at least 6 months before that. Another search against SureChemOpen established that it is also exemplified as part of claim 19 in WO2011086531 and must have been picked up from there and deposited as Thomson Pharma SID 124771460 on 2011-08-15.

Now for a few conclusions, speculation and questions. Clearly the 390048 team did a great job to progress this lead derived from the kinase library. The speculation is that some of the same BioFocus compounds, run in parallel by the 390048 and GSK teams, came up as hits on both sides. The big question is related to the current antimalarial leads being pursued by the Sydney team and their collaborators (including MMV). Might these also turn out to have kinase inhibition as a primary mechanism of action ?

Update, December, we have a paper out on the results.

The collaboration between the MRC and AstraZeneca to give UK academic researchers access to development compounds has been acknowledged as a precedent for the compilation of the analogous NCATS 58-set from multiple companies. A snapshot of the MRC list of 22 is shown below.

There is plenty of background information about both sets and repurposing exercises in general available on the web. In addition, the associated issue of public code names with very-difficult-to-dig-out (VDTODO) or completely blinded structures has also generated additional blog posts (e.g. at CollabChem and Chembl-og)

I have performed the same exercise for these as for the NCATS 56 small molecules. This is a) map the code names to a structure, b) assign a PubChem CID and c) search SureChemOpen for matches to early patent filings. The summary list is pasted below, a more extended table has been deposited at Figshare and a set of links for the 12 CIDs is available as a public MyNCBI collection ( http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1HKfLlxQ0OICuWKEPOFU48tky/)

AZD code	MoA	NCATS	PubChem link	CID	SureChemOpen patent match
AZD0530	SRC Tyrosine Kinase Inhibitor	Yes	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10302451	10302451	484 document hits
AZD1236	Matrix Metallopeptidase (MMP) 9|12 Inhibitor	Yes
AZD1656	Glucokinase Activator	Yes	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=16039797	16039797	https://open.surechem.com/en/document/WO-2007007041-A1/
AZD2624	Neurokinin Receptor NK3 Antagonist		http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=23649160	23649160	https://open.surechem.com/en/document/WO-2007069977-A1/
AZD3355	GABABR1 Receptor Agonist		http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=9833984	9833984	https://open.surechem.com/en/document/WO-2001042252-A1/
AZD4017	11-beta Hydroxysteroid Dehydrogenase Type 1 Inhibitor
ZD4054	Endothelin A Receptor Antagonist		http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=9910224	9910224	https://open.surechem.com/en/document/WO-1996040681-A1/
AZD5904	Myeloperoxidase inhibitor		http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10264211	10264211	https://open.surechem.com/en/document/WO-2009025618-A1/
AZD7325	GABAA Ion Channel Stimulator	Yes	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=23581869	23581869	https://open.surechem.com/en/document/US-20070142382-A1/
AZD8529	Metabotropic Glutamate Receptor 2 Positive Allosteric Modulator
AZD1080	GSK3b Inhibitor
AZD1386	TRPV1 Ion channel Inhibitor
AZD1704	Cannabinoid CB1 receptor Agonist
AZD4619	PPARA Agonist		http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10217984	10217984	https://open.surechem.com/en/document/WO-2003051826-A1/
AZD4769	EGFR Tyrosine Kinase Inhibitor
AZD6088	Muscarinic M1 Receptor Agonist		http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=25195463	25195463	https://open.surechem.com/en/document/WO-2009034380-A1/
AZD6605	Matrix Metallopeptidase (MMP) 13 Inhibitor
AZD6703	MAPK14 (p38) tyrosine kinase inhibitor		http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=11373432	11373432	https://open.surechem.com/en/document/WO-2005042502-A1/
AZD7268	δ-Opioid receptor agonist	Yes	http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=24772484	24772484	https://open.surechem.com/en/document/WO-2008048171-A1/
AZD7687	DGAT Inhibitor
AZD8055	mTOR Serine/Threonine Kinase (mTORC1/2) Inhibitor		http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=25262965	25262965	https://open.surechem.com/en/document/WO-2009153597-A2/

I managed to dig out CIDs mapped to 12 of the 21 codes, and there are 5 compounds-in-common between the MRC and NCATS sets. Note also we have a patent-mapping full-house for the 12. These posts are about picking out the quirky details so let's see what we have...

AZD6703 will be a bit of a system test because the publication is recent and neither yet MeSH processed nor picked up by ChEMBL. It was also a doozy in the Goldilocks school of abstract drafting for lead compounds, as you can see below, where we see no less than the IUPAC, code name, target and indication all in the title (if the abstract had had some inhibition data and included the term "arthritis" this would have been an almost perfect pay-wall bypass!).

IUPACs in titles (where there are no direct MeSH > PubChem links) can be easily processed by chemicalize.org (the result is in the picture insert). This gives an exact match to CID 11373432 and a patent whack back to WO-2005042502-A1. However, what is odd is that opening out the MMDB > CID links for the PDB structures gives the set of four below but does not include 11373432.

What I think has happened is the not uncommon story of the ligand going into a crystal structure (i.e. dropped into the tube) not being exactly the same as the structure the software has pulled back out of the electron density data (see below)

In this case it was CID 56962314 on the left "out" vs presumably CID 11373432 on the right "in". Note that Thomson Pharma and SureChem independently corroborate the rendering of the latter whereas the former is an MMDB orphan. I'll try to get back to reporting what MeSH eventually does with the linking.

I'll finish off with the strange and VDTODO case of disclosing a code name > struc for AZD5904 as an HPLC internal standard. The only hit you can find is in Google Scholar (below) not in PubMed via the abstract (PMID 22592983)

The text is slightly cryptic in that the candidate IUPAC (3-[(2R)-tetrahydrofuran-2-ylmethyl]-2-thioxo-1,2,3,7-tetrahydro-6H-purin-6-one) converted by chemicalize.org, is not explicitly juxtaposed to AZD5904. However, it is the only one in the paper and it whacks WO-2009025618-A1 "MIPO inhibitors for the treatment of huntington's disease and multiple system atrophy" which makes it a good bet.

Any way you look at it, the progress of PubChem over the last 8 years has been impressive and many biological chemistry domains have been transformed as a consequence. Much credit is thus due to the NCBI team (some of whom it has been my pleasure to have a beer with). However, such eulogizing as I may occasionally indulge in on this blog needs to remain short and serve as an introductory transition to more technical matters.

For the record, the SID count today was already north of the 100 million landmark announced yesterday, at 100171777 (you will see a different figure if you click through, as the submission processing continues to notch up, including a few "on hold"). Like most truisms, it's not so tremendously insightful to state that the utility, value and impact of PubChem depends on its sources. But it makes the point that submitters therefore share some of the credit. What I have chosen to do here is dig in a bit behind some of the big submitters in that 100 mill. A logical first cut is the ~ 50 sources above 10K (see below).

This looks long-tailed but is clearly numerically dominated by the big sources. We can thus cut the top-ten to see more detail (below).

The SID counts (vertical axis) were extracted directly from source status (I initially got zilch back from the AKos source CID selection via the front page drop-down list but editing out the "&amp" characters from the query did the trick). There is an unlimited range of slicing and dicing we can do here but I can just go through a few interesting questions using some of the pre-cooked filters I already have in MyNCBI. Note that I have knocked these stats out in fairly short order, so if you spot any errors please let me know.

Do any have an SID:CID ratio significantly greater than ~ 1:1 ? This is an arcane question but it does affect the interpretation of any further analysis done at the CID level. The answer is; only SCRIPDB at 6:4. This may be related to patent document mapping.

What do these add up to in terms of total coverage ? The SID total was 78.8 million (i.e. close to 80%) . The CID total was more surprising, for me at any rate, in being 32.7 million (i.e. 92% of PubChem by compound). Just for the record the 10-way intersect was only 18,117.

What is the relative contribution of unique structures from these sources ? There are aspects to this question that cannot be detailed here but some discussion is included in publications (e.g. PMID 20298516 and PMID 22024215). Practically, it can be addressed via the intersect of 1[DepositorCount] (= 17.2 million), with the CID count from each source. I have plotted these as absolute numbers below but it can also be useful to transform these to the % of that source.
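The logic of that intersect can be sketched outside PubChem. This is a toy illustration only: the CIDs and source names are invented, and the function name `unique_content` is hypothetical. It counts, per source, the CIDs that have exactly one depositor (the equivalent of 1[DepositorCount]) and expresses them as a percentage of that source's CID total.

```python
from collections import defaultdict

# Hypothetical toy data: CID -> set of depositing sources (not real PubChem records).
cid_sources = {
    101: {"SourceA"},
    102: {"SourceA", "SourceB"},
    103: {"SourceB"},
    104: {"SourceA"},
    105: {"SourceA", "SourceB", "SourceC"},
    106: {"SourceC"},
}


def unique_content(cid_to_sources):
    """Return {source: (unique CIDs, total CIDs, unique as % of total)}."""
    totals = defaultdict(int)
    uniques = defaultdict(int)
    for sources in cid_to_sources.values():
        for source in sources:
            totals[source] += 1
        if len(sources) == 1:
            # A depositor count of 1 means this CID is unique to that source.
            (only_source,) = sources
            uniques[only_source] += 1
    return {
        source: (uniques[source], total, 100.0 * uniques[source] / total)
        for source, total in totals.items()
    }


if __name__ == "__main__":
    for source, (n_unique, n_total, pct) in sorted(unique_content(cid_sources).items()):
        print(f"{source}: {n_unique} unique of {n_total} CIDs ({pct:.1f}%)")
```

Note that this only reproduces the counting; it says nothing about whether a unique CID is canonically novel, which is the caveat discussed next.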

This graph gave me a few surprises but we need to mull over some caveats. First is the important technicality that the query result implies that the structure supplied by the depositor has, according to the PubChem chemistry rules, emerged from the pipeline as a de-novo (i.e. novel) CID. However, it may not be unique in the canonical sense if, for example, it is a (new) mixture or has unresolved stereo centers for which resolved versions already exist as "same-connectivity" CIDs (or vice versa).

So what were the uniqueness surprises ? Firstly the high ZINC and AKos vendor numbers. One explanation is that there may still be some virtual records in the former (see Shrinking-pubchem-yep-it-wuz-us). If AKos really have been cranking out 3 million novel synthesized structures, good on 'em, but a representational heterogeneity check would be needed to corroborate this. The second surprise was that Discovery Gate showed unique content at all because I had already established some of these to be 1-on-1 depositions of only those from the 22 million at-source structures that were in PubChem already (i.e. ~ 50%). While the usual representational differences may be the explanation it was also a surprise to find out that submissions ceased in May 2011 (SID > rank by date). While this is obviously related to the New DG roll-out it is unclear (because I am not one) if and when subscribers will get a new set of live links to the ACD and SCD collection, or if these will be re-mapped to the 11 million legacy SIDs. The third surprise was I did not expect ABI Chem to show the opposite trend to the other large vendors by having virtually no unique content.

What about mixtures ? These are also possible to query check. First up, PubChem contains just over a million CIDs with a covalent unit count of 2 or more (this may seem low but it has a big proportional impact in drug-like space but that's another story....). What is more significant is that 55% of all mixtures are submitter-unique (any ideas why ?). The plot below shows (on the vertical axis) the % of mixtures in the unique CIDs for each source (because the count is covalent units, identical multimers are also included but are not that common).

Note the absolute numbers behind these percentages are small and so do not explain big uniqueness differences, but they raise questions. The largest by far in absolute (unique) mixtures is from ChemSpider so it would be interesting to know how they got there (and the 1.7 million unique singletons as well, for that matter). Just for the record ZINC-unique had only three mixtures.

So why such big uniqueness differences ? The short answer is that the cause is not obvious in most cases but we can gain some insights using the PubChem toolbox. An important aspect to consider is the difference between active vs adventitious circularity. While there are complications (that may be worthy of another post sometime) any CID with more than one source has adventitious circularity in the sense of pointing to multiple, but nominally independent, submitters (e.g. aspirin has over 1000 including mixtures). However, the circularity between Discovery Gate and Thomson Pharma (2.8 million CIDs in-common) is not accidental, in the sense that, up until the end of 2011, they both deposited the same Derwent WPI patent-extracted structures. Thus, unique CID counts for any sources with active circularity are confounded. In the case of NextBio, the absence of any unique content (beyond ~1% CID "noise") suggests the origin of this particular circularity was a "piggy backing" process, whereby structures were simply extracted from PubChem and re-deposited as new SIDs. This inference is supported by the deposition or modification dates being clustered only in large batches between 2008/9. So is ABI Chem also "piggy backing" ? While submissions are also confined to historical chunks (this time between Feb and April of 2011) the assumption could be a bulk stock match to the early 2011 PubChem content. However, real existence as stock cannot be verified (unlike the other vendors in the top ten) because, for the number of entries I tried, neither their own code names, nor the IUPACs produce matches on the website. This would seem a bit of a commercial oversight because not having unique content increases the likelihood that other vendors have live stock links for the same CIDs.

Can we make any representational quality assessments ? Another caveat with unique content metrics is the difficulty of getting data that can resolve between the extremes of exclusive and valuable compound content on the one hand, vs. quirky representation on the other (e.g. dodgy structures that slip through processing rules to spawn de novo CIDs) and realistically, most sources have a bit of both. There are a couple of toolbox approaches I have played around with. The first is a combined query of incomplete chirality and E/Z resolution (including those crossed-bonds) that affects 10.3 million CIDs. The results applied to these sources defied straightforward interpretation so I'll park this (but I can share the query). The second approach is to open up a nominally unique CID list via "same connectivity". The problem here is it only does the first 10,000 so the sampling is non-random (e.g. just ranked by CID). Nonetheless, there is no harm in trying so from 1700099 Discovery Gate unique CIDs the first 10K spawned 28149. Thus those 10K are not canonically unique but, on average, have 2.8 "same connectivity" CIDs. The set of seven below shows the problem.
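A rough way to mimic the "same connectivity" expansion outside PubChem is to group records on the first block of the InChIKey, which encodes the connectivity while the second block carries stereo and other layers. This is a sketch under assumptions: it is an approximation of, not a substitute for, the PubChem "same connectivity" operation, the function names are hypothetical, and the InChIKeys in the toy data are obviously made-up placeholders rather than real compounds.

```python
from collections import defaultdict

# Placeholder InChIKeys (invented): CIDs 1-3 share a first block, i.e. the same
# connectivity with different stereo layers; CID 4 is a different skeleton.
toy_cid_to_inchikey = {
    1: "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    2: "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    3: "AAAAAAAAAAAAAA-CCCCCCCCSA-N",
    4: "ZZZZZZZZZZZZZZ-UHFFFAOYSA-N",
}


def group_by_connectivity(cid_to_inchikey):
    """Group CIDs by the first (connectivity) block of their InChIKey."""
    groups = defaultdict(list)
    for cid, key in cid_to_inchikey.items():
        groups[key.split("-")[0]].append(cid)
    return dict(groups)


def mean_group_size(groups):
    """Average number of CIDs per connectivity group."""
    return sum(len(cids) for cids in groups.values()) / len(groups)


if __name__ == "__main__":
    groups = group_by_connectivity(toy_cid_to_inchikey)
    print(groups)
    print("mean CIDs per connectivity:", mean_group_size(groups))
    # The figure quoted in the post: 28149 CIDs spawned from the first 10,000.
    print("post example:", round(28149 / 10000, 1))
```

A source whose nominally unique CIDs mostly fall into groups of size greater than one would show the same quasi-uniqueness pattern.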

The only Discovery Gate unique entry is CID 45273694 but it is just one of seven stereo versions in this case. My supposition is that we are seeing quasi-uniqueness from "imperfect piggy backing". This is supported by looking at the same-connectivity intersect with 868 Thomson Pharma CIDs (that Discovery Gate does not have SIDs in) where we can see "imperfect pairs" with different representations for the same structure. Thus, as predicted, it seems Discovery Gate may have no canonically de novo content, unlike Thomson Pharma.

What about lead-like content ? One way to address this is obviously via the PubChem constitutive Rule-of-5 filter (you can mouse over this in PubChem to see the details) but I have added a 300-800 Mw cut on top of this. I decided to display this for the source totals (shown below) rather than unique-only.

While there are no big surprises here (note I have included as "All" the figure for all 35 million CIDs as 59%.) some trends would seem to be interpretable. The middle set of NextBio, Discovery Gate and ChemSpider are each so large they are tending towards the 60% average for all 35 million. Towards the top end, we might expect vendors to selectively bias their properties in a lead-like direction and ZINC tops this trend at 76%. While the fact that the two lowest figures ~40% happen to be patent sources may not be coincidental they are quite different in their own right. For SCRIPDB the automated extraction of reagents, intermediates and non-medicinal chemistry patents could be reducing the average lead-likeness. Neither of these reasons should apply to Thomson Pharma but the presence of chemical journal extractions may be having the same effect.

What about bioactivity ? This is also easy to count because it is presented as one of the default filters you see in the lower right section of the query result panel. These have been converted to % figures for the sources (see below), also with the inclusion of "All" that currently records 803241 actives of which 436,235 have at least one ChEMBL link to literature-extracted data. There is a sprinkling of chemical property assays, probably a small proportion but there is no global filter to select these out. (yes, I could have sliced out activity for the uniques-only but this post is getting long....)

My guess is none of these sources have strategically selected for actives (although if I were a vendor I certainly would) but the literature and patent example extraction could explain why Thomson Pharma scores high. The intriguing paradox is that lead-likeness and activity capture seem not to be correlated (e.g. Thomson Pharma is lead-low activity-high whereas ZINC shows the opposite trend).

So what about the value of these big sources ? This is a moot point because the Achilles heel of PubChem is that they are just too nice (they are nice anyway but this is encouraged by mandate). What this means is they typically will not say "we appreciate your offer to submit x-million pointers back to your website but we judge their link value in the aggregate context for the majority of users to be low - so no thanks". This challenge for setting the value/quality entry bar is, of course, crucial for all multiple-feed dbs, commercial or public, bioinformatics or cheminformatics-centric (not to mention the diplomatic dimension). Notwithstanding, as an experienced, independent user (and big fan), I am at liberty to "tell-it-how-the-data-says-it-is", which I hope adds value to this blog. In this context, the data indicate that Discovery Gate, NextBio and ABIChem present 29.4 million dead-links (i.e. ~ 30% of all SIDs) without any de novo content. These not only clutter the place up but also add confounding circularity to query results. Ipso facto, users would benefit if these sources choose to deprecate. Due to the niceness of PubChem this could certainly be reversed when value had been suitably re-vamped (e.g. a fresh set of active Discovery Gate subscriber links, live stock connectivity for ABIChem, or NextBio restriction to links they actually have data for).

Update, 29 Sept: A new BioStar posting now informs that the genome and proteome are available directly from the GigaScience website that also includes BLAST and download options. Just for the record, we agree (or probably both used GENSCAN) 100% on the BACE-like homologue they have as OYG_10007802. Kudos to BGI for surfacing the data like this but it does now bring the portals we have to trawl for new metazoan genomes up to five.

************************************************
Given my interest in protein phylogeny in general and the evolution of BACE in particular I was initially pleased to see that the title of a new Nature paper (The oyster genome reveals stress adaptation and complexity of shell formation) implied not only that the Crassostrea gigas genome had arrived (and was an OA article) but also it would be the first complete mollusc. The paper includes links to masses of supplementary data but my efforts at finding a public genome assembly I could search my favorite proteins against proved fruitless, certainly since the only links in the paper pointed to SRA data files straight off the Illumina. I posted this problem over to BioStar where I received some useful comments (these folk received 8 votes and over 180 views because it became quite chatty) but these were mainly tips for processing the entire dataset myself that I had neither the time nor the inclination to contemplate.

Rather than stuff the paper with supplementary data it would have been better to have finished the job, at least to the point of where a reasonable assembly could have been public and searchable. Quite why the Nature editors and referees did not make this a condition of acceptance is unclear. This is part of the problem, for the metazoans at least, of which genome goes where, in what final state and when, between the JGI, NCBI, UCSC and Ensembl. We shall see which pipeline outputs the data from this exemplary mollusc ends up in.

The story does turn out to have a happy ending, at least as far as what I wanted to accomplish, because I was able to find Crassostrea as a taxonomic select in WGS for a TBLASTN. It then only took a few minutes to search BACE1, whack the contig AFTI01022267 at e-23, paste out the FASTA record and run GENSCAN. This got the 520aa ORF right off the bat (see below).

This turns out to have ~ 50 % identity to mammalian BACE1 as we can see from the BLASTP alignment.

This is a useful result as this constitutes the only full-length BACE from this major phylum (although I have found some partials from clams and mussels). There was no EST or TSA data for ORF corroboration in this case but the BLAST output (above) is not overly gapped and the InterProScan (below) looks cogent at both ends for the signal, C-terminal TM and the other UrBACE signatures.

The devil is, as they say, in the details. I enjoy ferreting out interesting cases, so here is the story of what could be termed "noise in the system" for the compound alluded to in "Asymmetric synthesis of a potent, aminopiperidine-fused imidazopyridine dipeptidyl peptidase IV inhibitor" (PMID 20128619). For reasons that I hope may become clear in print before not too long, but have already mentioned in a presentation, DPPIV inhibitors are useful examples for searches related to the extraction of chemical structures from text. A simple PubMed search duly retrieves the abstract above. What I initially perceived as good news was that the folk from MeSH had extracted an IUPAC from the paper (below).

The bad news is this annotation somehow got on to the wrong bus across to PubChem because it produced a ghost entry assigned to the MeSH tree, not the leaf of the hierarchy, as indicated in the SID record below.

Unfazed by this, I assumed that, with the aid of the trusted IUPAC annotation from MeSH, I would be able to map this to a CID in short order. I need not detail what circles this sent me round in, but I can illustrate the problems below. My first step was to run the IUPAC through chemicalize.org but, because of the wedges, I ran OPSIN as well (see both below).

These not only gave different stereo results (chemicalize top and OPSIN lower left) but both SMILES produced a "flat" query structure in the PubChem search box (lower right). These turned out to be orphans (i.e. no matches at 95% in PubChem or SureChemOpen). This was unexpected for this implied key compound from Merck. Cutting the story short, I clicked through from PubMed to the journal abstract with a structure image (below) and noticed this was different by one ring nitrogen from the IUPAC conversions above.

Because the rendering was large and clear I decided to try out the ChemSpider upload just for a change from OSRA (but it's the same code behind it ?) and the conversion worked fine (below).

Being just one click away, I also launched the ChemSpider search, rather than pasting the SMILES over to PubChem as usual (below).

Bingo, we get a full house of key links from CID 15953860. At this stage we thus have to question the relationship (or the reality even) of (R)-7-(2,4,5-trifluorophenyl)-5,6,7,8-tetrahydro-2,4b-diazafluoren-6-ylamine. I have not accessed the full paper so I cannot tell if the MeSH annotator chose a different compound from the lead or even made an error by dropping a nitrogen from the ring. This is certainly not a blaming exercise, but it is important to understand different types and sources of equivocality in the "system".

We can pick up different types of equivocality associated with CID 15953860 as shown below, because the same-connectivity SIDs have 8 permutations between three stereo forms, a "flat" and two salts.

The stereo alternatives are also reflected in the patent matching results in SureChemOpen searching with the canonical SMILES (below)

Notwithstanding the finer points of wedges these results seem to tie up. In addition we have some ChEMBL and PDB links to follow through. Tying in with the first patent publication in 2007 the Merck team had published structures in 2009 pointing to two PDB entries (PMID 19539471) of which "compound 34" = CID 15953860 (= compound 1 in the 2010 publication PMID 20128619) and the second "compound 25" = CID 11710963. We can get the corresponding patent whacks that look like the same family - but - spot the quirk .... (below)

Just in case you missed it the implication is two patent families with the same title. Rather than wade through INPADOC to check this out it's quicker, when I had PubChem open anyway, to corroborate the SID dates chosen by Thomson Pharma. Sure enough these were a year apart (i.e. the curators decided the families were different).

So where does that leave us ? Most of the big circles were eventually squared but we are still left with quite a collection of quirks from this example. They include:

1) PMID 20128619 had a dropped MeSH IUPAC > PubChem connection and includes a possible MeSH error
3) chemicalize.org, OPSIN, PubChem and SureChem search IOs were close enough to pick out isomers but they don't "round trip" exactly via SMILES
4) The "same" Merck lead parent structure was instantiated in 8 different SIDs
5) Found distinct patent families with the same document title
6) A dropped CiteXplore > ChEMBL link for PMID:19539471
7) For CID 11710963 the link to PMID: 19539471 is only indirect via MMDB. It would have been better if the MeSH annotators had picked up the obviously central linkage to compounds 34 and 25.
8) PMID 20128619 is secondary literature describing the synthesis rather than a primary SAR report so ChEMBL won't typically pick these up. This meant there was no database linkage between the two PMIDs with the same compound as the main theme.
10) BindingDB mirrors target-mapped ChEMBL records but still includes the deprecated ChEBI IDs as the main synonym
11) The salt-to-activity mapping rules are particularly unclear in this case. In ChEMBL the HCl and TFA salts are mapped to the same assay results but the parent picks up additional ones from the same paper. In PubChem BioAssay, derived from the ChEMBL record, they are all mapped to the di-TFA salt. In BindingDB the activities seem to be mapped to the HCl salt but the tris ion specified in the IUPAC is not in the structure record.
12) The SureChem extractions detected different stereo forms from different documents in the same family.

Update: SureChem slides from ICIC 2012 now posted.

This current analysis is a prelude to a presentation that will be given by SureChem at the ICIC meeting in a couple of days time (Wed 17th am). I shall pick up some themes from this retrospectively, including a close look at the SureChem deposited PubChem content, in due course. For now, I can show some approaches I have applied to pre-existing content. They are of general interest I hope and may add context to the slides that are going to be presented.

Of the four Lipinski properties I find Mw the most intuitively useful for comparing bioactive chemistry sources, particularly because of my biochemical background. It's also easy to cut from PubChem for any result set (but note you can slice by any property parameters). Below is an example using the advanced query dashboard for the whole of PubChem and the chart generated from those results.

With these presets you can move on to other sources, do the intersects and then normalise everything to percentages (this is the vertical axis units on all the charts below). The current major patent sources are shown, first in 3D to give a convenient overview.

From the bottom these are: SCRIPDB (3990915), IBM (2362985) and ChEMBL (760889 in PubChem). For Thomson Pharma (TP) we can make an approximate split of the literature from the patent component via the DiscoveryGate intersect. This should be predominantly Derwent World Patent Index (DWPI) pharmaceutical B extractions (2796464, TP AND DG) but probably includes some literature DG have extracted as well. The non-intersecting component is the non-patent section of TP at 1003736 (TPnon-P). Interestingly, the intersects of these splits with ChEMBL are the opposite to what we might expect (293805 and 70545 respectively), implying 10% of DWPI structures are appearing in med. chem. papers. A comparable analysis using the GVKBIO structures extracted from papers and patents recorded a 9% overlap (PMID: 17897036).
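The overlap percentages follow directly from the counts quoted above. As a quick arithmetic check (the variable names are mine, the numbers are those given in the text):

```python
# Counts quoted in the text
tp_and_dg = 2796464            # TP AND DG, presumed DWPI patent extractions
tp_non_patent = 1003736        # TPnon-P, presumed literature-only extractions
chembl_in_tp_and_dg = 293805   # ChEMBL intersect with the patent split
chembl_in_tp_non_patent = 70545  # ChEMBL intersect with the non-patent split

def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

patent_overlap = percent(chembl_in_tp_and_dg, tp_and_dg)
non_patent_overlap = percent(chembl_in_tp_non_patent, tp_non_patent)
print(f"Patent split in ChEMBL:     {patent_overlap:.1f}%")
print(f"Non-patent split in ChEMBL: {non_patent_overlap:.1f}%")
```

So the presumed patent split overlaps ChEMBL at roughly 10%, while the presumed literature split overlaps at only about 7%, which is the reversal noted above.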

Because further comparison of patent sources inside PubChem makes more sense after the forthcoming SureChem load, I shall just take the opportunity to outline some slicing methods. However, some trends can be perceived, for example, by comparing the two extremes as IBM (automated Chemical Named Entity Recognition, CNER from patents) vs ChEMBL (expert journal extraction), as shown below.

There is a skew of IBM towards lower Mw and a high-end increase in ChEMBL. We can attribute the IBM difference to two possible factors. The first is that CNER fails proportionally more often on longer IUPAC strings or chops them artefactually into smaller bits that pass the filters (PMID: 22148717). The ChEMBL curators do not have this problem, so, while they may take a little longer to sketch the big stuff, predominantly peptides and antibiotics, these get captured. The second factor is that because ChEMBL focuses on SAR it does not typically capture the reagents and intermediates that CNER will automatically pull out along with the examples. Note also that CNER pipelines such as IBM do not discriminate between pharma and non-pharma patents but any effects of this on Mw distribution per se are unclear. The chart below also includes ChEMBL but this time it is compared with TPnon-P, the presumed literature-only non-patent extractions by Thomson.

The overall similarity supports the idea these are both manual literature extractions but TPnon-P has more big stuff. My speculation here is that this is due to natural product papers being part of the TP feed. It could thus be these more complex structures that are pushing up the proportional distribution above 600 (if anyone has internal instances of these two databases and a download of the Dictionary of Natural Products (DNP) structures they could perhaps check the appropriate intersects and post a comment).

So how do drugs look by the same cuts? For this I took the nearest approximation to a clean drug set in PubChem, AID 1195 (the FDA Maximum (Recommended) Daily Dose Database). This stands at 1216 but for various reasons I filtered back to 961 by MeSH pharmacology and 957 by parent CID. The comparative plot with ChEMBL and PubChem (all) is below.

There is a lot of literature on drug-likeness parameters and it's no big surprise that approved drugs are smaller than the more lead-like research-phase actives in ChEMBL. Because PubChem records both pharmacological testing via MeSH and the tested/activity results for each CID, for the last plot I have combined Mw with bioactivity cutting (below).

As we might expect the three parameters do not track in parallel but are in the Mw order: drugs < pharmacology in vivo < activity in vitro. However, as I plotted them I spotted a caveat for the latter, namely that assay testing is not uniform across the Mw range. These should thus have been normalised to the actives:tested ratio (but some other time maybe). Further interpretation of these plots encompasses some big themes but for now it is sufficient to indicate how straightforwardly these analyses can be done in PubChem.
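The actives:tested normalisation that I skipped is simple enough to sketch. This assumes you have per-bin counts of tested and active CIDs from two separate PubChem cuts; the counts below are hypothetical and only show the shape of the calculation.

```python
def active_rate_by_bin(tested, active):
    """Return {mw_bin: actives / tested} for every bin with at least one tested compound."""
    rates = {}
    for mw_bin, n_tested in sorted(tested.items()):
        if n_tested == 0:
            continue  # no testing in this bin, so no ratio can be formed
        rates[mw_bin] = active.get(mw_bin, 0) / n_tested
    return rates

# Hypothetical counts per 100 Da bin, for illustration only
tested = {200: 1000, 300: 4000, 400: 2000, 900: 0}
active = {200: 50, 300: 400, 400: 300}
rates = active_rate_by_bin(tested, active)
for mw_bin, rate in rates.items():
    print(f"{mw_bin} Da bin: {100 * rate:.1f}% of tested compounds are active")
```

Plotting the ratio rather than the raw active counts removes the distortion from bins that simply had more compounds screened.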

This post was inspired by the following conjunctions:

1) The BBC Radio 4 Today slot about the GSK transparency on clinical trials.
2) A new publication, “Open data for drug discovery: learning from the biological community”, has appeared from ChEMBL and GSK authors.
3) My co-authors and I had the good news about the acceptance of “Challenges and Recommendations for Obtaining Chemical Structures of Industry-Provided Repurposing Candidates” (there is some background in this blog post chain and we will inform when the paper goes online).
4) There is a useful commentary on the initial GSK press release.
5) A recent conference presentation from Ben Goldacre outlines his plans to mine clinicaltrials.gov (the tail end of his talk from 18:20 onwards) and, amongst other provocative ideas, one is to encourage independent interested parties (e.g. patient groups) to actively “prod” sponsors for overdue data.
6) I moved over to figshare a magazine article co-authored at the beginning of this year, entitled “Connecting Up: assessing the name space and molecular mappings of the drug interventions in ClinicalTrials.gov”.

The essence of the GSK press announcement is that details from clinical trials will become available so that others (implicitly on the basis of approved but non-GSK affiliated data mining) can draw independent conclusions about safety and efficacy of their new therapeutic agents. In addition, the top-200 TB screening hits are due to be published (but no details yet on database depositions). As interesting as these declarations of intent are, the real utility for "us" will be dependent (as ever) on the technical details where the rubber meets the road (to use a US cliché). Notwithstanding, I'd take bets that other pharma companies will follow suit. The interesting "Open data" article is somewhat orthogonal to the code number/trial data issues per se but, significantly, it announces that GSK are pioneering a short-cut for direct deposition of supplementary SAR data into ChEMBL.

What we can do here is check how well, obviously retrospectively, GSK have publicly declared (or made plausibly findable at least) structures associated with the code names of their recent clinical trial candidates (i.e. the basics of transparency). Notwithstanding the caveats described by us and others (including Ben Goldacre) it is possible to perform queries at clinicaltrials.gov with useful specificity. For example, I was intrigued to find that nested wild cards could pull out current and pre-merger code name stems, via the simple query "Drug | Interventional Studies | GSK* OR GW* OR SB* | Glaxo SmithKline [Lead]" (you can even find some old SK* hits). The returns are shown below, along with a standard sortable Excel download.

I can drop the sheet out on figshare if anyone expresses an interest but it is easy to generate (note also this could be done for any consistently prefixed set of company codes but ideally triples). So where do we go from here? I am sure many of you can think of interesting options for the list but for this post I shall just pop code names of the most recent trial declarations to see what I can find. I had intended to do the first 10 but, as is often the way, just the first one produced an extended story that became more than enough for one blog post.

While the code name resolution triage is already described for the NCATS code names, I tried what I hoped would be a shortcut via the NCBI all databases Entrez interface because you could, in theory, match to PubMed (PM) and PubChem (PC) at one pop, but, this turns out to be hyphen sensitive (and so is PM), so I had to revert back to the two-stop individual sources. First on the list, GSK1605786 is PC-negative but at least we get two PMs (below).
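Since hyphen sensitivity bites at several of these sources, it can help to generate both forms of a code name up front. The following is a minimal sketch; the function names and the simple "letters then digits" pattern are my own assumptions and will not cover suffixed codes such as CCX282-B, which are passed through unchanged.

```python
import re

def code_name_variants(code):
    """Return the hyphen-free and hyphenated forms of a simple PREFIX+digits code name."""
    m = re.fullmatch(r"([A-Za-z]+)[-\s]?(\d+)", code.strip())
    if not m:
        return [code]  # anything more complex is left as given
    prefix, number = m.group(1).upper(), m.group(2)
    return [f"{prefix}{number}", f"{prefix}-{number}"]

def or_query(code):
    """Join the quoted variants with OR for pasting into a search box."""
    return " OR ".join(f'"{v}"' for v in code_name_variants(code))

print(or_query("GSK1605786"))
```

Whether a given search interface honours the quotes and the OR is something to check per source; the point is only to avoid missing hits that use the other spelling.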

These were not GSK publications and I don't have access. However, what unexpectedly came up trumps was a Google Images search (below).

The match is to a ChEMBL blog post on the April 2012 USANs. This gets pretty close to the horse's mouth because it includes a link to the PDF of the GSK approval for Vercirnon. I was just about to convert the IUPAC in the PDF via chemicalize.org but, having fortuitously resolved the code name to a USAN, I tried an open Google search (below).

So now we have extended the mapping chain to: vercirnon, vercirnon sodium (the usual USAN parent-salt doublets), Traficet-EN, CCX282-B (both ChemoCentryx legacy names as licencee) and GSK1605786 or GSK-1605786. This establishes that both the USAN (and the INN) PDF contents are Google-scraped. Last but not least we also have a ChemSpider (CS) database hit (below).

There are some interesting aspects to CS 8518913 but it needed some Tweeting and a comment feedback to suss out the details (and thanks for the responses). First up is that CS are taking a feed of new USANs via ChEMBL cloud resources. The quirk is that the RN (CAS Registry Number) from the USAN gives a false-positive return from PubChem because the query runs as (698394[All Fields] AND 73[All Fields] AND 9[All Fields]). We thus get back NSC698394 as CID 3107921. Should you want to interface pop or script up RN queries it needs to be ("698394-73-9"[CompleteSynonym]). To be fair, false positives like this case are both rare and obvious because they only happen a) where CS has an RN that PubChem does not and b) there happens to be a spurious 6-digit match in the CID fields. I am pleased to add that not only were my updates to the above entry added in short order, so you will now see the synonyms revised as below, but also the URLs associated with the RN flag were fixed.
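For scripting RN queries, a minimal sketch of building the quoted term is shown below. The check-digit test uses the standard CAS rule (weighted sum of the digits, modulo 10). Routing the term through the NCBI E-utilities esearch endpoint with db=pccompound is my assumption about how one might script this; the code only builds the URL and makes no network call.

```python
import re
from urllib.parse import urlencode

def cas_check_digit_ok(rn):
    """Validate a CAS Registry Number using its final check digit."""
    m = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", rn)
    if not m:
        return False
    body = m.group(1) + m.group(2)
    # Weight the digits 1, 2, 3... from right to left and take the sum modulo 10.
    total = sum(i * int(d) for i, d in enumerate(reversed(body), start=1))
    return total % 10 == int(m.group(3))

def rn_term(rn):
    """Quote the whole RN so the hyphens are not split into separate AND terms."""
    return f'"{rn}"[CompleteSynonym]'

def esearch_url(rn):
    """Build (but do not fetch) an E-utilities esearch URL for PubChem Compound."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return base + "?" + urlencode({"db": "pccompound", "term": rn_term(rn)})

rn = "698394-73-9"
print(cas_check_digit_ok(rn), rn_term(rn))
print(esearch_url(rn))
```

Validating the check digit first is a cheap way to catch mistyped RNs before blaming the database for an empty return.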

OK - so who else has picked up this RN ? The Google results are below.

Surprisingly, along with the instantaneous capture of blogger posts, and the sources we already expected, this seems to be purchasable already (top link page below).

However, my Avast antivirus gave me a malicious URL warning from the LookChem home page so I am disinclined to inspect the entry, but it looks like a search-engine-optimised, dubious secondary brokerage operation (no stock, just calls for tenders). We could speculate their RN > structure link (note the patent reference) is derived from SciFinder, maybe after picking up the USANs. We should thus move swiftly to some more solid informatic ground via the PC link from the CS entry. The SIDs under CID 10343454 are shown below.

Only two of these are primary sources (the rest are piggy-backing) and, as already established, neither included any of the synonyms above. First was Thomson Pharma, back in 2006, presumably from a manual extraction of an early ChemoCentryx patent. Second in, 6 years later, was an automated patent extraction by SCRIPDB, entering as the SID just a few months ago. OK, so the next go round the sources can be via the InChIKey JRWROCIMSDXGOZ-UHFFFAOYSA-N (below).

The CS hit was no surprise but the chemicalize.org direct match was, because there was no PubChem entry for this source as would be expected (I may follow up on this). Just for the record, the entry is shown below.

So, after these technical digressions where does this leave us in the transparency stakes ? To be fair to GSK a late licensed compound is not a good example as the research data was not generated on their watch. We can look at their clinical trials search portal and it comes up on the Google hits (below).

But, while it is nice to get the summary reports they do not specify the structure. However, if we put the legacy code in as well, we now bring back 11 studies (below).

We can do the same thing in PubMed (but remember you need the hyphens here) which brings us up to four reports.

OK so let's tot up this week's quirk list:

1) Beyond the INN and USAN applications GSK have done nothing to "transparently" declare a name-to-struc. For INNs and USANs it's only Google indexing that makes them findable. Why these two crucial operations have never seen fit to put up a proper public database is a mystery. Note also, unless you use domain selection in advanced Google search for GSK-1605786 you would be hard put to pick up the USAN PDF ranked at ~100 because the top matches are swamped by clinical trials mirroring and replication sites.

2) Note in this case the intersect between the USAN picked up by ChEMBL and the CS entry was fortuitous, not systematic capture. CS only had a structure to link the new USAN information to because this had been pulled across from PC pre-2007. There is no direct Thomson Pharma feed to CS so this old one was picked by chance because ChEMBLdb (that does have a CS feed) does not have the structure.

3) The capture of USANs by ChEMBL with a CS feed is welcome but it does seem paradoxical that the structure mappings surface in London but not Bethesda. However, the structure itself is not yet in ChEMBLdb because neither ChemoCentryx nor GSK saw fit to publish a primary medicinal chemistry SAR paper that would have been captured by ChEMBL.

4) I can only access one of the four PubMeds but my guess at the reason for the mapping failures on the Bethesda side is that none of the papers explicitly specified a structure that the MeSH annotators could have picked up and eventually linked to a PubChem entry.

5) Legacy code number changes due to licensing are becoming more common and hence more problematic because of the "lost forward-mapping". In this case, the recent publications have included synonyms but we still have two PubMeds and three clinicaltrials.gov entries that only retrieve via CCX282-B (i.e. they are GSK-1605786 -ve and thus do not link forwards). Note here we have suffix ambiguity for usage of CCX282 +/- B even by ChemoCentryx themselves. The status of the synonym Traficet-EN is also unclear because it is quoted as a Trade Mark but is not an approved brand name for the INN (anyone know what -EN stands for?).

The Korean medicinal chemists that feature in the In The Pipeline “Oops we didn’t mean to publish that” posting could be forgiven for feeling a little hapless in the face of this exposure of the interesting results published in their paper and patents. I am not inclined to add to their discomfiture (any blame should be attributed to their managers anyway) but the story does have some interesting informatics threads and is another code name case. Their paper, retracted ostensibly but illogically for IP reasons, was about the mode of action of a development compound (SKL-NP) on channel currents (below).

The first question is what is SKL-NP? It is PubChem –ve but before we follow up in the text it is clear that it is not the structure sketch from figure 1. This was a prior-art scaffold that the authors included in their own patent, cited in their paper as US20110195963 (below).

Nominally the IUPAC is “R-Carbamic acid 3-[4-(3,4-dimethoxy-phenyl)-piperazin-1-yl]-3-oxo-1-phenyl-propyl ester” but they assign two Mw’s to this, 388.27 under figure 1 but 509.5 in the text. First thing was to Google the IUPAC with the quotation marks (below)

This ties the patents and paper together nicely but it turns out they dropped a bracket and the number for what should have been (1R). Next up was to use chemicalize.org to convert the IUPAC, then a PubChem SMILES pop, followed by "same connectivity" (below).

OK so we have the two stereo and the flat as CIDs, all extracted from the Korean team patents by Thomson Pharma and SCRIPDB. For the record (and Google indexing) the SMILES are COC1=C(OC)C=C(C=C1)N1CCN(CC1)C(=O)CC(OC(N)=O)C1=CC=CC=C1 and the InChIKey is HMSSTVIZRKGLHN-UHFFFAOYSA-N. Next up is the SureChemOpen result for the canonical SMILES search (below).

So far so good. While the Mw of 413 does not match the choices in the paper, we have whacked the patents that probably specify SKL-NP regardless of whether we have the right example structure. Given their explicit mention of IP issues as the reason for retraction it seems odd they are the only ones to have exemplified the structure anyway, so I took a cursory look at the three patents. This is where the plot thickens as we can see in the titles and abstracts (below).

As befits such IP situations I shall pass no opinions beyond what is explicitly written in the documents (the priority dates and patent family connections are there for inspection) but the two applications, with one inventor different, specify the same structure (shown above) not only as a 5-LO inhibitor (example 1 in 5536) but also to have analgesic effects in animal models (example 5 in 5963) as well as being an HCN channel blocker as described in the paper.

So what's going on? I dunno - but these two applications and the paper seem to have a few unsquared circles between the results, claims and structures. Maybe this represents a particularly nifty indication switch (early bird repurposing even?) but it's quite a jump from lipase inhibitors for diabetes to channel blockers for pain. A more parsimonious explanation is that things may have got a bit mixed up somewhere. If they had not got their name in lights because of the retraction (which changes nothing beyond the enduring embarrassment of the banner) nobody would have noticed any of this - except perhaps the patent examiners.....

I’m more interested in visitor milestones (and thanks to everyone for pushing these towards 15K and ~1.7K pm) rather than posting metrics, but I thought the 50th post warranted a little extra. As if by magic, something popped up in my Twitter feed a few days ago which spawned a set of connectivities that intersected with a number of themes I have been engaging with, both here and in recent manuscripts. Even for what you might be used to already, this story takes some grappling with. I’ve tried to logically unfold it as best I can but you might like to reach for the coffee, glass of wine, crack a tube, or whatever you enjoy as a perspicacity booster.

The trigger posting (below) rang faint bells related to target enzymes but my curiosity was piqued to open the link by immediately guessing that AZ1 was not a real development compound but might nonetheless be connected with one.

It turns out I was right. While it would not be my choice to append a flippant tag line to a commentary on a serious disease, a pair of F1000 Respiratory Disorders experts had singled out "Late intervention with a myeloperoxidase inhibitor stops progression of experimental chronic obstructive pulmonary disease" (PMID: 21997333) for comment because the compound both prevents and slows the progression of lung pathologies in guinea pigs and, according to the F1000 experts, provided a rationale for future studies in humans with COPD.

The abstract did not disclose the identity of AZ1 but I noticed MeSH had linked it to 3-((tetrahydrofuran-2-yl)methyl)-2-thioxo-7H-purin-6-one, as a supplementary concept even though there was no PubChem integration. Popping the IUPAC into chemicalize.org and checking the internal "similar structures" search immediately surfaced the connectivity (that I had no idea about beforehand) to one of my own blog posts (below).

This was nicely corroborated by the InChIKey inner skeleton search on Google (below)

The MeSH annotation starting point thus facilitated the connectivity of the nominal AZ structure in PMID: 21997333 to two major database entries and my own blog post that included a mapping to AZD5904. The caveat is that the databases contain enantiomers and a racemate for this structure. Note that I had edited the 2R InChIKey into the FigShare Excel sheet accompanying the blog post, but the MeSH entry was the flat one. However, using the skeleton key search picked up both entries because chemicalize had automatically converted the IUPAC string from the blog. It could not pick up the InChIKey from the FigShare file (by definition, but there is another quirk here if anyone can spot it). These three structures are nicely rendered in the ChemSpider entries corresponding to the three PubChem CIDs, 2S (CID 10467378), flat (CID 10354842) and 2R (CID 10264211), shown below as the internal results for the InChIKey skeleton search.

While I could not access PMID: 21997333 directly from the PubMed link I found a full text "backdoor" link via a Google search with AZ1 and COPD. This happened to hit the LinkedIn page of one of the authors who had not only listed the title in his publications but also provided an advanced publication link (this is actually also open on the publisher's site). In the methods section we find "AZ1 (3-[[(2S)-tetrahydrofuran-2-yl]methyl]-2-thioxo-7H-purin-6-one)" (CID 10467378) which raises interesting questions. The first of these is why the MeSH annotator chose to omit the stereo designation. The second is that, having made the connection, I assumed this was a case of "reverse blinding" (i.e. including the structure but not the code name). However, it looks like the authors may have chosen to do these guinea pig studies on 2S (i.e. neither the mixture nor the 2R). Ipso facto, AZ1 is (probably) not AZD5904. Another question arising from this manuscript is that it finishes with the optimistic quote "AZ1 may be useful for the treatment of COPD in humans". Regardless of which enantiomer AZ1 is, the paper could have at least mentioned that AZD5904 had dropped out of development (Google hit below).

While this was announced in 2005 it was still listed in the 2008 development portfolio as in phase I for MS (not COPD) but was eventually designated as dropped by 2010. Presumably, as a consequence, it now features in both the 2011 AZ/MRC collaboration and the NCATS list for 2012. Both these facts probably escaped the notice of the F1000 reviewers (this information was Tweeted back to F1000 but has produced no response yet). The AZ portfolio listings also include a newer MPO inhibitor, AZD3241, indicated for Parkinson's disease, but I drew a complete blank on structure mappings.

I should back-track at this point to explain how I made the code-name-to-struc connection in the first place. The only publication hit for AZD5904 in Google Scholar (but not in PubMed directly) is because the Googlebot scraped text that included the code number from behind the journal pay-wall for the 2012 paper (PMID 22592983). This was somewhat bizarre in that, as the only publication by AZ that declares the name-to-struc for the ex-clinical candidate, this describes not pharmacology but use of the compound as an LC/MS signal standard. The disclosure is also partially cryptic in that "3-[(2R)-tetrahydrofuran-2-ylmethyl]-2-thioxo-1,2,3,7-tetrahydro-6H-purin-6-one" is not juxtaposed to AZD5904 in the same sentence, but, given the extra information below, it would seem to have been correct.

We can thicken the plot at this point by bringing in a 3rd AZ affiliated publication, namely "2-thioxanthines are mechanism-based inactivators of myeloperoxidase that block oxidative stress during inflammation" (PMID 21880720 2011). The good news is, not only does it exemplify structures both as images and IUPACS together with IC50s, but it is also open access (see the tested compound list below).

The bad news (for anyone interested in 5904 but not finding my blog post) is that this is definitely "reverse blinded" because it includes the AZD5904 structure (CID 10264211) but with a surrogate code name TX-4. There are additional informatics points in this manuscript:

a) It cites the earliest patent WO2003089430 "Thioxanthine derivatives as myeloperoxidase inhibitors"
b) There are PDB structures deposited from this study but the (covalent) ligands are TX-5 (CID 9815972) and TX-2 (CID 10378726) not AZD5904
c) It does not cross-reference the other two AZ papers (i.e. PMIDs 22592983 and 21997333) but 21997333 does forward-cite it.
d) Because the MeSH annotators did not pick up any IUPACs, the NCBI system can't close the inner circle. Thus, we can follow PubChem > MMDB(PDB) > PubMed but there is no direct PubChem > PubMed link.
e) Some in vivo work is included but this is on inflammation in mice and only with the TX-3 racemate
f) While there are no error limits, note the IC50 for TX-4 of 200 nM (2R) is 1/2 that for the TX-3 racemate but no data was included for the 2S.
g) Having a PDB ligand and mammalian sequences allows some neat things. I'd be the first to admit that predicting the species cross-reactivity for AZD5904 on the basis of the interaction shell of TX-5 is pushing it, not to mention comparing a covalent vs a non-covalent inhibitor, but nonetheless there is a basis for at least attempting this, even if it would need a proper docking study and confirming experimentally. This is shown schematically below.

At the top is the ligplot display and below this are the positions and secondary structure elements in the vicinity of the two contact residues, R239 and E242, both obtained from the PDBsum entry for 3ZS1. Below this is the sequence alignment for human, guinea pig, rat, mouse and dog, in descending order. The implication is that the differences in close proximity to the contact residues could cause differences in the cross-species potency, not just for AZD5904 but other compounds in the series.

We can now compare the two experimental publications to the information supplied via the MRC and the almost-identical-but-not-quite NCATS information sheets. Firstly, the latter cross-references PMID 21880720 but the former does not (but this could be accounted for by the time difference). Secondly, they report a slightly lower IC50 of 140 nM. Thirdly, the NCATS document states "AZD5904 was in Phase I trials" but there are no entries in the AZ trial index or in clinicaltrials.gov (it turns out an additional five from the MRC 22 also have no match). Fourthly, the MRC sheet ends cryptically with "AZD6055 (sic) is poorly CNS penetrant". This code is Google-negative (except for a Brazilian licence plate) but is close to the series and does not look like a simple typo (could it be a back-up or even the 2S?). The NCATS sheet corrects this to 5904 and Google actually finds some data that might explain why the compound was not progressed for MS (below).

We can move on to some additional cross-checking. Having established that the MeSH entry for AZ1 was not in PubChem it was a surprise to record the following Entrez matches for AZD5904.

It turns out these are false-positives originating from a single submission from the Therapeutic Target Database (TTD) as SID 134339075. The corresponding CID 177992 is L-694,458, an elastase inhibitor from Merck. While I have come across a few of these before and have e-mailed accordingly I decided to do some digging. Usefully, you can get a complete ID cross-matching download from TTD with 18,981 rows that include 23 AZD entries, but oddly only 7 GSK numbers. It turns out that the AZ codes were extracted from Pipeline-Summary-January-29-2009. This might account for these outnumbering the GSKs because in PubChem it is the other way round (GSK* [CompleteSynonym] = 107, AZD* [CompleteSynonym] = 52). By doing some intersects I can establish that TTD have contributed 20 AZD synonyms to the 52 (but there are ~5 Azdd azido derivatives in the wild-card result) but if anyone can think of an interface query for "which CID has synonym from which source" please ping me. If I get to the bottom of how these false-positive structures got mapped in to PubChem I will post it here.

PubChem does have a heuristic that can "push out" orphaned synonyms but in this case even "AZD5904" [CompleteSynonym] will still hook it out. Unfortunately, the CID synonyms have been indexed in BioAssay. However, while someone could add the correct synonym into another SID (i.e. add a true positive) the false positive can only be removed by the submitter. In the meantime, as we know, errors propagate globally and from the Google hits we can find at least two of these (below).

The first of these is a Wiki entry (from the Gene Wiki stable) and the second is a drug repositioning database, presumably both seeded by bots (a.k.a. brainless parsing).

We all live with the sublime and ridiculous contrasts of Google searching but I continue to be surprised at the depth of reach for bio and chemical terminology (not just InChIKey searching that I hope to have something in print about soon). Something else we have to live with are the different flavours of IUPAC. However, one tip worth knowing is that if a company publishes an IUPAC in a paper the chances are they will have used the same (software to generate) that character string in their patent. The bad news is that MeSH may re-format the IUPACs from the papers and the database entries specify yet other variants. The advantage of IUPACs, if you happen to be reading a paper or a patent, is that you need no transformation steps but can just fire them off to Google from the browser. The example below is from inside the text of PMID 21880720.

I have pasted in below a result selection of searches from the IUPACs from the publications and MeSH indexing above.

I won't detail these but note a) the MeSH term hooks back to PubMed and also finds its way into the Comparative Toxicology Database who take a direct MeSH feed. Unfortunately this becomes a "linkless orphan" because there was no PubChem hook up, b) Google has scraped and indexed the JBC full-text via PubMed Central, c) the fact that 2R is hitting process patents indirectly supports the identity of AZD5904 (i.e. they were intending to make lots of it), d) strangely, the 2S brings nothing back, e) the "virtual vendors" seem to do a good job of scraping up development compounds.

We can now move on to the patents. This week's trick is to use one search to pull back isomers by doing a similarity search with the canonical SMILES from the racemate C1CC(OC1)CN2C3=C(C(=O)NC2=S)NC=N3 (but this won't work for all examples). You can see this setting in the SureChemOpen interface and a subset of the results (below)

As anyone can reproduce this I shall only skim over the implications of this mosaic of the three structures across 18 patents and 9 years. The first filing of the racemate was, as cited in the JBC paper, WO2003089430 as example 14 (from 22), but with a different IUPAC style (3 - (2- Tetrahydrofuryl- methyl) - 2-thioxanthine) and an IC50 of 510 nM (but presumably a different assay configuration). These patents form an extended series specifying the enantiomers in later patents on different series but also covering new uses and combinations.

To continue with circle squaring I got a neat result from looking at the nearest neighbours (similar compounds) to CID 10354842 (see below).

This captures most of the key compounds in the papers, provides patent links and one of the PDB structures. There is an odd man out in the series, CID 23658379, that also has a PDB structure, but in this case Pfizer, in PMID 22352991 (this is another case where Googling the IUPAC hits a LinkedIn author), have done their own mechanistic enzymology on an analogue of TX-5. But there is also something odd about these CIDs, namely a complete absence of BioAssay links. This means that all the SAR data has slipped through the capture net because none of the papers appeared in journals typically extracted by ChEMBL (but they might pick up the JBC, we'll see).

So where does this leave us? The good news is we have joined up a lot of information from the starting point of one tweet and a MeSH entry. The bad news is, there is some opacity with regard to matching structures against the biochemistry for three species for MPO (rat, mouse and guinea pig) with different indications between the papers and patent filings. Last but not least I will surface some InChIKeys to make this post specifically findable in Google.

AZD5904 = TX-4 = RSPDBEVKURKEII-ZCFIWIBFSA-N

TX-3 = RSPDBEVKURKEII-UHFFFAOYSA-N

AZ1 = RSPDBEVKURKEII-LURJTMIESA-N
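The "inner skeleton" search used earlier in this post works because the first 14-character block of an InChIKey encodes connectivity only, so the three keys above share it. A minimal sketch of grouping keys that way is given below; the function names are my own, and treating a second block that starts with UHFFFAOYSA as the "flat" (no stereo specified) form is my reading of the standard InChIKey layout.

```python
from collections import defaultdict

# The three InChIKeys surfaced in this post
keys = {
    "AZD5904 / TX-4": "RSPDBEVKURKEII-ZCFIWIBFSA-N",
    "TX-3": "RSPDBEVKURKEII-UHFFFAOYSA-N",
    "AZ1": "RSPDBEVKURKEII-LURJTMIESA-N",
}

def skeleton(inchikey):
    """Return the first (connectivity) block of an InChIKey."""
    return inchikey.split("-")[0]

def is_flat(inchikey):
    """True when the second block is the empty-stereo hash, i.e. no stereo specified."""
    return inchikey.split("-")[1].startswith("UHFFFAOYSA")

def group_by_skeleton(named_keys):
    """Group names whose InChIKeys share the same connectivity block."""
    groups = defaultdict(list)
    for name, key in named_keys.items():
        groups[skeleton(key)].append(name)
    return dict(groups)

groups = group_by_skeleton(keys)
print(groups)
print([name for name, key in keys.items() if is_flat(key)])
```

Searching Google with just the skeleton block therefore pulls back the racemate and both enantiomers in one go, which is exactly what makes it useful when the stereo assignment is in doubt.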

Update: 30th Nov. The latest G+ posts indicate data from the Sydney team are now being piped direct to ChEMBL so I'll take a look when it surfaces. This recent slide set from SE had a particularly pertinent diagram so I have added this below.

This piece was inspired by conjunctions related to a recent G+ posting from an OSDD Malaria Team Leader (see comment) and I quote, “We're getting to a stage where we need to figure out how to best handle our compound/activity database more effectively while keeping it nice and open”. I posted a short comment on the end of the interesting G+ thread but a first conjunction was that I recently acquired a new PMID 22935805 as a comment "Shouldn't enantiomeric purity be included in the 'minimum information about a bioactive entity'?" (yup, but we forgot). In essence MIABE is a publication check-list for “getting it out there” in a form to maximise subsequent database capture (PMID: 21878981). The last conjunction was that findability is dependent on how people put it out there and the difficulties of this were encountered head-on in the exercise to link drug candidate code names to structures for the NCATS and MRC repurposing sets (just published as PMID 23159359 but I'll review this in a new post when the final proofs surface).

There are many aspects to this theme but I shall just make a few points based on my own experience that I hope are practically useful. I will restrict this to the OSDD Malaria case as I am fortunate to have some acquaintance with the data and its generation. However, I will start with a bit of philosophy. There is a viewpoint that can be expressed along the lines of “having toiled long and hard to produce these results why should we need to jump through hoops to get them into databases?” Having negotiated such hoops myself for sequences and chemistry over the years, even as a specialist, I found jumping through these far from trivial and therefore have complete sympathy with experimentalists who feel this way.

However, I strongly encourage them to revise this perspective for the following reason. Let's say, hypothetically, an OSDD team used just 1% of their project time for pushing structures and results out to public databases. I would argue this would be way too low compared to the crucial value of having the fruits of their labour “out there”, globally linked and fully mineable. Even if this overhead increased to 10% or more this would certainly not be too much compared to the 90% of B,S&T over the years of a project in producing and collating the data in the first place (there are many analogies here but PDB would be a good one). While the path to database submissions should certainly have lower resistance than it currently does, the necessary time and commitment needs to be kept in perspective.

We can split the current OSDD challenge into two phases. The first one is initial public surfacing close to real time. This was aptly described in the G+ post as the point when the Excel sheet on the Wiki begins to creak under the strain. Details of the IT options (including those mentioned in the G+ thread) are beyond the scope of this post but will doubtless be based on the Electronic Laboratory Notebooks (ELNs) they already utilise, extended by an automated push to some kind of surfacing database. I have no opinions here, beyond suggesting a) belt-and-braces representational sets (IUPAC, SMILES, InChI strings and InChIKeys) b) ensuring these can be picked up by the Googlebot and chemicalize.org c) where possible, including an activity average. There may be some equivocality between what is being designed (a.k.a. virtuals), synthesised, made, ordered, arrived, tested, etc., but given the relatively small numbers of structures we can trade a bit of rough-and-tumble for the benefits of real-time connectivity (e.g. early mining, collaborative facilitation and serendipitous hook-ups via Google structure matches). I know that ELN technology has the potential to push-button some of this and hook in stuff like synthetic schema and spectra, including a possible ChemSpider submission feed, but these technical options need to be assessed by the teams on the ground.

The second stage of disclosure encompasses public archiving and updating. I can address this in more detail because it is the end I have been most engaged with. As we know, data becomes more valuable as it matures but needs to be accompanied by the processes of collation, confirmation, linking and updating that may take years. Regardless of whether the data set is large or a more conventional SAR study, the most effective process by far is a good peer-reviewed publication that supplies detailed background information, full metadata, a MIABE-compliant description of the results and first-pass analysis. Obviously if team funds stretch to open access this helps a lot (including full-text indexing by Google).

The impediment to this dissemination route is obviously that quality publishing is not quick (unless you are a crystallographer, according to recent tweets by @Richvn) but the team could explore, with the editors' engagement, whether the process could be accelerated for OSDD, for example, by faster reviewing and, for related result series, a template approach to drafting. I don’t see “minimal publishable units” as an issue in this case because chunks of robust data in a succinct style and with good metadata are more valuable (and looked upon favorably by grant authorities?) than a lot of the published stuff out there.

I am aware of options that bypass and/or supplement conventional publishing, indeed I have used figshare for this myself. The latest improvements look useful but until such time as reliable linkage between a deposition and major databases emerges, it would seem unwise to rely on this for primary and archival deposition. Using it for preview or back-up is a different matter but immediately raises mirroring and maintenance issues (i.e. avoiding different data in different places). Teams with data sets too large to be directly incorporated into a manuscript have a number of options as to where their data can be instantiated. These include supplementary data sheets via the journals, group web sites, ChEMBL-NTD, ChEMBL-Malaria Data, CCD and PubChem BioAssay but this is another theme.

As we know, many good things happen as a consequence of publication, not least of which is PubMed archiving, reaching all points beyond (e.g. CiteXplore, GoPubMed, PubMed Central, Google Scholar, Mendeley) as well as good ol’ MeSH indexing. However, the best thing that could happen to an SAR paper is to be selected for expert extraction by ChEMBL. In turn, this ensures the entry of the structures into PubChem and CiteXplore as well as, crucially, the activity data into ChEMBL and PubChem BioAssay. At this point the hard-working OSDD scientist might relax a little (crack a tube even) safe in the knowledge that their hard-won output is being “looked after” and that more good things will happen without their intervention. These include property calculations, computation of the 2D and 3D chemical structure neighborhood in PubChem (including patent space), integration within the ChEMBL and BioAssay schema as well as linking to the biological and bioinformatics relationships in ChEMBL and Entrez.

But they shouldn't relax too much, because of a second philosophical point closely related to the one above, namely that whatever data gets "out" needs “looking after” (i.e. they should not submit-and-forget). Here again, I fully appreciate that this will not be second nature for most experimentalists or their PIs, who are driven both by project timelines and constitutive enthusiasm to push on to new stuff. However, the best expert to cross-check not only the primary records but also computed or curated relationships connecting their data to other records (even accumulating over years) is not the "crowd" in the first instance, but rather the individuals who generated the results and wrote the papers.

I am thus suggesting that OSDD scientists should rush in where medicinal chemists have hitherto disdained to tread. Indeed, as a logical corollary of openness, linking and provenance, it is very much in their interests to do so. In particular (but comments are welcome if I have unjustly overlooked where this has happened) pharmaceutical company or academic medicinal chemists are not known for correcting public data entries for their own structures, even where the “system” has introduced erroneous structures, synonym mix-ups or a swathe of automatically-generated spurious relationships, leaving us PB-users to our own devices and head-scratching. Errors of all types have occurred, and will occur, in the result generation > database record flow but OSDD scientists could engage directly to fix what is important for their data. For example, ambiguities or omissions in the paper that lead to automated or curatorial errors (e.g. species designations, assay units or fuzzy wedge bonds) can be retro-fixed in the entries (even MeSH links?).

For sure, this counsel of perfection (not "council" BTW) raises many caveats around the processes by which this could be achieved. These also need to have low barriers while ensuring promptness, authentication, and systematic transitivity (i.e. if it gets fixed in one database, few folk know the fix is correct so it stays wrong in the others). As a practical example, maybe a pair of enantiomers needed some effort to resolve and were only tested post-publication. It should be possible (and would be in the spirit of the MIABE comment) to add the two new structures and activity values to the data set.

To offset the pontificatory stuff above I ought to ensure my posts include at least some real data. What turned out to be perhaps an unexpected "connection-chain" bonus of the OSDDMalaria open data sheet was that it chemicalized off-the-bat (see below).

While generating 135 structures from 98 rows is too much of a good thing, this is explained by the automatic dictionary look-up and structure conversion for some of the chemical homology terms used in the name column (e.g. "pyrrole"). The total conversion had two useful consequences. The first was that it was easy to load all 135 SDFs to the PubChem search interface and get the result below.

I can supply more detail on these numbers to interested parties but you can see the results from my custom filters in the top right of the picture. In summary, 93 of the 135 were already in PubChem via the chemicalize source. Of these, I have selected (via my filter) the 22 unique OSDDMalaria structures and stored them as a PubChem public collection. More is revealed in the two diagrams below for the most potent of the Sydney structures, CID 57515644 = OSM-S-38, at 1-7 nM IC50.

Interestingly, the only match in Google for the InChIKey GVGNOLWIUGQIHW-UKWGHVSLSA-N is ChemSpider CS 28296460 but this is an "orphan" with no source links. At this point it is necessary to explain that chemicalize.org currently has 297,082 structures in PubChem, of which 97,032 are source-unique, including the 22 indicated above. Any of these CIDs, by definition, will have only one substance entry, in this case SID 137266234 (top picture) that links through to chemicalize entry 467188 (lower picture), but the connection is actually made via an InChI string call-out. This squares the circle and takes you back "out" to the Google Docs data sheet we started from.

There are useful aspects to this connectivity. The most important is that you can locate activity results, synthetic design, or other data the team have added into any of the open sheet columns. Cross-mapping is a bit fiddly: you need to download the sheet and connect the CID rendering to the chemicalize rendering in the correct cell, or work the other way round by searching a specific structure from the sheet "back" into PubChem. Importantly, while these are "data-less orphans" in the PubChem record sense you can connect to data in the sheet via the chemicalized URL (so don't change the address BTW). In this case, the OSDD Malaria team are updating fairly close to real-time but synchronisation will depend on the time between when a user re-chemicalizes the Google Doc if new structures are added and the approximately monthly updates ChemAxon make in PubChem. Notwithstanding, as a connection route to SAR data this is non-ideal compared to the case where the records had come in via ChEMBL and therefore would have both BioAssay results and the Sydney team's code names as retrievable synonyms in the CID records.

Now, you may be asking "how on earth did this get into PubChem when the chemicalize source has today's date?" We need to look at the SID dates for an explanation (below).

To cut the story short, someone (probably me, but not necessarily) submitted a URL containing that structure to chemicalize some time before the first batch submission this August. This has been updated (re-submitted in this case) twice, hence we see three SID versions with the same structure. However, at run-time for the chemicalize call-out via the SID, you will see only the most recent entry if the URLs are redundant. We thus see today's date as picked up from the generation of the first picture in this post.

Now, it gets neater because you can now do a "similarity walk" from the chemicalize entry for this structure (1st pic below) and click on the entry at Tanimoto 0.617 that has 2 pages (2nd pic).
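For readers unfamiliar with the Tanimoto score quoted above, it is simply the number of fingerprint features two structures share divided by the number present in either. The sketch below uses made-up sets of "on" bits, since I do not know which fingerprint chemicalize uses internally; the bit positions and the function name `tanimoto` are purely illustrative.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) coefficient for two collections of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention chosen here: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical fingerprints: each integer stands for one structural feature.
query_bits = {1, 2, 3, 4}
hit_bits = {3, 4, 5, 6}

score = tanimoto(query_bits, hit_bits)
print(f"shared={len(query_bits & hit_bits)} "
      f"union={len(query_bits | hit_bits)} tanimoto={score:.3f}")
```

A score of 1.0 means identical fingerprints (not necessarily identical structures) and anything around 0.6 would typically be a recognisable analogue, although the absolute values depend heavily on the fingerprint type.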

More stories to cut short but this is a patent that I had hit via PubChem searches with OSM-S-38 back in the summer and I thus chemicalized (and pointed the Sydney team to the results in a Synaptic Leap post). Note here we see two non-redundant URLs because, while testing some options, I happened to surface two file names on Google sites. The structure from that patent, InChIKey UNSGQWFLWHHHQO-CXUHLZMHSA-N, connects round to CID 1227508. As it happens I had recorded similarity hits to other analogues from the patent, including CID 1359148. This has three patent source links but also a selection of chemical suppliers. It is thus not clear by what route these compounds got selected by the vendors, or if it was just prior-art in the patent.

Updates, 11 Jan 2013, a) an orthogonal backstory from one of my esteemed co-authors b) I am pleased to be involved in the planning of a repurposing workshop at BioIT 2013, April, in Boston.

One of the benefits of blogging is to be able to add extra bits of context to your own papers, in this case "Challenges and recommendations for obtaining chemical structures of industry-provided repurposing candidates" (PMID: 23159359). The genesis of our effort was that cross-blogging by myself and Sean, followed by slides from Antony, not to mention a few tweets on the same theme, eventually progressed to direct e-mail and Skype chats between the three of us. The fact that we were acquainted made this easy and while I can’t actually remember which of us first suggested a joint publication the other two affirmed the idea in short order and this was clinched by a positive response from the journal. We hope you enjoy the result, feedback by any route is welcome, and the following notes may be useful:

The version you can pull down now is proofed. On the grand scale of public embarrassment our un-proofed version would not rate that highly but you don’t need to see the sprinkling of typos so please substitute the final one. I also have to take responsibility for a grammatical clanger now enshrined in our PubMed abstract (if you didn’t spot it – fine). We fixed this in the proofs but advanced access let it through so I now need to attempt a retro-fix. While I am in favour of open access I don’t feel the need to apologize for this being behind a subscription wall at the moment because I cannot afford OA fees. Options are being explored but we’ll have to wait and see.

While more time may be needed for the “welcome to contact us” message from the paper (and in our blog posts) to spread around and about, there have been no responses from any of the companies concerned in regard to “their” code name-to-structure mappings (CNTSMs). My expectations for this happening were not high but, giving the benefit of the doubt, some scientists might have been willing in spirit while there may simply not be any official procedure for this.

For the record, only one synonym was specifically added to PubChem, by me as a proof of concept, for AZD 1656 (SID 136946384). I may add more but a) enhancements of the small-scale curated submission system in PubChem are imminent b) I would need some explicit encouragement (a couple of beers at least) for the effort c) as is implicit in the paper, doing this “backwards” is, in principle, not the optimal process and d) there is a likelihood that one or two CNTSMs could be wrong. Submitting these to PubChem would consequently saddle yours truly with the directly attributable provenance of a false-positive (i.e. my name in the SID record).

Data and updates: By definition we cannot update the supplementary data from the journal site but it is "fresh" anyway. I am not expecting a lot of new CNTSMs but some may dribble out in new publications. What I have done is update the September figshare data sheet to now include both the NCATS and the MRC data (note also that most of the SureChemOpen patent links are now available in PubChem). I was intrigued to see the figshare hits exceeded the blog counts. For the NCATS set this was 581 vs 479 and for the AZ/MRC set 285 vs 284. There are two other routes to the structures. The first is via the Open_Drug_Discovery_Teams mobile app, from which you can see a screenshot below.

In the drug repurposing section of the installed app, clicking on the entry in the RECENT section brings up the tweet, and clicking on that will show all the structures. The second way to pull the structures (but unfortunately not linked to most of the CNTSMs) is the MyNCBI public collection link referenced in the paper to 41 CIDs. Those looking for even more super-ferret details might care to look at the story for AZD 5904. Those interested in the in silico modelling data sets are welcome to contact Sean.

Since we submitted the paper, engagement and publicity levels in the two overlapping themes of repurposing and clinical data transparency have risen sharply, especially since there are at least two advanced access DDT articles (unknowingly in parallel with us so we could not cite them) for the former and @bengoldacre continues to tweet heavily on the latter (the "Guardian" articles I can give credence to, but the "Mail on Sunday", c'mon). Other relevant updates include the catchy byline “Roche, Broad band together to bring failed drugs back from the dead”. In this case no less than a 20-year back catalogue of 300 compounds will go into the Broad’s screening collection (but no details on CNTSMs or data release in PubChem BioAssay). My tweeted response to this was: why don’t all companies just drop their equivalent development back-catalogues into PubChem and the MLSMR collection forthwith, see what results flow into PubChem BioAssay, and who might eventually knock on their doors with interesting collaborative proposals?

The AZ_MRC collaborations have not only received a plaudit but AZ has also awarded 7 million in funding for 15 research projects. This attests to the success of the call and the quality of the accepted proposals. However, it also inspires curiosity as to what compounds made it through, why, and when any data might surface in the open. I also found this AZ video added useful context.

The by-line for this post could be "from famine to feast" but context is available from the SureChem press release (SC), the SureChem blog and a PubChem news flash (PC). I also wrote a short piece on the prescient Nature News announcement back in April (including some speculative CAS substances: CID factoring). More technical background can be found in previous posts on PubChem-pips-9-million-patent-structures and the analysis of PubChem-patent-sources-mw-slice-n-dice. The new submission by SC, completed over this past weekend, pushes PC to 46,697,871 and the total contribution of the four major patent sources to 14,507,457, of which 8,368,807 have an SC link in the CID. The first analysis we can look at (below) is overall numbers (blue) and unique content as defined by single-source CIDs (red).

Thus, 31% of CIDs now have at least one patent extraction link (although this would be slightly higher if the smaller sources with patent content, such as SLING and chemicalize.org, were also factored in) and 16% of these are from patent-only sources. Of the four major sources SC makes the largest absolute and unique contribution of 4,592,202. Below are successive intersects between SC and other sources, expressed as a % of those sources.
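The 31% figure follows directly from the counts quoted above. As a quick check (only the three totals come from the post; the variable names and the `pct` helper are mine), the shares can be recomputed in a few lines:

```python
# Counts as quoted in the post.
pubchem_cids = 46_697_871      # all PubChem CIDs after the new SC submission
patent_source_cids = 14_507_457  # union of the four major patent sources
sc_linked_cids = 8_368_807     # patent-source CIDs carrying a SureChem link


def pct(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


patent_share = pct(patent_source_cids, pubchem_cids)
sc_share_of_patent = pct(sc_linked_cids, patent_source_cids)
sc_share_of_pubchem = pct(sc_linked_cids, pubchem_cids)

print(f"patent sources / PubChem: {patent_share:.1f}%")
print(f"SC-linked / patent sources: {sc_share_of_patent:.1f}%")
print(f"SC-linked / PubChem: {sc_share_of_pubchem:.1f}%")
```

The same helper can be reused for any of the intersect figures once the underlying counts are to hand.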

The observation that SCRIPDB shares the highest proportion of content with SC is not unexpected as they are both automated Chemical Named Entity Recognition (CNER) pipelines with extensively overlapping document corpora. The IBM overlap is expected to be lower as the shared documents only extend up to 2000. Note that one in five of ChEMBL has an SC match, which we would predict to be the patent-then-publish subset. This drops if the non-literature assays are encompassed in the actives. The low intersect with Thomson Pharma warrants further investigation but may be related to isomer and tautomer differences between manual and automated extraction. Below you can see sources ranked by the PC query "covalent unit count = 2".

We could expect the majority of these to be salts but the query cannot discriminate these from mixtures and dimers. What surprised me slightly was that ChEMBL was one of the higher ones but it is not uncommon for active compounds in journal papers to be specified as salts (typically chlorides and acetates). Inspection of patents shows this also to be the case for many examples so the SureChem pipeline may be particularly effective at recognizing these. Inside PC it is possible to assess "quality", or at least an arguable proxy for it, by counting partially or unresolved chirality and E/Z forms (with this combined query). The results for patent sources and controls are shown below.

The first feature to note is that PC has, on average, ~1/3 of all structures selected by this combined filter (but dominated by 13 million unspecified chiral centers). Even a source such as ChEBI, with a documented high standard of expert curation (including time-per-compound), still has 17%. The much larger manually extracted collections of Thomson Pharma and ChEMBL come out somewhat higher. Not unexpectedly the CNER sources have the highest content but this is clearly dependent on the source IUPACs (e.g. the explicit use of R and S) as well as the automated text string recognition rules. We can extend the PC internal statistics to look at the union of all major patent sources. Below are shown the CID intersects, expressed as absolute numbers from the totals included.

The ROF + 250-800 Mw filter is an approximation to "lead-likeness" typical of most examples with activity data from medicinal chemistry patents. The MeSH pharmacology set is extensively covered and note the 97% match to established FDA drugs. While just over 48% of ChEMBL now has a nominal patent match this drops to 28% if only the 11.8 million union of SC, IBM and SCRIPDB is used for the intersect. The 10% difference is likely to be the journal-only content in Thomson Pharma. Note also the intersects shown above are exact matches, so by extension of the similarity envelope the coverage in both vendor and bioactivity space that can be utilized will be considerably larger. The next analysis is a Mw slice of the four sources, compared with ChEMBL and the whole of PC (below).

The first interesting feature is the bi-modal blip in PC as a whole (probably related to chemical vendors but I'll look at this later). Looking at SureChem (in red) the pattern is most similar to Thomson Pharma. The compounds with SAR from papers in ChEMBL have a slightly narrower spread of Mw, but note both these manually populated sources are picking up content above 1000 (mainly peptides). You can see the intersect for these large entries in Thomson Pharma and ChEMBL in this link (but order the display by downward Mw). If you do the same for the two automated CNER submissions not filtering Mw at source, SCRIPDB and IBM, you find 224 "strange big things", but mostly from non-pharmaceutical patents.

There are many aspects to this expansion of patent data in PC that will need to be explored, analyzed, assessed for utility and cogitated upon in the future. However, yesterday's jump to 14.5 million is a tipping point in that availability has crossed over into the majority. Regardless of how close this number may approach "all" useful chemical structures from patents, the extent of public database linking is now unequivocally on the side of "most". Updates to the SureChem submission in 2013 will include pipeline improvements and back-file loads from images and USPTO complex work units.

Update: Sequel post antimalarial-target-deconvolution-part II

To deconvolute, or not, that is the question: Whether 'tis Nobler in the mind to suffer The Slings and Arrows of outrageous target-centricity Or to take up phenotypic screening against a Sea of unvalidated targets?

This argument has come to the fore recently in the context of proposed solutions to increase pharmaceutical R&D productivity. Articles on this theme are appearing with increasing frequency in NRDD and DDT. A suggestion that has come up more than once is that there should be a return to more phenotypic screening for the selection of leads. There are a number of reasons offered for this but the primary one is that the past two decades of molecular mechanism of action (mmoa) approaches, underpinned by HTS, have been excessively target-centric. A secondary reason is that some assumed specifically targeted drugs turned out to be polypharmacologic in vivo. The data in support of these arguments seem convincing but, understandably, no one is suggesting that one approach should completely supplant the other. There are also obvious payoffs, given sufficient resources and luck, for experimentally elucidating mmoas in parallel with leads optimised from a phenotypic screen. This can be conceptualized as “reverse” target validation (i.e. here’s the in vivo lead so let's elucidate the mmoa) as opposed to “forward” validation (i.e. here are some in vitro target-specific modulators, let's see what they do in the disease models).

So what does deconvolution mean in the antimalarial context? For tropical parasites in general this has been reviewed in "Finding new hits in neglected disease projects: target or phenotypic based screening?" but we can focus on specific aspects. One of the assays widely employed in current antimalarial research uses levels of P. falciparum lactate dehydrogenase as a surrogate of parasite growth. This is a classic “bag of targets” phenotypic assay in that growth cessation could involve any one (or combinations) of the 100s of plausible mmoas that could bring parasite metabolism to a grinding halt at different cycle stages. While it's true that some established anti-infectives do not have an unequivocally established mmoa (e.g. Salvarsan) we can list arguments for doing the experiments to discern the molecular target that is causatively related to clinical efficacy.

1) The FDA and other regulatory authorities do not mandate the mechanistic underpinning of an effective medicine. Nonetheless, they would prefer the submission data package to be pharmacologically interpretable in the context of a mmoa.

2) The necessary mechanistic toxicology studies are at least partially facilitated by having a mmoa. For a future candidate it would seem a paradox if pharmaceutical companies typically run in vitro safety and side effect testing assays for at least 20 off-targets during pre-clinical assessment but would be unable to run an analogous panel for human homologues of the plasmodial target for the early addressing of specificity and dosing.

3) Developing a robust in vitro target-based HTS-compatible assay has advantages over a parasite cell assay including a) the experimental error is much lower b) it is more standardisable and transferable between labs and c) it could allow the screening of not only public (e.g. Molecular Libraries Probe Production Centers Network) or vendor-based collections but also those from enlightened companies. These could have the capacity to help out with HTS slots but would be less inclined to tool-up for a parasite assay. Hypothetically, just a handful of the larger companies could cover an aggregated chemical estate of at least ~ 10 million, of which maybe ~ 6-8 million could be novel structures. Because a downstream triage is already established (e.g. parasite assays, HepG2 cellular tox screen and infected mice) any exploitable new series could be progressed.

4) Ligand-based virtual screening and SAR exploration have been progressed using whole parasite screens alone. However, they run the risk of being confounded by high variability of IC50 results, polypharmacology (multiple targets) or cryptic mechanistic shifts (target switching). The same argumentation would imply that the pharmacophore models or clusters from high-quality in vitro SAR data are likely to be "tighter" and more suited to in silico drug re-purposing.

5) An entire domain of kinetic and thermodynamic optimization is inaccessible without knowing the primary binding target (e.g. Kon, Koff, target residence time, enthalpic/entropic binding etc., see "Target-drug interactions: first principles and their application to drug discovery" PMID 21777691).

6) Any target-specific compound series becomes a “system probe” and may validate particular pathways that can be explored for other intervention points or designed-in polypharmacology.

8) In addition to the obvious comparisons to mammalian sequences, orthologues in Plasmodia and other parasites can be tested for exploitable cross-reactivity, and the genomic variation between isolates can also be used to assess resistance propensities. Note that the ability to do the bioinformatics associated with this is binary in the sense that you can do a great deal with a target sequence but nothing without it.

9) On a very good day protein targets can be crystallized with ligands that are also effective in vivo. The provision of an X-ray structure (or a good model at a push) opens up not only docking as a VS option but also the possibility of fragment-based approaches (e.g. the recent collaboration of the UK academic 3D Fragment Consortium with SureChem to analyse patent space).

10) In the longer term experimental de-convolution synergises with in silico de-convolution. One of the practical translations of this will be when more in vitro screening data sets from a wider range of at least partially validated molecular targets, together with a broader chemical diversity from whole parasite results, are both deposited in public databases. The consequence should be that the chances of these separate approaches intersecting at the 2D and 3D chemical structure level increase.

12) New drug combination options will need to be explored for future antimalarial therapies. Target identification can put this on a more rational basis (e.g. hitting two pathways or a double-hit for one pathway).

13) The OSDD model changes the game compared to classical competitive commercial drug discovery in many ways. One of these is that everyone gains synergistically from openly shared scientific information on the progress of parallel efforts. The second is that the collective long haul for Malaria has the objective of bringing through a selection of good drugs over many years. Given this, with an established mmoa it would be faster to develop back-ups and follow-ons using HTS as the top of the triage. Thus, if global OSDD efforts worked up, say, 5 different chemical series for one target this would be a good thing (you could get a clinical log-jam but that’s a different issue). This scientifically logical outcome has little communal benefit in the classical competitive model because of a) at least 2 years of mutual blinding because of the IP “information shadow” b) only the few first or best “win out” commercially with all other efforts effectively wasted c) every good series accumulates a blocking patent thicket. We can illustrate this with ACE. The development of new inhibitors has largely ceased because cheap and effective generics make this commercially unattractive, not because it would be scientifically impossible to make improved drugs if IP and funding constraints were removed. In contrast, the OSDD concept of abrogating competition to encourage rationally diverged efforts could multiplex on a global scale (e.g. 5 groups > 5 targets x 5 series each). This could be more difficult to achieve if phenotypic screens were at the top of the triage (they will always be at the bottom by definition).
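As a footnote to point 5 in the list above, the kinetic quantities mentioned there are linked by textbook relationships for simple 1:1 binding: the dissociation constant is Kd = koff / kon, the residence time is 1 / koff and the dissociative half-life is ln 2 / koff. The sketch below just evaluates these; the rate constants are invented for illustration and do not refer to any real compound or target.

```python
import math


def binding_summary(k_on, k_off):
    """Derive Kd, residence time and half-life for simple 1:1 binding.

    k_on is in M^-1 s^-1 and k_off in s^-1 (both hypothetical inputs here).
    """
    if k_on <= 0 or k_off <= 0:
        raise ValueError("rate constants must be positive")
    return {
        "Kd_M": k_off / k_on,                    # equilibrium dissociation constant
        "residence_time_s": 1.0 / k_off,         # mean lifetime of the complex
        "half_life_s": math.log(2) / k_off,      # time for half the complex to dissociate
    }


# Two made-up compounds with the same affinity but different kinetics.
fast = binding_summary(k_on=1e7, k_off=1e-2)
slow = binding_summary(k_on=1e6, k_off=1e-3)

for label, result in (("fast", fast), ("slow", slow)):
    print(label, {key: f"{value:.3g}" for key, value in result.items()})
```

The two made-up cases have the same Kd but a tenfold difference in residence time, which is exactly the kind of distinction that cannot be explored from a whole-parasite IC50 alone.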

In a subsequent post (antimalarial-target-deconvolution-part II) I will review some of the options for inferring and elucidating mmoas.

Update: a) I enjoyed the recent online Open Meeting of the Open Source Drug Discovery for Malaria Project where some of this was briefly mentioned but may be deliberated on a future occasion, b) a new relevant paper has appeared on the graphical exploration of chemogenomic space covered by ChEMBL (PMID 23257198)

OK, so having outlined the advantages of identifying the molecular targets for a malaria whole-parasite screen, how does one go about it? (In terms of personal cred’, many moons ago I was involved in "Isolating and identifying a protective antigen of Schistosoma mansoni" (PMID:3284744).) Not unexpectedly this problem for Plasmodium has been recognised and addressed in several publications including "Target identification and validation of novel antimalarials" (PMID:21707315) and "Identification of inhibitors for putative malaria drug targets amongst novel antimalarial compounds" (PMID 20813141). These articles had access quirks in that both DOIs send you to publisher paywalls but open access PDFs are available from a reagent supplier in the first instance and PubMed Central in the second. These and related articles cover a lot of ground that need not be reiterated but what I can do here is add some context and connections that I hope may be useful in general and for the Sydney team in particular.

1) In silico deconvolution: This is based on the "similar compounds act on similar targets" assumption. Experimentally confirmed successes for this approach are not that numerous but there are precedents from other systems (e.g. "Chemical informatics and target identification in a zebrafish phenotypic screen", PMID 22179068). This approach was already explored for malaria in 2008 in "In silico activity profiling reveals the mechanism of action of antimalarials discovered in a high-throughput screen" (PMID:18579783) but this did not include experimental confirmation. More recently an approach has been used to computationally predict protein targets from compound structure information using statistical chemical structure models for ChEMBL protein targets. Matching is then done against parasite protein chemical clusters. The useful new UniProt mappings to ChEMBL allow species cuts to be made which establish that, of the 5,798 targets, 2,623 are human and 70 are from Plasmodium species and isolates (although the absolute numbers of active compounds will be biased towards popular human targets). These networks/clustering/modelling approaches are certainly worth pursuing and become more powerful as SAR data mapped to UniProt IDs accumulates with each ChEMBL release.

The basic implementation of this principle is to simply cross-check potent hits from parasite screens that have reported activity against human targets from papers, patents or PubChem BioAssays. The original GSK paper explored this theme (PMID 20485427) because their screening hits were biased towards compounds active against a range of historical human targets such as kinases, GPCRs and ion channels and some of these structures could be mapped back to patents. Here again it's unclear what precedents we might accept as an experimental proof of concept. The problem is the orthology gap between humans and Plasmodium (strictly speaking just homology as there will be few 1:1 true orthologues). This is nicely exemplified using estimates from TimeTree.

Even taking the lower estimates, a billion years divergence time is good news and bad news for antimalarial projects. The good news is that a) the sequence identity between an "average" human and plasmodial enzyme or other molecular target is likely to be low enough that cross-reactivity may not be an issue (or can be designed around anyway) and b) parasite physiology probably has therapeutically modulatable pathways with molecular components unique to that phylogenetic branch. The bad news is that inferential cross-species extrapolation of a mmoa becomes unreliable. For example, a series of human kinase inhibitors could show potency in the parasite assay but may have a parasite mmoa that has nothing directly to do with kinases.

2) Cross-docking: The malaria at home project (to which you can donate CPU time) has the ambitious plan to dock active compounds from parasite assays into any Plasmodium related structures with PDB entries or models with a reasonable template match. (The compounds will mainly be from the GSK and Novartis result sets but from other sources as well.) The statistics of this effort are impressive: the plan is to dock each of the 18,924 hits into structures of each of the 5,363 proteins. The problem is that coverage of plausible targets with defined active site pockets, or models of a quality high enough to give any credence to the docking score rankings, is unlikely to be high. The reason is that there are only 80 Plasmodium structures in UniProt. I also think they could have done some small-scale piloting with internal controls and selected potent ligands. Notwithstanding, this is more power to the collective elbow and we can look forward to the results as well as cogitating on what options for the confirmatory experimental triage would be most efficient.
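To put the scale of this cross-docking plan in perspective, the number of compound-protein pairs implied by the two figures quoted above can be worked out directly. This is only a back-of-envelope sketch: the CPU time per docking run is a hypothetical placeholder I have picked for illustration, not a figure from the project.

```python
# Figures quoted in the text above
n_hits = 18_924       # parasite-assay hits to be docked
n_proteins = 5_363    # Plasmodium protein structures and models

# Every hit is docked into every protein
n_dockings = n_hits * n_proteins
print(f"{n_dockings:,} compound-protein docking runs")

# ASSUMPTION: 1 CPU-minute per run is an illustrative placeholder only
assumed_cpu_minutes_per_run = 1.0
minutes_per_year = 60 * 24 * 365
cpu_years = n_dockings * assumed_cpu_minutes_per_run / minutes_per_year
print(f"~{cpu_years:.0f} CPU-years at {assumed_cpu_minutes_per_run} min per run")
```

Changing the assumed time per run rescales the CPU-years estimate linearly, which is why donated CPU time matters for an all-against-all design of this kind.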

3) Iteration between cells and targets. A practical key to expanding target deconvolution lies in lowering the technical and logistical barriers to iterative testing of compounds between a wide range (i.e. panels) of purified Plasmodium protein target assays and whole parasite screens. This obviously needs to be reciprocal with protein target hits tested against parasites and vice versa. Any high-scoring in silico predictions from chemical clustering as well as the malaria-at-home cross-docking (plus some low ones as controls) would have to be verified by this type of experiment anyway, which is another good reason for setting this up on a global OSDD basis. As mentioned in a Synaptic Leap post (Patent structures as reference compounds) selected lead compounds from patents and publications could be added into this process along with known antimalarials as internal standards. The end result could be a valuable "mmoa bank" as an adjunct to the MMV Malaria Box. However, the authors of "Identification of inhibitors for putative malaria drug targets amongst novel antimalarial compounds" deserve credit for telling us in no uncertain terms how this approach turned out for them, viz. "(our) data do not support our original proposition that target-based drug discovery can be accelerated by starting with cell-active compounds". The more optimistic conclusion would be that they had the tough luck to permutate the wrong commercial compounds against the wrong TDR-selected targets but they also discuss important methodological points that anyone else embarking on this difficult exercise needs to take heed of. However, I would ask the authors in their subsequent publications to please add standard chemical descriptors and UniProt IDs to what were image-only depictions in results, an example of which is shown below.

For the record PfSAHH = S-adenosyl-L-homocysteine hydrolase = P50250 and CID 6764 happened to be the only compound in the paper for which the enzyme and cell results were the "right" way round. There are some interesting connections here. This paper has not yet been extracted by ChEMBL so this particular target-chemistry pairing is not captured in PubChem BioAssay. However, the earlier parasite screening data is there, from Novartis (AID 449703) and independently from NCGC (AID 504832) but with just a tad lower IC50 of 0.64 uM. The P50250 entry not only connects to the PDB structure for the enzyme but also to the ChEMBL entry for inhibitors from other publications. The other point, that I can immediately infer from the low CID number, is that this is an old compound. In fact the MeSH term for phanquinone goes back to 1972 but what may be more significant is that this has been clinically tested as an antidiarrhoeal drug (repurposing anyone?).

The last aspect of target-to-cells I shall cover is to pick up on the recent successful case "Discovery and biochemical characterization of Plasmodium thioredoxin reductase inhibitors from an antimalarial set" (PMID: 22612231). I have not dug out the full paper from the Uni Library but have assumed one of the seven leads was in the graphical abstract (below), which, via OSRA, I mapped to CID 44528586.

Thus, as a methodological example of cell-to-target iteration, the GSK team have re-screened the actives from the parasite screen against one candidate target and successfully identified leads. However, the enzyme featured in the paper, P61076, is not yet picked up by ChEMBL but is linked to earlier publications (but no PDB entry). Notwithstanding the orthology complications related to mammalian paralogues 1, 2 and 3, one of the human alignments illustrates the divergence nicely (below, active site in red).

In this case the 45% identity (with gaps) gives room for finding or designing inhibitors with Plasmodium specificity plus acceptably low human cross-reactivity. This would also be a useful benchmarking example for the ChEMBL SAR model comparison approach as there are a few published human thioredoxin reductase inhibitors designed to overcome cisplatin resistance.

Omics signature comparisons. Five things this general approach has going for it are 1) it is "hypothesis neutral", 2) it could consequently elucidate unpredicted new mmoas, 3) the experimental profiling methodologies are continually improving in resolution and sensitivity, 4) they can (and should) be corroborative via parallel measurements of transcripts, proteins and metabolites and 5) the undertaking can be virtuously circular (i.e. a newly elucidated mmoa may signature-cluster with other new or old compounds). The underlying principle is to follow changes on adding a growth inhibitor. Time courses can provide the best resolution, ideally where the perturbation can also be experimentally reversed (e.g. pulse and wash-out) and the system profiled continuously as it reverts back to normal. It is clear that omics profiling of malarial parasites is technically demanding, few facilities will have the necessary instrumentation plus expertise and whatever inferences can be made from the results would need orthogonal experimental consolidation. However, there are dozens of new leads appearing from just the recent literature and patents alone. Few other experimental approaches can offer the chance of unifying mmoa classification by parallel profiling and clustering.

Annotation gaps. A direct measure of the curation status of any particular genome is the Swiss-Prot:TrEMBL ratio for a complete UniProt proteome set of predicted and cDNA-confirmed ORFs. The problem is that for Plasmodium falciparum (isolate 3D7) this is very low at 144:5,210. This has many consequences. The fact that only a fraction of the proteome has been curated and annotated to Swiss-Prot standards does not necessarily imply that the remaining 5,000 have never been eyeballed and analysed by any expert but it does mean there is no formal capture of such activities. Beyond TrEMBL there are other sources of automated annotation (e.g. PlasmoDB, TDR Targets Database and the Plasmodium falciparum 3D7 homepage on GeneDB). However, what the community really needs is expert manual functional assessments, pathway assignments, literature links etc. Transitive automated assignments are an essential part of this triage but will make a pig's ear of an unknown but substantial proportion. Consequently, many potential drug intervention points (possibly already being targeted by cell assay actives) are being overlooked. In addition the information yield from omics profiling experiments is directly related to annotation quality and depth.
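The Swiss-Prot:TrEMBL ratio quoted above is easier to appreciate as a percentage. A minimal sketch of the arithmetic, using only the two counts given in the text (the variable names are mine):

```python
swiss_prot = 144    # manually reviewed entries (figure quoted above)
trembl = 5_210      # automatically annotated entries (figure quoted above)

# Size of the complete proteome set and the share curated by hand
total = swiss_prot + trembl
curated_fraction = swiss_prot / total
print(f"{total:,} proteome entries, "
      f"{curated_fraction:.1%} curated to Swiss-Prot standard")
```

The same two-line calculation can be repeated for any other proteome to compare curation depth, provided the counts are taken from the same UniProt release.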

I had hoped to conclude this post with an assessment of how many defined antimalarial protein targets had been captured in PubChem BioAssay, with a view to exploring intersects with parasite growth assays. However, this quickly revealed some good ole' x-mapping "spaghetti" for no less than 304 protein targets and so this will have to wait for another time.

Update 24 Jan. A new CollabChem post (I think we were writing in parallel between the time zones!) describes using the TB mobile App with this data. The neat thing is you can now drop out the 11 prioritised cpds from the paper from MolSync via Dropbox (pic below). Analogous to the description below you can then upload these 11 SD files to PubChem to get 5 CID matches (pic below), as well as exploring the similarity neighborhood via any single SD upload. Just for the record I have now run all 777 SMILES from the first GSK worksheet to generate an open PubChem collection GSK TB 522 from 776. There is also a comment (at the end of the post) that compounds from this GSK release will be in ChEMBLdb release 15.

*****************************************

We can doff our hats again to GSK Tres Cantos for releasing a public data set for a second NTD, this time for compounds active against Mycobacterium in: "Fueling Open-Source Drug Discovery: 177 Small-Molecule Leads against Tuberculosis" (PMID 23307663). There are a lot of things we could pick up here but as GSK are promising more results from these hits it makes no sense to pontificate until these appear. What I can do is outline a first-pass triage. As a consumer/user of such data I can also point out minor imperfections but these are offered constructively (i.e. not looking a gift horse...). The interesting precedent is the data set that can be downloaded as an Excel sheet from ChEMBL-NTD. The authors include a supplementary data PDF but it's not explained how the table in this relates to the Excel layout. The good news is you can see the 177 images (my guess is these were dropped out from ChemAxon's Jchem-for-excel); the bad news is this makes the PDF like treacle to navigate, on my PC at any rate. It would also have been useful if one of the Excel worksheets had been laid out the same way as Table 1 from the manuscript.

Because ChEMBL-NTD is a repository (i.e. not a database) in this case it is serving as a pick-up point. As valuable as this is, in the goodness of time it is to be hoped that (analogous to the GSK malaria data precedent) this set, and/or that from any subsequent manuscript, gets instantiated as a PubChem BioAssay (directly, or indirectly via ChEMBLdb). Since a recent ChEMBL/GSK paper has explicitly recommended the linking of chemical structures to bioactivity data in public databases (PMID 23088264) I'm sure they are collectively on the TB case.

In the meantime, we have to import and transform to do anything, including intersecting with major public databases. Now, I'll wager that GSK have the 177 structures x-mapped against all sources you can possibly imagine, but it would not have been much extra work just to have added PubChem CIDs in the data sheet for the ~ 65% with matches. So what can we do in the interim for some preliminary slicing and dicing? My first step was to put the 177 SMILES up on an open URL to run through chemicalize.org. You can see the result below.

Chemicalize converted 172 of the 177, plus the ATP in the column heading. OK, so we now download the SMILES and upload them to PubChem to produce the result below.

I don't have time to go through the triage details against my own MyNCBI filters (top right) or the PubChem default ones (bottom right) but there are some interesting cuts here. You can also "do this at home" because, in the spirit of OSDD, I have instantiated my chemicalizable lists, including all those referred to below, into https://sites.google.com/site/cdsouthantest/home . In addition the 117 CIDs matching the 177 GSK (parent) compounds are now in a public MyNCBI collection (GSK_TB_Jan_2013_117).

Note the matches from this set are low on patents (16), low-ish on literature results (26 and mainly ChEMBL), very low on known pharmacology (1) but high on vendors (102). So, I hear you ask, were the "missing" 60 matches a) proprietary GSK structures once-upon-a-time or b) not canonically novel but, for whatever reason, structures for which the chemicalize transformation from the GSK SMILES strings won't match existing CIDs?

We can explore this with some cross-checks.

1) The first thing is a round-tripping control. This means uploading the different SMILES flavours to PubChem search direct from the GSK Excel file (i.e. not via chemicalize). This reassuringly gives exactly the same result of 117 CIDs

2) Dropping the search stringency? This does not work because you get a complete neighborhood "mushroom" for the known CIDs as well as the novels. For example, even a 99% sweep for the 172 times out but eventually hits 654. Although "same connectivity" is more manageable at 186, if you think about it, this can't work either because the known CIDs will also have multiple matches.

3) What I chose was to bring in the very useful Venny. Let me show the result first and then explain how it was generated (below).

This is basically intersecting the 177 "up" SMILES with the 117 "down" SMILES to partition the subset of the 177 that were not in PubChem. The key here is you need to normalise the SMILES strings by running the 117 CIDs "back" in chemicalize. Then Venny will simply do the set intersects (OK you could have done this in PP, JMP, Access etc. but this was quick). Granted, it has not worked perfectly (i.e. there seems to be an overcount and the 15 should not be there) but we have isolated those SMILES strings that are PubChem -ve.

4) Next up is to chemicalize just these 71 (below)

5) The problem is the PubChem upload gives 15 matches, when it should have not given any so I repeated the whole process with InChI strings. This worked OK not only because GSK SMILES > chemicalize InChIs are the same as the PubChem ones but also Venny coped with the long strings just the same for set identity matches. While I was still left with some discrepancies I ended up with 61 PubChem -ve InChIs (if anyone wants to explore these anomalies just get in touch). You can see the results below from the mysite url (gsk_tb_notpubchem_61_inchi) and you can repeat the chemicalization to take your own downloads.
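For anyone who would rather script the Venny step in the cross-checks above, the same exact-string partition can be sketched with Python sets. This is my own illustration, not what was actually run for this post: the `partition` helper is hypothetical and the identifier strings are toy placeholders, not structures from the GSK set.

```python
def partition(query_ids, pubchem_ids):
    """Split query identifiers into PubChem matches and non-matches.

    Matching is by exact string identity, so both lists must have been
    normalised by the same tool beforehand (as described in the text).
    """
    query = {s.strip() for s in query_ids if s.strip()}
    known = {s.strip() for s in pubchem_ids if s.strip()}
    return sorted(query & known), sorted(query - known)


# Toy placeholder strings, NOT real structures from the GSK set
gsk = ["InChI=1S/A", "InChI=1S/B", "InChI=1S/C ", "InChI=1S/D"]
pubchem = ["InChI=1S/B", "InChI=1S/D", "InChI=1S/E"]

matched, novel = partition(gsk, pubchem)
print(len(matched), "in PubChem;", len(novel), "PubChem -ve")
```

Stripping whitespace and using sets removes stray spaces and duplicate lines, but it cannot rescue two different string renderings of the same structure, which is one reason to prefer InChI over SMILES for this kind of identity matching.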

For reasons I can't work out (as mentioned above) 5 of the 61 still have CID matches. Nevertheless, at least a proportion of the rest are bona fide "novel" structures (i.e. canonically PubChem -ve) originating from the Tres Cantos screening deck of ~ 2 million. You can see one example below.

On the right is the chemicalize instant archiving of the web pages generated for this blog. This structure has only a single 90% match in PubChem to one of the GSK antimalarial hits, CID 44527169. This attempt to track structural uniqueness in the GSK TB set is not only clunky but you also lose register with the code names and data mappings in the original sheet. However, this faffing about would be obviated if, taking the precedent from the fully-linked GSK antimalarials in Assay ID 2306, the structures had similarly been processed to unique or pre-existing CIDs. Fortunately, in the interim, chemicalize.org will submit the structures to PubChem (probably within a month of this post). We should thus end up with close to a full house of 177 CIDs but the links to data would only be indirect via my web pages.

Beyond this wait-and-see option we can still try some quick slices and dices. A logical first approach is, in PubChem syntax, [Compounds, activity concentration at/below 1 uM for PubChem BioAssay (Search "Mycobacterium" AND (pcassay_pccompound_activityconcmicromolar[filt]))]. Roughly translated this means "get me CIDs that were hits of moderate potency from any assay with Mycobacterium in the title". The answer is 3,707 (which unfortunately exceeds the MyNCBI public collection saving cap of 1K). The next step is a simple intersection with the GSK 117. The three in-common are shown below.

It's not a lot but it demonstrates the approach works because they are independent confirmed cell-screen actives (and are purchasable from suppliers). The links, definitely worth exploring, are, from the top, CID 322727, CID 1266902 and CID 665824. The last one is notable because this compound was in the Molecular Libraries collection (MLSMR) (and obviously the GSK Tres Cantos collection) so, as you can see below, we can get a cross-screening snapshot.

We can discern here the encouragingly clear specificity for the confirmed TB cell activity of nearly 20-fold over the other 690 assays (and 11uM is hardly "active" anyway).

As my last trick for this post I made a preliminary attempt at target deconvolution by trying to track down Mycobacterium purified protein assays. The bad news was the simplest query ("pcassay_pcsubstance_active"[filt] AND pcassay_protein_target[filt] AND "Mycobacterium"[TaxonomyName]) produced zilch. It turns out there is a protein target species indexing problem (I think) related to mixed source database title names. For example the assays retrieved above certainly included purified targets (e.g. Competitive inhibition of Mycobacterium tuberculosis DHQ2). But, because this one came as a literature extraction from ChEMBL, the BioAssay system parsed the UniProt name as "RecName: Full=3-dehydroquinate dehydratase; Short=3-dehydroquinase; AltName: Full=Type II DHQase" as opposed to the Entrez RefSeq name "3-dehydroquinate dehydratase [Mycobacterium tuberculosis H37Rv]". Thus it won't retrieve as a species target.

Not to be completely outdone I tried a retrieval of ChEMBL targets by species (Mycobact) via UniProt. This worked fine, as you can see below. However, I'll have to leave the problem of pivoting the chemistry from these entries, and trying some different approaches in PubChem BioAssay to isolate Mycobacterium protein target actives, for another day.

For those of you new to chemicalize.org it is a unique open application that recognizes different types of chemical names in text sources and converts them into structures. It can thus disinter many millions of these from their document tombs. It not only features in some of my own posts but there is also a ChemAxon UGM video, from yours truly, and the resource has its own blog. Yesterday it got its name in lights auspiciously on PubChem News. What's also nice is that I have had some degree of engagement with both headlines below (this post relates to the SureChem one).

As a PubChem submitting source chemicalize.org is unique because it is compiled from "user-selected" content. This does not mean anyone necessarily has a burning curiosity about each of the 2000-odd IUPAC structures chemicalize can extract from a patent, especially perhaps the common reagents, but, by definition, they were "interested" enough (e.g. probably in the protein target and/or the disease indication) to actively retrieve and chemicalize that specific document. It thus becomes "collectively crowd-sourced" content at a rate of 15K user hits per week (and rising).

Via links from the announcement and those above you can quickly get a utility overview. What I can do here is a quick "slice and dice" of the current content. This is complementary to the interesting breakdown provided by the origin of the extraction queries. Of the many-fold categorizations we can do in PubChem the principal ones I go for as a first pass are to slice by patents, papers, activity, vendors and unique content (single source CIDs) plus some others that you can see below.

There are some technical caveats and complications here but I'll just pick up the salient ones. The first is that I do most of my analysis at the CID level. For chemicalize.org the SID:CID ratio is 300987:297083 which at 1.013:1 means almost complete chemistry rule congruence for collapsing SIDs into CIDs.
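The SID:CID ratio above is simple enough to check by hand, but a short sketch makes the arithmetic explicit. It uses only the two counts quoted in the text; the variable names are mine.

```python
sids = 300_987   # chemicalize.org substance records (figure quoted above)
cids = 297_083   # compound records they collapse into (figure quoted above)

ratio = sids / cids
# SIDs over and above one per CID, i.e. those that share a CID with another SID
surplus_sids = sids - cids
print(f"SID:CID = {ratio:.3f}:1 ({surplus_sids:,} SIDs share a CID)")
```

A ratio this close to 1 means only a small fraction of the submitted records collapse onto one another under the PubChem rules.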

Rule-of-five content at 71% is high (c.f. 59% for ChEMBL as predominantly active compounds from papers) which indicates the interest focus on drug-like space. However, the absence of a lower cut-off brings in a lot of common chemistry so to assess the more lead-like space I impose a higher Mw range of 250-800 which drops this to 38% (c.f. 58% for ChEMBL). Unsurprisingly, since patents are the major user-selected source by document origin, the structure overlap is also high at 69% for all sources (SureChem, IBM, SCRIPDB and Thomson Pharma) with 52% for SureChem being the largest patent-only source. Note that coverage of DrugBank is high (60%) but for ChEMBL is low (3%). One of the reasons for the latter is that, unlike patents, there are very few med chem journals with open URLs that users can just run for extractions.

The analysis below divides up the chemicalize submissions by the number of sources (i.e. SIDs-per-CID)

There are two positive indications here. The first is that this source is 22% unique as defined by the formation of a single CID according to the PubChem Chem rules. The corollary is that 78% of the submissions are thus structurally confirmed by at least one other source. The important thing is that chemicalize generates its structures completely independently (it's possible some users could be converting 1000s of SMILES downloaded from PubChem but this is unlikely!). Thus any CID structures with 2 or more independent SIDs are, on balance, more likely to be both "right" rather than both "wrong". Interestingly, of the 50 double-SID sources, 87% are one from chemicalize plus one of the four patent sources. As the SID count increases we get towards common chemistry. For example, approved drugs have, on average, 68 SIDs-per-CID and there is a chemicalize.org entry for aspirin (SID 137001131) which has 180 single-substance SIDs.

There is certainly value in being able to identify common chemistry in a document (e.g. drugs, metabolites and synthetic reagents) or a Wiki page. However, database users may have more interest in the less-common structures and the 60K unique ones in particular. The absolute numbers are modest compared to other sources but remember these have been actively "dug out" by chemicalize.org users and were PubChem -ve before that. We can illustrate this with a couple of examples. During the course of helping out the Sydney OSDD Malaria team I was able to chemicalize one of their open lab book pages and pull down 14 structures of newly synthesized or on-order compounds (below).

As we know this kind of research takes a while to get the full papers out, but pending this, chemicalize duly piped the six novel structures through to PubChem six months ago, as you can see below.

Thus, within a month or two of the structures being instantiated in the open lab book they were in PubChem for anyone, anywhere in the world, working on anything, to make an exact or similarity search connection maybe a year or more before a paper appears. I'll take a nano-credit for being (I think) the first person to throw that particular URL at chemicalize, but of course it could have been anyone else, or any other open lab book. One of these unique structures (CID 57515644) is a low nM lead for the antimalarial project. The source link is shown below.

Note here, even with the CID "Create Date: 2012-08-08" from PubChem the web page has updated in real time because I re-chemicalized it (just now) to re-check the PubChem matches. The last example will be a prospective one, using the GSK TB hits blogged about earlier this week (see background). You can see the results of chemicalising 777 TB actives from an open web page (below).

739 of these chemicalized but gave only 506 CID matches. Thus, potentially 233 erstwhile proprietary GSK screening collection structures should come through to PubChem via the next chemicalize submission, which I will add here in due course.
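The numbers in this prospective example form a simple funnel, sketched below using only the three counts quoted in the text (the variable names are mine). The final figure is the potential number of new PubChem structures, not a confirmed one, since some "unmatched" strings may turn out to be normalisation mismatches rather than genuinely new compounds.

```python
submitted = 777     # TB actives on the open web page (figure quoted above)
converted = 739     # structures chemicalize produced (figure quoted above)
cid_matches = 506   # of those, already in PubChem (figure quoted above)

not_converted = submitted - converted
potentially_novel = converted - cid_matches
match_rate = cid_matches / converted

print(f"{not_converted} not converted, {potentially_novel} potentially novel, "
      f"{match_rate:.0%} of converted structures already in PubChem")
```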

I am looking forward to BioIT World this April where I am organising a Drug Repurposing Workshop, supported by three excellent faculty companions. I also have a presentation in Track 11 and possibly a poster or two, the abstracts for which are with collaborators at the moment. Once the show is over I hope to get these out on slideshare (maybe figshare as well).

The purpose of this post is not only to get the word around for the workshop but also to offer what might be called “generic” abstracts which I would be pleased to present, well, just about anywhere (but cost-neutral at least). For the record the same goes for anything in my slideshare sets (or even the longer blog posts!) where I am first author but of course these are somewhat after the fact. The Boston trip is more or less covered for that week but Friday 12th could be an option if the venue is in or close to Boston. Outside this any other options can be considered (e.g. Sweden, UK or Germany) and interested parties are welcome to contact me directly. Note also the eventual presentations can be customised according to specific host interests. I may be able to add additional abstracts depending on the fate of manuscripts in the pipe. The current abstracts are pasted below.

******************************************************

Digging out public structures for repurposing candidates: Introducing non-competitive intelligence.

Repurposing (REP) tendering calls by the Medical Research Council (MRC UK) in 2011 and by the National Center for Advancing Translational Sciences (NCATS, US) in 2012 included data for industry-provided clinical candidates, but without chemical structures. This presentation begins with tips and tricks for resolving company code names using public sources, including Google, to facilitate in silico REP research. Eventually ~ 40 could be assigned PubChem CIDs, but ~ 30 remain blinded (see PMID: 23159359). Issues that will be described include a) retrieval "noise" from code name synonyms and variants, b) ad hoc disclosure timings, c) obfuscation where blinding is maintained until the INN stage, d) lack of transparency via missing clinical trial results and blinded published data for structures and e) options to improve name-to-structure disclosures and sharing. The concept of "non-competitive intelligence" is introduced for the general identification and data mining of the many 1000s of primary efficacy failures with REP potential from all open sources, with a view to possible collaborations (i.e. differentiated from "competitive intelligence"). Examples will be shown of intersecting code names between clinicaltrials.gov, company portfolio sites, PubChem searches, and back-mapping structures to data from patents.

******************************************************

Closing the information gap between chemistry and biology.

Progress in the biomedical sciences is critically dependent on chemical information and the structure activity relationships (SAR) of bioactive molecules. This applies to drug discovery, pharmacology, chemical biology, metabolomics and other knowledge domains. However, the entombing of much of this potentially linkable data within the text of patents, papers, abstracts and web pages has been a major barrier to progress. This presentation outlines trends in the public domain that are dramatically lowering these barriers. These include: PubChem approaching 50 mill, ChemSpider 30 mill, ChEMBL collating SAR for 0.8 mill structures linked to their targets from 50K medicinal chemistry papers, over 20 mill abstracts in PubMed, 650K PubChem BioAssays, open tropical disease data sets, momentum towards open access journals, proliferation of full-text patent sources and the recent SureChemOpen deposition bringing PubChem patent-extracted structures to over 14 mill. This presentation will also outline non-specialist practical options that enable anyone to explore joins across sources for their own research, particularly for establishing document-to-document and document-to-database links. These include the PubChem toolbox, protein targets in UniProt and PubChem BioAssay, chemicalize.org for text name-to-structure conversion, OSRA for image-to-structure conversion, Venny for set comparisons and InChIKey searching in Google. Combined use of these will be exemplified for selected protease inhibitors that can be joined between patents, papers, abstracts, chemical database entries and drug target protein sequences.

******************************************************

BACE1 and BACE2: From drug target discovery to protein evolution and back

The Beta-site APP-cleaving enzyme 1 (BACE1) was confirmed as the long sought after beta-secretase and an Alzheimer’s disease drug target in 1999. However, the role of BACE2, published in 2000, proved elusive until a 2011 paper implicated it as a TMEM27 secretase controlling pancreatic beta-cell proliferation and, consequently, a new target for diabetes. These aspartyl proteases thus become the only pair of human drug target paralogs with 50% sequence identity, for completely separate major therapeutic indications, in different tissues, with the same enzyme mechanism, but whose initial target validation was separated by over a decade. In this new context, discerning their evolutionary history becomes particularly important because of the still incomplete picture of physiological functions for both human enzymes. Using new genome and transcript data an evolutionary trajectory of BACE-like protein sequences can be traced back to the emergence of Ur-BACE ancestral sequences over 0.7 billion years ago. These have 30-40% identity to contemporary BACE1 and BACE2 that first appeared as duplicated paralogs in teleost fish. The complex pattern in basal metazoans suggests an early neurological role for the Ur-BACE but its presence in lineages with no APP or TMEM27 could have important implications for hitherto undetected substrates and functions of contemporary BACE1 and BACE2. Thus, functional genomics on the Ur-BACE or post-duplication paralogs in model organisms could reveal important new insights. The story comes full circle with the recent functional de-orphanisation and early stage target validation of BACE2. Consequently, the 1000s of BACE1 inhibitors are now joined by the first tranche of BACE2-specific inhibitors published in recent patents from Roche and other companies. The new availability of specific probe compounds for both enzymes opens up experimental options related to functional profiling from rodents and fish down to sea squirts and oysters.
Resolving new mammalian data in terms of the evolutionary history of both these proteases and their putative substrates becomes not only relevant to possible inhibition side effects but also for rationally deciding the optimal selectivity for AD or diabetes drug candidates.
