Updates: the unmapped code names have been appended to the end of this post for crowdsourcing, a subsequent post has added the AZ/MRC compound list, and we now have a paper out on the combined results.
Interest in the NCATS set of 58 compounds for academic repurposing has recently been invigorated by two slide presentations at the Philadelphia ACS meeting, blog posts and some personal e-mail contacts. It turns out I was neither alone in pointing out the problems associated with project tendering for blinded clinical candidates nor in expending some effort in trying to map the names to structures (the other groups are mentioned in the slides from Antony Williams and the blog from Sean Ekins). Suitably inspired, I managed to track down three more name > strucs, as shown in the image hits below.
While OSRA did well on some previous images it only picked up a ring or two for these three new cases so I actually had to sketch them. The "orphan" provenance of a Taiwanese chemical supplier for the JNJ39393406 structure is interesting and somewhat unusual (did they pick it from SciFinder perhaps?). I could find no corroboration for the SMILES output from the sketcher (C1=CC(=CC2C1OC(O2)(F)F)NC3=N[N](C(=N3)C4=CC=NC=C4)CCC(N(C)C)=O) because this had no exact matches or high similarities in PubChem or SureChemOpen. However, the information supplied by Janssen to NCATS specifies the compound as a "positive allosteric modulator at the nicotinic α7 receptor" and the closest match in PubChem is CID 24850110 (below).
Not only does this look like a plausible analog of the vendor structure but it also has a SureChemOpen exact match to US-20090253691-A1 from Janssen, where the abstract states: "invention particularly relates to positive allosteric modulators of nicotinic acetylcholine receptors". Lo and behold, browsing the PDF revealed the vendor structure as compound 33 on page 40 (below).
This is listed with a pEC50 of 6.2, which is mid-potency within the range covered in the large SAR table on page 74. The interesting corollary here is that SureChemOpen has not yet completed their image extraction back-fill so this is likely to be dropped-in eventually (see patent mining section below). So there we have it ... possibly. If anyone from Janssen is prepared to corroborate the identity of JNJ39393406 I would be pleased to acknowledge this in an update. Note that it is arguably more important for them to do this if the vendor structure-to-code-name assignment is wrong, rather than right! (see Live-chemical-structure-blogging). Below I have included my revised identification list (now with three more structures than the previous post) as brief provenance descriptions with PubChem CID links.
To consolidate this update I searched each of the CIDs against SureChemOpen. This was done with the canonical SMILES string, starting with exact matches but, if these were negative, backing off to a similarity search. The links presented below are generally the oldest and presumed first publication. Note that the three older compounds with INNs have been named as prior art and in mixtures in hundreds of patents. It should be possible to use date cutting in SureChemOpen to find the earliest filings as IUPACs or image-extracted structures (but I can't be bothered just now). The other thing I have not done is check each publication to see if the presumed assignee, target, SAR data, etc. tally with those in the NCATS PDFs, but the ones I glanced at seemed to fit (anyone interested in details can contact me). Note that 30 out of 33 patent whacks is not bad going and indicates that, at least for this set, most structures have been exemplified and successfully extracted rather than being specified only in a Markush nest. The results are listed below.
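The exact-then-similarity back-off described above can be sketched roughly as follows. This is only an illustration of the logic, not how the searches were actually run (they were done by hand in the SureChemOpen interface). The names `search_with_backoff`, `exact_search` and `similarity_search` are hypothetical, and the two search functions are assumed to be whatever callables you wire up to your own search service.

```python
from typing import Callable, List, Tuple

# A search function takes a SMILES string and returns a list of hit identifiers
# (e.g. patent numbers). Both callables are placeholders supplied by the caller.
SearchFn = Callable[[str], List[str]]


def search_with_backoff(
    smiles: str,
    exact_search: SearchFn,
    similarity_search: SearchFn,
) -> Tuple[str, List[str]]:
    """Try an exact-structure search first; fall back to similarity only if it is empty."""
    hits = exact_search(smiles)
    if hits:
        return "exact", hits
    hits = similarity_search(smiles)
    if hits:
        return "similarity", hits
    return "none", []


if __name__ == "__main__":
    # Stub searches standing in for a real service (hypothetical results).
    no_exact = lambda s: []
    some_similar = lambda s: ["PATENT-X"]
    print(search_with_backoff("CCO", no_exact, some_similar))
```

Recording which mode produced the hit ("exact" or "similarity") keeps the provenance of each patent link visible, which matters when the mapping itself is still tentative.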
The utility of patent mapping (n.b. there are additional open patent links via PubChem sources for some of these entries) in the context of in silico and/or in vitro investigations on these compounds is at least threefold. Firstly, some may include substantially larger SAR data sets (e.g. IC50 tables) than were eventually included in journal articles. Secondly, they may include other unpublished biological and/or ADMET data. Thirdly, analogs that are very useful (essential even?) for a range of comparative investigations will not only have their synthesis routes described but, one might assume, in cases where the NCATS proposals have been approved, the companies concerned could also donate them.
We can compare the current small-molecule efforts as outlined in the Collaborations-to-get-the-ncats-library-of-industry-provided-reagents post, where it is reported that Chris Lipinski found 36 (via SciFinder, Thomson Reuters Integrity and web searches) and Tudor Oprea et al. found 41 (via IBM US Patents database, Google and publications). This leaves me trailing in third place with 33 structures, but note that no commercial databases were used and some relevant publications were not on the Göteborg University Library subscription list. I did receive some useful comments on my original post, including the Google images trick.
There are a lot of interesting corollaries to all of this but I shall just introduce some brief ones here (they also depend on intersecting the three sets to determine concordance). The first is that it would be useful to know what the sources were for the three or more mappings that I "missed" but were presumably explicitly curated in SciFinder and/or Thomson. The reason is that these products, comprehensive as they are, cannot (I presume) disclose proprietary mappings even via company CDAs because their content is licensed to many users (~ 0.3 million globally?). Thus any code-name-to-struc they capture has to have a public primary source (including subscription publications) even if this is just a meeting poster or slide image that never got Google crawled. The only possible exception I can think of is where CAS may be in possession of a code-name-to-struc as a necessary prerequisite for an INN and/or USAN application, but presumably it cannot disclose to users until the WHO PDF has appeared. The second corollary is code-name-to-struc occurrence in patents. This is unlikely to be in first-filings because the identity of the eventual clinical candidate (that they may not have selected or given a development code to at filing time anyway) is exactly what applicants generally try to obfuscate but also exemplify and claim as an IUPAC. Code names can thus only be back-mapped to structures in the early filings (as in the list above). I have come across code names with their associated IUPACs in patents but these tend to be associated with later filings of formulations or combinations and not the first disclosure of a code-name-to-struc.
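Intersecting the three mapping sets for concordance could be done along these lines. This is a minimal sketch that assumes each effort can be exported as a dictionary of code name to some structure identifier (CID, InChIKey or similar). The function names and the example data are hypothetical placeholders, not the real lists.

```python
from itertools import combinations


def concordance(mappings):
    """Pairwise comparison of code-name-to-structure mappings.

    mappings: dict of curator label -> dict of code name -> structure identifier.
    Returns, per pair of curators, how many codes both mapped, how many
    mappings agree, and which shared codes point to different structures.
    """
    report = {}
    for (a, map_a), (b, map_b) in combinations(mappings.items(), 2):
        shared = set(map_a) & set(map_b)
        agree = {code for code in shared if map_a[code] == map_b[code]}
        report[(a, b)] = {
            "shared": len(shared),
            "agree": len(agree),
            "conflict": sorted(shared - agree),
        }
    return report


def missed_by(mappings, who):
    """Codes that at least one other curator mapped but `who` did not."""
    others = set().union(*(set(m) for name, m in mappings.items() if name != who))
    return sorted(others - set(mappings[who]))


if __name__ == "__main__":
    # Placeholder data only; these are not real code names or structures.
    example = {
        "set_1": {"CODE-A": "struct-1", "CODE-B": "struct-2"},
        "set_2": {"CODE-A": "struct-1", "CODE-B": "struct-9", "CODE-C": "struct-3"},
    }
    print(concordance(example))
    print(missed_by(example, "set_1"))
```

The conflicts are arguably the most interesting output, since two independent efforts assigning different structures to the same code is exactly the case that needs a primary source to settle.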
Last but not least, here are the sharing bits:
1) The links above should be live (but you will need the free SureChemOpen sign-up for the patents)
2) The complete Excel sheet is available for download at http://figshare.com/articles/NCats_Compounds_with_identifications/92850
3) You can now "View my collection, '32 NCATS CIDs' from NCBI". If you open these up there is a lot of information in the constitutive filters on the right-hand side, including 15 active in assays and 15 available from vendors. Note also you can save this to your own MyNCBI, perform a range of analyses with the PubChem toolbox and download the structures as a set of SD files or any other format.
4) As a test, I have submitted one new synonym to PubChem in the form of AZD1656 in SID 136946384. I may do more but I am awaiting imminent enhancements to their submission system and I would also prefer to eventually do this collaboratively, so the mapping provenances can be independently corroborated (perhaps even by the companies concerned?) before they become enshrined in the PubChem synonym compilations.
Addendum 25 Aug. Those small-molecule codes that I have been unable to map, or that remain equivocal, are pasted below (but note other parties may have dug some of them out). If anyone can resolve any of these from declarable sources (without necessarily being personally held to their provenance, unless they were the project leader or portfolio manager!) they are most welcome to post such new information (e.g. even just a pointer to an image) and thereby be attributed for extending the mappings. Ideally they could add a comment to this blog post but any open channel would do.
ABT-639
LY2828360
SSR150106
AZD2423
JNJ-39269646
PF-05190457
BMS-820132
ABT-288
PF-04995274 publication links are PubMed -ve
LY2590443
SD-7300/SC-81490 referenced in PMID 20726512 but points to SC-78080/SD-2590
AZD1236 possibly in PMID: 21624491 but TTD-only mapping to CID 56603698
BMS-830216
AZD5904 (TTD-only mapping to CID 177992)
SAR103168
CP-601927, CP-601,927 possibly in PMID: 21594972
SD-6010 (SC-84250) assuming SC-842 possibly in PMID: 17672879
AZD7268
AVE0847
PF-05019702 (PRA-27) = WAY-257027
AZD9056 possibly in PMID: 21440623
LY2245461
SSR97225
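Since the same code appears in the wild with and without hyphens, commas or spaces (e.g. CP-601927 vs CP-601,927, or JNJ39393406 vs the hyphenated JNJ- form), anyone searching for these may want to normalise the variants first. A small sketch follows, where `normalize_code` and `code_keys` are hypothetical helper names and the rules are only an assumption about which punctuation is safe to ignore.

```python
import re


def normalize_code(name: str) -> str:
    """Collapse hyphen, comma, space and case variants of a code name into one key."""
    return re.sub(r"[\s,\-]", "", name).upper()


def code_keys(entry: str) -> list:
    """Split slash-separated aliases (e.g. 'SD-7300/SC-81490') and normalise each."""
    return [normalize_code(part) for part in entry.split("/") if part.strip()]


if __name__ == "__main__":
    print(normalize_code("CP-601,927"))
    print(code_keys("SD-7300/SC-81490"))
```

Normalised keys are only for matching; the original spelling should still be kept alongside, because that is what will appear in the source document.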