Replacement of aromatic.txt SMARTS patterns with switch statement

Noel O'Boyle
Administrator
Hi there,

Here's a heads-up on some work I've been prototyping.

The aromatic atom typer currently uses the SMARTS patterns in aromatic.txt
to assign max/min values of pi electrons. A more efficient approach is
to match against all of the patterns simultaneously rather than
one at a time and, well, to avoid using SMARTS at all.

I've attached a Python prototype that shows the general idea - see the
function getMinMax (the calls to IsAromatic will have to be removed,
but are unavoidable here; the "elif"s will become a switch statement; I
need to think some more about explicit hydrogens). To my mind, the use
of a direct lookup is as clear as, if not clearer than, using SMARTS
patterns.
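
For readers without the attachment, the general shape of such a lookup might be sketched as follows. This is not the attached aromatic.py: the Atom record, the get_min_max name and the electron counts (simple Hückel-style bookkeeping) are assumptions made for illustration, and the values are not taken from aromatic.txt.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Atom:
    """Hypothetical stand-in for an atom; not the Open Babel OBAtom API."""
    atomic_num: int    # 6 = C, 7 = N, 8 = O, 16 = S
    charge: int        # formal charge
    connections: int   # neighbours, counting hydrogens however they are stored
    in_ring: bool


def get_min_max(atom: Atom) -> Optional[Tuple[int, int]]:
    """Return the (min, max) pi electrons an atom may donate, or None.

    The elif chain plays the role of the switch statement: each case
    stands in for what would otherwise be one SMARTS match.
    All values are illustrative only.
    """
    if not atom.in_ring:
        return None
    key = (atom.atomic_num, atom.charge, atom.connections)
    if key == (6, 0, 3):                    # ring carbon, three connections
        return (1, 1)
    elif key == (7, 0, 2):                  # pyridine-like nitrogen
        return (1, 1)
    elif key == (7, 0, 3):                  # pyrrole-like nitrogen (lone pair)
        return (2, 2)
    elif key == (7, 1, 3):                  # protonated ring nitrogen
        return (1, 1)
    elif key in ((8, 0, 2), (16, 0, 2)):    # furan-like O, thiophene-like S
        return (2, 2)
    return None                             # no case applies


print(get_min_max(Atom(7, 0, 3, True)))
```

Each atom is classified in a single pass over its own properties, instead of running every pattern over the whole molecule.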

I note that the existing tests don't hit all of the patterns, and
while I can find molecules in ChEMBL that hit almost all of the
patterns, I'm not sure whether I can find ones where the corresponding
atom turns out to be aromatic in the end. I have a feeling this is
because the patterns were added in response to dodgy SMILES (e.g.
using n instead of [nH]) which were reported or found by Geoff.
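
One way to measure which rules a set of molecules exercises is sketched below. The rule table and atom keys are hypothetical placeholders, and the sketch only records that a rule fired; it does not check the second condition mentioned above, namely whether the atom ends up aromatic.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Key = Tuple[int, int, int]   # (atomic number, formal charge, connections)

# Hypothetical rule table; the values are placeholders, not aromatic.txt.
RULES: Dict[Key, Tuple[int, int]] = {
    (6, 0, 3): (1, 1),
    (7, 0, 2): (1, 1),
    (7, 0, 3): (2, 2),
    (7, 1, 3): (1, 1),
    (8, 0, 2): (2, 2),
}


def rule_hits(atoms: Iterable[Key]) -> Counter:
    """Count how often each rule fires over a collection of atom keys."""
    hits: Counter = Counter()
    for key in atoms:
        if key in RULES:
            hits[key] += 1
    return hits


def untested_rules(atoms: Iterable[Key]) -> Set[Key]:
    """Rules that no atom in the collection exercises."""
    return set(RULES) - set(rule_hits(atoms))


# Atom keys for pyridine: five CH carbons and one ring nitrogen.
pyridine = [(6, 0, 3)] * 5 + [(7, 0, 2)]
print(sorted(untested_rules(pyridine)))
# prints [(7, 0, 3), (7, 1, 3), (8, 0, 2)]
```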

Regards,
- Noel

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
OpenBabel-Devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/openbabel-devel

Attachment: aromatic.py (9K)

Re: Replacement of aromatic.txt SMARTS patterns with switch statement

Geoffrey Hutchison
I think it's a great idea. Chris Morley had recommended similar concepts in terms of implicit valence.

Yes, many of the stranger SMARTS patterns here are for "dodgy" SMILES that should retain aromaticity. It's possible, perhaps, to add some "if it was initially flagged as an aromatic atom, be more lenient" rules in the code.

I'd like to continue the concept of an annual release, so in the meantime, I think experiments are welcome.

-Geoff

On Fri, Jan 27, 2017 at 3:03 AM, Noel O'Boyle <[hidden email]> wrote:

Re: Replacement of aromatic.txt SMARTS patterns with switch statement

Geoffrey Hutchison
I should mention, on that note, that a collaboration with Carnegie Mellon students produced a parallel implementation of Kekulization using the Eigen3 matrix library. They also wrote a CUDA implementation that was modestly faster.

It hasn't been ported back to Open Babel yet, but I'll leave the basic code (MIT license) here:
https://github.com/NarainKrishnamurthy/chemposer

Anyone interested should let me know.

Cheers,
-Geoff

On Fri, Jan 27, 2017 at 5:13 PM, Geoffrey Hutchison <[hidden email]> wrote:

Re: Replacement of aromatic.txt SMARTS patterns with switch statement

Noel O'Boyle
Administrator
Great. One question I've run into is what the intention was of the D2
etc. in the SMARTS patterns. Was it the number of heavy atom neighbors?
As written, it's the number of explicit neighbors in the graph, which is
complicated by the fact that OB's SMILES parser currently adds an
explicit H for H's inside square brackets, e.g. [CH-]. So if the
patterns were developed by testing on SMILES, then the intended
D-value is somewhat unclear for patterns that typically match atoms
with hydrogens but which are written with implicit hydrogens. Confused?
I am too. :-)
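
The ambiguity can be made concrete with a toy graph. The sketch below assumes a minimal Node class (hypothetical, not Open Babel's OBAtom) and compares three counts for the same ring N-H, once with the hydrogen stored as a graph node and once as an implicit count.

```python
class Node:
    """Hypothetical graph atom used only for this illustration."""

    def __init__(self, symbol, implicit_h=0):
        self.symbol = symbol
        self.implicit_h = implicit_h   # hydrogens not stored as nodes
        self.neighbours = []


def bond(a, b):
    """Connect two nodes in both directions."""
    a.neighbours.append(b)
    b.neighbours.append(a)


def explicit_degree(atom):
    """What SMARTS 'D' counts: every neighbour present in the graph."""
    return len(atom.neighbours)


def heavy_degree(atom):
    """Neighbours that are not hydrogen."""
    return sum(1 for n in atom.neighbours if n.symbol != "H")


def total_h(atom):
    """Attached hydrogens, whether stored as nodes or as a count."""
    explicit_h = sum(1 for n in atom.neighbours if n.symbol == "H")
    return explicit_h + atom.implicit_h


def ring_nh(h_as_node):
    """A ring N-H with two carbon neighbours; the H is stored either way."""
    n = Node("N", implicit_h=0 if h_as_node else 1)
    bond(n, Node("C"))
    bond(n, Node("C"))
    if h_as_node:
        bond(n, Node("H"))
    return n


for h_as_node in (True, False):
    n = ring_nh(h_as_node)
    print(h_as_node, explicit_degree(n), heavy_degree(n), total_h(n))
# prints:
# True 3 2 1
# False 2 2 1
```

Only the explicit degree changes between the two representations; the heavy-atom degree and the total hydrogen count are the same in both.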

- Noel

On 27 January 2017 at 22:17, Geoffrey Hutchison
<[hidden email]> wrote:

> I should mention on that note, that a collaboration with Carnegie Mellon
> students produced a parallel implementation of Kekulization using the Eigen3
> matrix library. They also wrote a CUDA implementation that was modestly
> faster.
>
> It hasn't been ported back to Open Babel yet, but I'll leave the basic code
> (MIT license) here:
> https://github.com/NarainKrishnamurthy/chemposer
>
> Anyone interested should let me know..
>
> Cheers,
> -Geoff

Re: Replacement of aromatic.txt SMARTS patterns with switch statement

Noel O'Boyle
Administrator
Maybe I'm overthinking this. If it doesn't change the final output (as
regards aromatic SMILES) on ChEMBL, maybe it's not worth worrying
about now.

- Noel

On 30 January 2017 at 18:31, Noel O'Boyle <[hidden email]> wrote:

Re: Replacement of aromatic.txt SMARTS patterns with switch statement

Noel O'Boyle
Administrator
OK - I'm getting somewhere now. I've confirmed that there is a problem
with the current codebase and its use of 'D': for protonated
imidazole (as in histidine in vivo), for example, two different answers
are found depending on whether hydrogens are explicit or not:

C:\Users\noel>obabel -:Cc1[nH]c[nH+]c1 -osmi
Cc1[nH]c[nH+]c1

C:\Users\noel>obabel -:Cc1[nH]c[nH+]c1 -omol -d | obabel -imol -osmi
CC1=C[NH+]=CN1

My plan is to alter the current matcher such that results on SMILES
from ChEMBL, PubChem and eMolecules are unchanged (it's just minor
tweaks for a few of the charged patterns, e.g. D3 might become D2 plus
one H). Personally, I'd also like to remove any "patterns" that aren't
triggered by (aromatic atoms in) molecules in any of these databases,
on the basis that it's better to have a set of patterns that we know
are correct (and all covered by test cases) and exclude the few extra
that may or may not be as intended. However, I won't do this unless
you agree.
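
The "D3 might become D2 plus one H" tweak could look something like the sketch below. The AtomView record, both function names and the (1, 1) value are assumptions made for illustration; they are not taken from aromatic.txt or from the Open Babel source.

```python
from typing import NamedTuple, Optional, Tuple


class AtomView(NamedTuple):
    """Hypothetical per-atom summary; field names are illustrative."""
    atomic_num: int
    charge: int
    explicit_degree: int   # neighbours stored in the graph (SMARTS D)
    heavy_degree: int      # non-hydrogen neighbours
    total_h: int           # explicit plus implicit hydrogens


def by_explicit_degree(a: AtomView) -> Optional[Tuple[int, int]]:
    """Rule in the style of a D3 pattern: depends on how the H is stored."""
    if (a.atomic_num, a.charge, a.explicit_degree) == (7, 1, 3):
        return (1, 1)
    return None


def by_heavy_degree_and_h(a: AtomView) -> Optional[Tuple[int, int]]:
    """The same rule restated as 'two heavy neighbours and one H'."""
    if (a.atomic_num, a.charge, a.heavy_degree, a.total_h) == (7, 1, 2, 1):
        return (1, 1)
    return None


# The [nH+] nitrogen of protonated imidazole in two representations.
h_as_atom = AtomView(7, 1, explicit_degree=3, heavy_degree=2, total_h=1)
h_implicit = AtomView(7, 1, explicit_degree=2, heavy_degree=2, total_h=1)

for view in (h_as_atom, h_implicit):
    print(by_explicit_degree(view), by_heavy_degree_and_h(view))
# prints:
# (1, 1) (1, 1)
# None (1, 1)
```

The rule keyed on explicit degree misses the atom once the hydrogen becomes implicit, which matches the pair of obabel results above; the restated rule gives the same answer for both representations.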

Regards,
- Noel


On 30 January 2017 at 22:07, Noel O'Boyle <[hidden email]> wrote:

Re: Replacement of aromatic.txt SMARTS patterns with switch statement

Geoff Hutchison
> Personally, I'd also like to remove any "patterns" that aren't
> triggered by (aromatic atoms in) molecules in any of these databases,
> on the basis that it's better to have a set of patterns that we know
> are correct (and all covered by test cases)

I’d be fine with some pruning. I’ve been making a pass through the next version of the PQR data set, and I’m seeing weird broken N-atom aromatic systems with the current development version.

-Geoff

Re: Replacement of aromatic.txt SMARTS patterns with switch statement

Noel O'Boyle
Administrator
Would it be okay to do that as part of a separate pull request? That is,
if there are no other concerns, could you merge it as is? The easiest
way to do the pruning is to comment out the relevant code one piece at a
time and see whether the results change (for the worse). This will take
some time, but it would be useful to have this code merged, so that
the other changes related to the aromatic typer (global state) can be
merged together.
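
The pruning procedure described above (disable one case, rerun, compare) could be automated along these lines. This is only a sketch under stated assumptions: the rule table and corpus are hypothetical, and the comparison is on per-atom ranges, whereas a real run would compare the final aromatic SMILES written for each molecule.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Key = Tuple[int, int, int]          # (atomic number, charge, connections)
Rules = Dict[Key, Tuple[int, int]]  # key -> (min, max) pi electrons


def assign(rules: Rules, atoms: Iterable[Key]) -> List[Optional[Tuple[int, int]]]:
    """Look up the (min, max) range for every atom key; None if no rule."""
    return [rules.get(key) for key in atoms]


def removable_rules(rules: Rules, corpus: List[List[Key]]) -> Set[Key]:
    """Disable one rule at a time; report those whose removal changes nothing."""
    baseline = [assign(rules, mol) for mol in corpus]
    unused = set()
    for key in rules:
        trial = {k: v for k, v in rules.items() if k != key}
        if [assign(trial, mol) for mol in corpus] == baseline:
            unused.add(key)
    return unused


# Placeholder rules and a two-molecule corpus (pyridine-like, furan-like).
RULES: Rules = {
    (6, 0, 3): (1, 1),
    (7, 0, 2): (1, 1),
    (7, 0, 3): (2, 2),
    (8, 0, 2): (2, 2),
}
CORPUS = [
    [(6, 0, 3)] * 5 + [(7, 0, 2)],
    [(6, 0, 3)] * 4 + [(8, 0, 2)],
]
print(removable_rules(RULES, CORPUS))
# prints {(7, 0, 3)}
```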

One thing also to think about is whether we would consider supporting
alternative aromaticity models going forward. John Mayfield (né May) has
recently described the Daylight aromaticity model on the OpenSMILES
list, and implementing it is simply a question of using a different
set of switch statements (as far as I can tell).

- Noel

On 10 February 2017 at 18:18, Geoffrey Hutchison
<[hidden email]> wrote: