Canonicalization

Leonid Chepelev
Hello,

I was converting SDF molecules into canonical SMILES, but in my case,
I would like to keep the correspondence between the (order of
appearance) index of the SDF file atoms and the (order of appearance)
index of the atoms in the newly created SMILES string.

Would you say this is possible without first making the conversion and
then doing graph isomorphism? Since I am doing large numbers of
conversions, efficiency is of great importance and this proposition is
not efficient at all, it would seem.

I know this is probably very simple, but I have not gone too much into
detail of the inner workings of OpenBabel, so it's difficult for me to
solve currently. I appreciate any advice anyone here may offer.

Regards,

Leonid Chepelev

------------------------------------------------------------------------------
Download Intel® Parallel Studio Eval
Try the new software tools for yourself. Speed compiling, find bugs
proactively, and fine-tune applications for parallel performance.
See why Intel Parallel Studio got high marks during beta.
http://p.sf.net/sfu/intel-sw-dev
_______________________________________________
OpenBabel-discuss mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss

Re: Canonicalization

Chris Morley-3
Canonical SMILES has the atoms in the same order whatever the input
order (which is the reason it is used).

Because the atoms can be in any order in the SDF, an ordinary
(non-canonical) SMILES with the same order might have to be very
strange. For formaldehyde, HCHO, if the atoms in the SDF were in the
order HOHC, the best SMILES I can find is:
  [H]1.O=2.[H]3.C123
but there may be something better. Anyway, OpenBabel does not have an
option to produce forms like this.

However, in the reverse direction, I think the atom order in an output
SDF will be the same as an input SMILES. So to get SMILES and SDF
identically ordered you could first convert to SMILES and then convert
back to SDF.

Chris


Re: Canonicalization

Geoffrey Hutchison
In reply to this post by Leonid Chepelev
> appearance) index of the SDF file atoms and the (order of appearance)
> index of the atoms in the newly created SMILES string.

This would no longer be a canonical SMILES -- it would be a "regular" SMILES, where there may be several SMILES strings with different ordering.

There are ways to create a canonical SMILES and then re-order the SDF appropriately. If you're curious, I can write up code which would do that. (I don't think it's an option for SDF output yet.)

Hope that helps,
-Geoff

Re: Canonicalization

Leonid Chepelev
Thank you, Chris, though I must say that canonical SMILES definitely
re-orders the atoms to create an algorithm-unique ordering (except for a
couple of buggy cases with tautomers and aromatic systems). I
believe that in some "regular" SMILES, though, the ordering is often the
same as in the input SDF.

Geoffrey, thank you for your answer. Yes, I would really like that
re-ordering of the SDF atoms - I believe an option like that already
exists for non-canonical smiles: to output the atom coordinates in the
order they appear in canonical form. So, I was wondering, does the
following command already do the trick, and if so, what's the point of
.can output?

babel -isdf yournoncanonicalsdf.sdf -osmi yoursmilesfile.smi -xcx

The -xc part should output canonical smiles, and -xx gives me the
X-coordinate of the atoms in the order they appear in the canonical
SMILES string. Is this usage safe/correct? Is the canonicalization
algorithm in -osmi -xc the same as in -ocan ? Because if that's the
case, no extra work needs to be done.

Thank you all so much for your attention and help.

Leonid Chepelev

On Sat, Feb 20, 2010 at 9:14 PM, Geoffrey Hutchison
<[hidden email]> wrote:
>> appearance) index of the SDF file atoms and the (order of appearance)
>> index of the atoms in the newly created SMILES string.
>
> This would no longer be a canonical SMILES -- it would be a "regular" SMILES, where there may be several SMILES strings with different ordering.
>
> There are ways to create a canonical SMILES and then re-order the SDF appropriately. If you're curious, I can write up code which would do that. (I don't think it's an option for SDF output yet.)
>
> Hope that helps,
> -Geoff


Re: Canonicalization

Geoffrey Hutchison
> babel -isdf yournoncanonicalsdf.sdf -osmi yoursmilesfile.smi -xcx
>
> The -xc part should output canonical smiles, and -xx gives me the
> X-coordinate of the atoms in the order they appear in the canonical
> SMILES string. Is this usage safe/correct? Is the canonicalization
> algorithm in -osmi -xc the same as in -ocan ? Because if that's the
> case, no extra work needs to be done.

If all you want is the XYZ coordinates, then you can certainly use that method. There is no difference between using "-xc" to indicate that you want canonical SMILES and using the so-called "can" format. There's just more than one way to do it.

Hope that helps,
-Geoff

Re: Canonicalization

Leonid Chepelev
Well, as I said, I really only want to preserve the mapping of the
order of the atoms as they appear in SDF and atoms as they appear in
the canonical smiles string. It would seem that using the output
options I specified, as a post-conversion step I could simply look for
the order of the atoms having the indicated combination of the
reported X and Y coordinates.

For some reason, the -xx option does not output the Z coordinate even
though the coordinate is non-zero in my input. Would it be easy for
someone to change that, so that Z-coordinate is printed? Because if my
original molecule is planar (e.g. benzene) and is aligned in the xz
plane, the X and Y coordinates may not identify the benzene ring atoms
uniquely, and the process I've just outlined won't work. Of course, it
would be even better if someone actually added an option in -x such
that the original atom positions (the indices in the SDF) were reported
instead of the X and Y coordinates.
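
The post-conversion matching step described above might look roughly like the following C++ sketch. It is only an illustration: `Point2D`, `matchByCoordinates`, the sentinel constants and the tolerance value are hypothetical names and choices, not part of OpenBabel, and the points are assumed to have been parsed already from the SDF and from the coordinate list written next to the SMILES. A tolerance is used because the text output may round the coordinates.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point2D { double x; double y; };

const int kNoMatch = -1;    // no original atom lies within the tolerance
const int kAmbiguous = -2;  // two or more original atoms share these X and Y values

// For each point reported in canonical order, look up the index of the
// original atom with the same (x, y) within `tol`.
std::vector<int> matchByCoordinates(const std::vector<Point2D>& original,
                                    const std::vector<Point2D>& canonical,
                                    double tol = 1e-3) {
    assert(tol >= 0.0);
    std::vector<int> canonToOrig;
    canonToOrig.reserve(canonical.size());
    for (const Point2D& c : canonical) {
        int found = kNoMatch;
        for (std::size_t j = 0; j < original.size(); ++j) {
            const bool same = std::fabs(original[j].x - c.x) <= tol &&
                              std::fabs(original[j].y - c.y) <= tol;
            if (!same) continue;
            if (found != kNoMatch) {   // a second candidate: cannot decide
                found = kAmbiguous;
                break;
            }
            found = static_cast<int>(j);
        }
        canonToOrig.push_back(found);
    }
    return canonToOrig;
}
```

The `kAmbiguous` result is exactly the failure case mentioned above: for a planar molecule lying in the xz plane, several atoms can share the same X and Y, so a Z coordinate (or the original index itself) is needed to tell them apart.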

Thank you so much, Geoff!

On Sun, Feb 21, 2010 at 7:37 AM, Geoffrey Hutchison
<[hidden email]> wrote:

>> babel -isdf yournoncanonicalsdf.sdf -osmi yoursmilesfile.smi -xcx
>>
>> The -xc part should output canonical smiles, and -xx gives me the
>> X-coordinate of the atoms in the order they appear in the canonical
>> SMILES string. Is this usage safe/correct? Is the canonicalization
>> algorithm in -osmi -xc the same as in -ocan ? Because if that's the
>> case, no extra work needs to be done.
>
> If all you want is the XYZ coordinates, then you can certainly use that method. There is no difference between using "-xc" to indicate that you want canonical SMILES and using the so-called "can" format. There's just more than one way to do it.
>
> Hope that helps,
> -Geoff


Re: Canonicalization

Craig James-2
In reply to this post by Leonid Chepelev
Leonid,

> I was converting SDF molecules into canonical SMILES, but in my case,
> I would like to keep the correspondence between the (order of
> appearance) index of the SDF file atoms and the (order of appearance)
> index of the atoms in the newly created SMILES string.
>
> Would you say this is possible without first making the conversion and
> then doing graph isomorphism? Since I am doing large numbers of
> conversions, efficiency is of great importance and this proposition is
> not efficient at all, it would seem.
>
> I know this is probably very simple, but I have not gone too much into
> detail of the inner workings of OpenBabel, so it's difficult for me to
> solve currently. I appreciate any advice anyone here may offer.

As Chris said, it is not practical to write SMILES with the atoms in a specific order.

I suggest you use the more "traditional" way.  You write out the canonical SMILES, and you also write out an atom-mapping string that correlates the canonical order to the original order.  For example:

  CCO  ==>  OCC 2,1,0

  c1c(O)cccc1 ==> Oc1ccccc1  2,1,0,6,5,4,3

The canonical atom order is already stored as a string.  I haven't compiled this example, but it shows the idea:

      // After the canonical SMILES has been written, the atom order
      // is attached to the molecule as a data string.
      if (mol.HasData("Canonical Atom Order")) {
        string canorder = mol.GetData("Canonical Atom Order")->GetValue();
        ofs << " " << canorder << endl;
      }

Once you have this string, you can use it to build a simple array that maps the canonical order back to the original order, or vice versa.
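
One way to build that array is sketched below in plain C++, with no OpenBabel dependency. The helper names `parseAtomOrder` and `invertAtomOrder` are hypothetical, and both the separator handling (commas or whitespace) and the index base are assumptions; check what your OpenBabel version actually stores before relying on either. The sketch also assumes that every original atom appears exactly once in the order string.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Parse an atom-order string such as "2,1,0" or "3 2 1" into integers.
// Any non-digit character is treated as a separator (an assumption).
std::vector<int> parseAtomOrder(const std::string& text) {
    std::vector<int> order;
    std::string token;
    for (char ch : text + " ") {  // the trailing space flushes the last token
        if (std::isdigit(static_cast<unsigned char>(ch))) {
            token += ch;
        } else if (!token.empty()) {
            order.push_back(std::stoi(token));
            token.clear();
        }
    }
    return order;
}

// canonToOrig[i] is the original index of the i-th atom in the canonical SMILES.
// The returned origToCanon[j] is the canonical position of original atom j.
// `base` is 0 for zero-based indices (as in the examples above), 1 for one-based.
std::vector<int> invertAtomOrder(const std::vector<int>& canonToOrig, int base = 0) {
    std::vector<int> origToCanon(canonToOrig.size(), -1);
    for (std::size_t i = 0; i < canonToOrig.size(); ++i) {
        const int orig = canonToOrig[i] - base;
        // Every entry must name a valid original atom.
        assert(orig >= 0 && static_cast<std::size_t>(orig) < origToCanon.size());
        origToCanon[orig] = static_cast<int>(i) + base;
    }
    return origToCanon;
}
```

With the second example above, `invertAtomOrder(parseAtomOrder("2,1,0,6,5,4,3"))` gives an array whose entry 3 is 6: original atom 3 is the last atom in the canonical SMILES.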

Craig


Re: Canonicalization

Craig James-2
In reply to this post by Leonid Chepelev
Leonid Chepelev wrote:

> For some reason, the -xx option does not output the Z coordinate even
> though the coordinate is non-zero in my input.

Probably because I was in a hurry when I wrote it...

> Would it be easy for
> someone to change that, so that Z-coordinate is printed?

If you're in a hurry, just modify line 3958 of smilesformat.cpp:

    ofs << atom->GetX() << "," << atom->GetY();

to

    ofs << atom->GetX() << "," << atom->GetY() << "," << atom->GetZ();

But don't check that back in.  It really needs to be a different option, like "z" instead of "x".

Craig



Re: Canonicalization

Leonid Chepelev
In reply to this post by Craig James-2
Thank you for both of your suggestions, Craig. Both are informative
and will do the job. I need a result quickly right now, so I
think I'll go with your suggestion to modify the appropriate line to
make babel print out the Z coordinate for now, and will work on your
other suggestion when I write a clean and efficient
version of my code (soon).

But in any case, all of this solves my problem really well.

Thank you very much for your help!

On Sun, Feb 21, 2010 at 10:19 AM, Craig A. James <[hidden email]> wrote:

> Leonid,
>
>> I was converting SDF molecules into canonical SMILES, but in my case,
>> I would like to keep the correspondence between the (order of
>> appearance) index of the SDF file atoms and the (order of appearance)
>> index of the atoms in the newly created SMILES string.
>>
>> Would you say this is possible without first making the conversion and
>> then doing graph isomorphism? Since I am doing large numbers of
>> conversions, efficiency is of great importance and this proposition is
>> not efficient at all, it would seem.
>>
>> I know this is probably very simple, but I have not gone too much into
>> detail of the inner workings of OpenBabel, so it's difficult for me to
>> solve currently. I appreciate any advice anyone here may offer.
>
> As Chris said, it is not practical to write SMILES with the atoms in a
> specific order.
>
> I suggest you use the more "traditional" way.  You write out the canonical
> SMILES, and you also write out an atom-mapping string that correlates the
> canonical order to the original order.  For example:
>
>  CCO  ==>  OCC 2,1,0
>
>  c1c(O)cccc1 ==> Oc1ccccc1  2,1,0,6,5,4,3
>
> The canonical atom order is already stored as a string.  I haven't compiled
> this example, but it shows the idea:
>
>     if (mol.HasData("Canonical Atom Order")) {
>       string canorder = mol.GetData("Canonical Atom Order")->GetValue();
>       ofs << " " << canorder << endl;
>     }
>
> Once you have this string, you can use it to build a simple array that maps
> the canonical order back to the original order, or vice versa.
>
> Craig
>


Re: Canonicalization

Leonid Chepelev
In reply to this post by Craig James-2
> As Chris said, it is not practical to write SMILES with the atoms in a
> specific order.
>
> I suggest you use the more "traditional" way.  You write out the canonical
> SMILES, and you also write out an atom-mapping string that correlates the
> canonical order to the original order.

Oh, and I forgot to add, just to clarify, that I never wanted to write
SMILES with atoms in a specific order. All I wanted was that
atom-mapping string, that is, to know that atom number x in my input
SDF is atom number y in my output canonical SMILES. I am sorry that I
wasn't clear about my problem; it certainly led to a little bit of
misunderstanding.

Again, thank you to everyone who has answered!
