cansmi of molfile vs cansmi of cansmi

TJ O'Donnell
In building a database of PubChem and other publicly available structures,
I'm converting from molfile to isomeric cansmi.  In several cases (about 7 per thousand)
I get one cansmi from the molfile, but a different one when I ask for the cansmi of the
resulting cansmi.  For database integrity reasons, cansmi(cansmi) = cansmi should
always be true, so this is getting me into trouble for these structures.

Sometimes it is double-bond stereo flipping, sometimes an aromaticity problem.
It also seems to have something to do with symmetric structures.

I've attached an example sdf file with 20 structures, all of which show this behaviour.
I hope someone can take a look and see if I'm doing something odd or how I
might correct this.

Below is a sample babel session showing the problem on my Linux system
with Open Babel 2.2.3.
I haven't tried Windows, Mac, or other babel versions.

Thanks,
TJ O'Donnell

tj@ubuntu:~/chord/openbabel$ babel -i sdf -o can <cantest.sdf  >cantest.smi
20 molecules converted
1 info messages 591 audit log messages
tj@ubuntu:~/chord/openbabel$ babel -i smi -o can <cantest.smi  >cancan.smi
20 molecules converted
6 info messages 646 audit log messages
tj@ubuntu:~/chord/openbabel$ diff -y can*.smi
[Cu].c1ccc2C3=NC4=NC(=NC5=NC(=NC6=NC(=NC(=N3)c2c1)c1ccccc61)c | [Cu].c1ccc2C3=NC(=NC4=NC(=NC5=NC(=NC6=NC(=N3)c3ccccc63)c3cccc
OC(=O)CCC(=O)c1ccc2C3C=CC=C4C=CC=C(c2c1)C34 10617      | OC(=O)CCC(=O)c1ccc2c(c1)c1cccc3cccc2c13 10617
[Cl-].C[NH+](C)CC[n]1c2ccccc2[n](C)c2ccccc2c1=O 9418      | [Cl-].C[NH+](C)CCn1c2ccccc2n(C)c2ccccc2c1=O 9418
[Cl-].C[N+](=C1/C=C/C(=C(c2ccccc2)c2ccc(cc2)N(C)C)C=C1)C      | [Cl-].C[N+](=C1\C=C\C(=C(c2ccccc2)c2ccc(cc2)N(C)C)C=C1)C
O=[C]1=[NH]C2=C(N1)[C](=O)=[NH][C](=O)=[N]2C 11804      | O=c1[nH]c2c([nH]1)c(=O)[nH]c(=O)n2C 11804
[O-][N+](=O)c1cc2ccc3cccc4ccc(c1)C2C34 13090      | [O-][N+](=O)c1cc2ccc3cccc4ccc(c1)c2c34 13090
C[C@@H]1CCCC1C 13197      | CC1CCC[C@H]1C 13197
Cl.NC(=N/C(=N/CCc1ccccc1)/N)N 13266      | Cl.NC(=N\C(=N\CCc1ccccc1)\N)N 13266
[O-][N+](=O)C1=CC=C2c3ccccc3C3=CC=CC1C23 13462      | [O-][N+](=O)c1ccc2c3ccccc3c3cccc1c23 13462
CCN(CC)c1ccc(cc1)C(=C1C=CC(=[N+](CC)CC)/C=C/1)c1ccccc1.[O-]S( | CCN(CC)c1ccc(cc1)C(=C\1C=CC(=[N+](CC)CC)\C=C1)c1ccccc1.[O-]S(
CCN(CC)c1ccc(cc1)C(=C1C=CC(=[N+](CC)CC)/C=C/1)c1ccccc1 12450 | CCN(CC)c1ccc(cc1)C(=C1\C=C\C(=[N+](CC)CC)C=C1)c1ccccc1 12450
NC(=N/N=C/c1ccc(o1)[N+](=O)[O-])N 13685      | NC(=N\N=C\c1ccc(o1)[N+](=O)[O-])N 13685
O=c1c2CC=CCc2[nH]c2cc3c(cc12)[nH]c1ccccc1c3=O 13976      | O=c1c2cc3[nH]c4ccccc4c(=O)c3cc2[nH]c2ccccc12 13976
CN(C)CC[n]1c2ccc(Cl)cc2[nH]c2ccccc2c1=O 14396      | CN(C)CCn1c2ccc(Cl)cc2[nH]c2ccccc2c1=O 14396
CN(C)CC[n]1c2ccccc2[nH]c2ccc(Cl)cc2c1=O 14397      | CN(C)CCn1c2ccccc2[nH]c2ccc(Cl)cc2c1=O 14397
CN(C)CC[n]1c2cc(C)ccc2[n](C)c2ccccc2c1=O 14405      | CN(C)CCn1c2cc(C)ccc2n(C)c2ccccc2c1=O 14405
CN(C)CCC[n]1c2ccc(Cl)cc2[nH]c2ccccc2c1=O 14409      | CN(C)CCCn1c2ccc(Cl)cc2[nH]c2ccccc2c1=O 14409
C[C@H]1CCC[C@H]1C 14498      | C[C@@H]1CCC[C@@H]1C 14498
c1ccc2[n](c1)cc[n]1ccccc21 14553      | c1ccc2n(c1)ccn1ccccc21 14553
NC(=N/C(=N/c1ccc2ccccc2c1)/N)N 14623      | NC(=N\C(=N\c1ccc2ccccc2c1)\N)N 14623
tj@ubuntu:~/chord/openbabel$ babel --version
No output file or format spec!
Open Babel 2.2.3 -- Mar  7 2010 -- 03:47:42
Usage: babel [-i<input-type>] <name> [-o<output-type>] <name>
Try  -H option for more information.
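
To list only the records that changed between the two passes, instead of reading the side-by-side diff by eye, the two output files can be compared record by record. The following is a minimal Python sketch, not part of Open Babel. It assumes that each output line is a SMILES string followed by whitespace and an optional title, and that both files contain the same records in the same order (no molecule was skipped in either pass). The function names `parse_can_line` and `roundtrip_mismatches` are made up for this illustration.

```python
def parse_can_line(line):
    """Split one output line into (smiles, title).

    Assumption: the SMILES comes first, followed by whitespace and an
    optional title. The title is returned as "" when it is missing.
    """
    parts = line.strip().split(None, 1)
    smiles = parts[0] if parts else ""
    title = parts[1] if len(parts) > 1 else ""
    return smiles, title


def roundtrip_mismatches(first_pass, second_pass):
    """Return (record_number, title, first, second) for every record
    whose SMILES differs between the two passes.

    Assumption: both inputs hold the same records in the same order.
    """
    mismatches = []
    for number, (a, b) in enumerate(zip(first_pass, second_pass), start=1):
        smi_a, title = parse_can_line(a)
        smi_b, _ = parse_can_line(b)
        if smi_a != smi_b:
            mismatches.append((number, title, smi_a, smi_b))
    return mismatches


if __name__ == "__main__":
    # Record 13197 is taken from the diff above; the "control" record
    # is an invented unchanged entry for illustration only.
    first = ["C[C@@H]1CCCC1C 13197", "CCO control"]
    second = ["CC1CCC[C@H]1C 13197", "CCO control"]
    for number, title, smi_a, smi_b in roundtrip_mismatches(first, second):
        print(number, title, smi_a, smi_b)
```

With the real files, the two arguments could be the lines of cantest.smi and cancan.smi, for example `open("cantest.smi").readlines()`.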


------------------------------------------------------------------------------
This SF.net email is sponsored by Sprint
What will you do first with EVO, the first 4G phone?
Visit sprint.com/first -- http://p.sf.net/sfu/sprint-com-first
_______________________________________________
OpenBabel-discuss mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss

Attachment: cantest.sdf (109K)

Re: cansmi of molfile vs cansmi of cansmi

Craig James-2
On 7/21/10 5:39 PM, TJ O'Donnell wrote:

> In building a database of PubChem and other publicly available structures,
> I'm converting from molfile to isomeric cansmi. In several cases (about
> 7 per thousand) I get one cansmi from the molfile, but a different one when
> I ask for the cansmi of the resulting cansmi. For database integrity
> reasons, cansmi(cansmi) = cansmi should always be true, so this is getting
> me into trouble for these structures.
>
> Sometimes it is double-bond stereo flipping, sometimes an aromaticity
> problem. It also seems to have something to do with symmetric structures.
>
> I've attached an example sdf file with 20 structures, all of which show
> this behaviour. I hope someone can take a look and see if I'm doing
> something odd or how I might correct this.

Hi TJ,

These are known problems, and they remain open mostly for lack of time to fix them.  Tim made some progress recently, but the aromaticity problem (which always involves nitrogen) is still unresolved.

Craig


Re: cansmi of molfile vs cansmi of cansmi

Geoffrey Hutchison
In reply to this post by TJ O'Donnell
> I've attached an example sdf file with 20 structures, all of which show this behaviour.
> I hope someone can take a look and see if I'm doing something odd or how I
> might correct this.

As Craig said, you're not doing anything odd. Breaking ties is tough, and this type of testing really taxes the aromaticity detection as well.

I just tried your file (as well as several others) with the 2.3 trunk, and found that 44/48 cases passed. That still leaves 4 failures:
1. One funny canonicalization bug with Cu phthalocyanine.
2. Two aromaticity failures with fused rings containing multiple nitrogens.
3. One silly case with unneeded '[n]' sequences.

The good news is this -- it's been added as a unit test, so if you find more failures, just send me more files.
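
A round-trip check of this kind can be sketched in a few lines of Python. This is not the unit test that was added to Open Babel, only an illustration of the property being tested: canonicalizing an already canonical SMILES should return it unchanged. The function name `fixed_point_failures` is hypothetical, and the `canonicalize` argument is a placeholder for whatever canonicalizer is under test (for example, a wrapper around Open Babel's Python bindings). The two stand-in functions in the demo are plain string operations with no chemical meaning.

```python
def fixed_point_failures(smiles_list, canonicalize):
    """Return (input, once, twice) for each input where applying
    canonicalize a second time changes the result of the first pass."""
    failures = []
    for smiles in smiles_list:
        once = canonicalize(smiles)
        twice = canonicalize(once)
        if once != twice:
            failures.append((smiles, once, twice))
    return failures


if __name__ == "__main__":
    # Stand-in "canonicalizers" (assumptions for the demo, not chemistry):
    # sorting characters is idempotent, rotating the string is not.
    def sort_chars(s):
        return "".join(sorted(s))

    def rotate(s):
        return s[1:] + s[:1]

    samples = ["CCO", "c1ccccc1"]
    print(fixed_point_failures(samples, sort_chars))
    print(fixed_point_failures(samples, rotate))
```

Keeping the canonicalizer as a parameter means the same check can be run against different versions or toolkits, and any failing input can be sent along as a new test case.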

I can definitely tackle case #2 and case #3 in the near future. I'd need Craig to look at #1, but if we can reduce his test cases, we all profit. :-)
  [Cu].c1ccc2C3=NC(=NC4=NC(=NC5=NC(=NC6=NC(=N3)c3ccccc63)c3ccccc53)c3ccccc43)c2c1
  [Cu].c1ccc2C3=NC4=NC(=NC5=NC(=NC6=NC(=NC(=N3)c2c1)c1ccccc61)c1ccccc51)c1ccccc41

Thanks for the bug report TJ -- please send us any other failures you encounter.
-Geoff