how to use the new ECFP<n> fingerprints?

how to use the new ECFP<n> fingerprints?

Andrew Dalke
I'm trying to understand how to use the ECFP fingerprints added in Open Babel 2.4.

They generate fingerprints where the length is a function of the number of heavy atoms. I want a fixed length binary fingerprint so I can compare two fingerprints using the normal binary Tanimoto. (This is for chemfp.) It doesn't seem possible. It seems like what's missing is something like a hash method to turn these per-atom values into something that fits in with how the other fingerprints work.


Here's an example which shows how the length varies:

>>> import openbabel as ob
>>> import pybel
>>>
>>> mol1 = pybel.readstring("smi", "C").OBMol
>>> mol3 = pybel.readstring("smi", "CCC").OBMol
>>> mol9 = pybel.readstring("smi", "C"*9).OBMol
>>> mol_info = [("mol1", mol1), ("mol3", mol3), ("mol9", mol9)]
>>>
>>> fptype = ob.OBFingerprint.FindFingerprint("ECFP0")
>>>
>>> # Show that the size is a function of length
... for name, mol in mol_info:
...     fp = ob.vectorUnsignedInt()
...     fptype.GetFingerprint(mol, fp)
...     print("%s: %r" % (name, list(fp)))
...
True
mol1: [1526808443]
True
mol3: [3405580958, 3405580958, 3756301279]
True
mol9: [3405580958, 3405580958, 3756301279, 3756301279, 3756301279, 3756301279, 3756301279, 3756301279, 3756301279]

I understand what it's showing. Each heavy atom has its own value in the list, and the list is sorted to give a canonical ordering.

More specifically, [CH4] generates the characteristic value 1526808443, each of the two "[CH3]-" atoms generates the characteristic value 3405580958, and each "-[CH2]-" generates the characteristic value 3756301279.


However, the non-ECFP fingerprints all generate a constant size, and parts of Open Babel will break with a variable length size.

For example, the fast search indexing in fingerprint.cpp:322, FastSearchIndex::Add(), assumes that the fingerprint vector returned from GetFingerprint() always has the same size, where headwords = vectors.size(). If you try to generate a .fs file using "-ofs -xfECFP0" then the file will be created, but the similarity search will fail with "Difficulty reading from index".
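To see why a fixed-width index cannot hold variable-length fingerprints, here is a toy sketch in plain Python. It is not Open Babel's actual .fs layout; `pack_index` and `unpack_index` are hypothetical helpers that only mimic the idea of taking the record width from the first fingerprint.

```python
import struct

def pack_index(fingerprints):
    """Concatenate fingerprints, taking the record width from the first one."""
    words_per_entry = len(fingerprints[0])
    data = b"".join(struct.pack("<%dI" % len(fp), *fp) for fp in fingerprints)
    return words_per_entry, data

def unpack_index(words_per_entry, data):
    """Read the data back as fixed-width records of words_per_entry words."""
    record_size = 4 * words_per_entry
    if len(data) % record_size != 0:
        raise ValueError("index size is not a multiple of the record size")
    fmt = "<%dI" % words_per_entry
    return [list(struct.unpack_from(fmt, data, offset))
            for offset in range(0, len(data), record_size)]

# Two fingerprints go in, but the reader sees four one-word records.
width, data = pack_index([[1], [2, 3, 4]])
print(unpack_index(width, data))  # [[1], [2], [3], [4]]
```

Writing succeeds either way; the mismatch only shows up when the data is read back, which fits the symptom described above.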


Is there an Open Babel function to compare two of these variable-length fingerprints? It looks like a count-based Tanimoto is needed, so mol3 and mol9 have a similarity of (2+1)/(2+7) = 3/9 = 1/3.
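The count-based comparison described above can be sketched in a few lines of plain Python. This is only an illustration, not an Open Babel function; the name `count_tanimoto` is made up here, and each fingerprint is treated as a multiset of feature values.

```python
from collections import Counter

def count_tanimoto(fp1, fp2):
    """Count-based (multiset) Tanimoto between two lists of feature values."""
    c1, c2 = Counter(fp1), Counter(fp2)
    # "&" keeps the minimum count per feature, "|" keeps the maximum.
    intersection = sum((c1 & c2).values())
    union = sum((c1 | c2).values())
    return intersection / float(union) if union else 0.0

# ECFP0 values copied from the output above
mol3_fp = [3405580958, 3405580958, 3756301279]
mol9_fp = [3405580958, 3405580958] + [3756301279] * 7

print(count_tanimoto(mol3_fp, mol9_fp))  # (2+1)/(2+7) = 1/3
```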

Is there any way to turn this into a useful fixed-length fingerprint? I tried to generate FPS output using "-ofps -xfECFP0" but the fingerprint content was empty.


I could zero-pad small fingerprints, but it's not really possible to compare, say, "C", "O", and "CO" as the corresponding values of [X, 0], [Y, 0], and either [X, Y] or [Y, X] won't give the right comparison scores.
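A small sketch of the zero-padding problem, using made-up values for X and Y (they are placeholders, not real ECFP values) and a hypothetical helper that compares padded lists position by position:

```python
def positional_matches(fp1, fp2):
    """Count positions where both padded lists hold the same non-zero value."""
    return sum(1 for a, b in zip(fp1, fp2) if a == b and a != 0)

X, Y = 10, 20            # placeholder feature values for the C and O atoms
fp_C = [X, 0]            # "C", zero-padded to two words
fp_O = [Y, 0]            # "O", zero-padded to two words
fp_CO = sorted([X, Y])   # "CO": sorted, so X lands first with these values

print(positional_matches(fp_C, fp_CO))  # 1: X lines up in position 0
print(positional_matches(fp_O, fp_CO))  # 0: Y sits in position 1 of fp_CO
```

Both "C" and "O" share exactly one feature with "CO", yet only one of them scores a match, and which one depends on how the two values happen to sort.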

The current folding method also isn't really useful for larger fingerprints. There's an nBits parameter of GetFingerprint():


>>> for name, mol in mol_info:
...     fp = ob.vectorUnsignedInt()
...     fptype.GetFingerprint(mol, fp, 128)
...     print("%s (fold 128): %r" % (name, list(fp)))
...
True
mol1 (fold 128): [1526808443]
True
mol3 (fold 128): [3405580958, 3405580958, 3756301279]
True
mol9 (fold 128): [3757939679, 3757939679, 3756301279, 3756301279]

The underlying code in fingerecfp.cpp implements this by calling Fold(). However, if you want to be able to compare the post-folded fingerprints, then this only works if the initial positions are globally invariant for the given characteristic. But in the ECFP case the initial position depends on the other features in the molecule, because the fingerprints are sorted.
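The position dependence can be shown with a toy model of folding. The sketch below assumes folding ORs the upper half of the word list onto the lower half; `fold_once` is a simplified stand-in written for this illustration (even-length lists only), not the Fold() code itself.

```python
def fold_once(words):
    """OR the upper half of an even-length word list onto the lower half."""
    half = len(words) // 2
    return [lo | hi for lo, hi in zip(words[:half], words[half:])]

def shared_bits(words1, words2):
    """Number of bits set at the same position in both word lists."""
    return sum(bin(a & b).count("1") for a, b in zip(words1, words2))

# OR-ing the two ECFP0 values seen earlier gives the first folded word of mol9.
print(3405580958 | 3756301279)  # 3757939679

# Toy one-bit "features": the two molecules share 2, 4 and 8.
mol_a = sorted([1, 2, 4, 8])
mol_b = sorted([2, 4, 8, 16])
print(fold_once(mol_a), fold_once(mol_b))               # [5, 10] [10, 20]
print(shared_bits(fold_once(mol_a), fold_once(mol_b)))  # 0
```

Three of the four features are common to both toy molecules, but because sorting shifts them to different word positions, the folded fingerprints have no bits in common.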



(Also, there's a bug where nBits has no effect until the fingerprint is at least twice as long as that value:

>>> for i in range(4, 10):
...   mol = pybel.readstring("smi", "C"*i).OBMol
...   fp = ob.vectorUnsignedInt()
...   fptype.GetFingerprint(mol, fp, 128)
...   print("%s: %r" % (i, list(fp)))
...
True
4: [4293393407, 3757939679, 4026359518, 4286507510]
True
5: [4293393407, 4294433791, 4026359518, 3942468319, 4293835775]
True
6: [4293393407, 4294433791, 4294433791, 3942472702, 4026359518, 4293835775]
True
7: [4293393407, 4294433791, 4294433791, 4294433791, 4026359518, 3942468319, 4293835775]
True
8: [4294441983, 4294959103, 4294967295, 4294966271]
True
9: [4294441983, 4294433791, 4294959103, 4294958079]
)

To make a long email short, it feels like there should be an entirely different function than folding to turn these lists of per-atom ECFP values into the type of fingerprint that the rest of Open Babel (and of chemfp) can use.
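One very simple form such a function could take is sketched below: each value sets the bit at `value % nbits`, so the position depends only on the value and not on the rest of the molecule. This is an illustrative assumption, not code from Open Babel, RDKit, or chemfp; the names `to_dense` and `binary_tanimoto` are hypothetical, and the bit vector is stored as a Python integer.

```python
def to_dense(values, nbits=1024):
    """Set one bit per feature value; the position depends only on the value."""
    fp = 0
    for value in values:
        fp |= 1 << (value % nbits)
    return fp

def binary_tanimoto(fp1, fp2):
    """Binary Tanimoto between two bit vectors stored as integers."""
    union = bin(fp1 | fp2).count("1")
    return bin(fp1 & fp2).count("1") / float(union) if union else 0.0

# ECFP0 values copied from the first example
mol1_fp = [1526808443]
mol3_fp = [3405580958, 3405580958, 3756301279]
mol9_fp = [3405580958, 3405580958] + [3756301279] * 7

print(binary_tanimoto(to_dense(mol3_fp), to_dense(mol9_fp)))  # 1.0
print(binary_tanimoto(to_dense(mol1_fp), to_dense(mol3_fp)))  # 0.0
```

Note that mol3 and mol9 come out identical here because a binary fingerprint discards the per-feature counts.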



------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
OpenBabel-discuss mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss

Re: how to use the new ECFP<n> fingerprints?

Noel O'Boyle
Administrator
The ECFP is new to Open Babel and hasn't been sorted out properly.
Geoff's been on to me to look into it, but it's way down my list at
the moment. So in short, I agree, and encourage a prospective user to
step up and look into it.

On 29 March 2017 at 04:53, Andrew Dalke <[hidden email]> wrote:

> [...]


Re: how to use the new ECFP<n> fingerprints?

Andrew Dalke
On Mar 29, 2017, at 10:30, Noel O'Boyle <[hidden email]> wrote:
> The ECFP is new to Open Babel and hasn't been sorted out properly.
> Geoff's been on to me to look into it, but it's way down my list at
> the moment. So in short, I agree, and encourage a prospective user to
> step up and look into it.

Thanks Noel.

I can work on some of that, that is, a sparse->dense function, perhaps along the lines of what the RDKit does.

There are a few things I don't know how to do, and would like advice:

  1) I don't know how to pass in configuration information with the current API

The equivalent RDKit code takes in the bit length and the number of bits per hash, with default values. In Open Babel, the current API expects callers to pass in an nBits of 0 to get the default size. I can't change that without breaking backwards compatibility, which I fully agree is a no-go.

I can change the code so that passing in nBits=0 generates (say) a 1024-bit fingerprint.

However, there would then be no way to get the list of values, which the current ECFP function returns, unless I do something like use "nBits=-1" (or "nBits=1") as a special flag.

There's also no way to pass in the number of bits per hash. For that I can use a default value.

Any suggestions?

  2) I have no good way to evaluate the length and density values.

I could make a guess on a decent length and density, or perhaps copy RDKit's algorithm directly and use its defaults.

Better would be if I explore a bit of parameter space, like 512, 1024, 2048 bits and 4-8 bits per value.

But I have no data sets which I could use in that evaluation.

Noel, as a co-author of "Comparing structural fingerprints using a literature-based similarity benchmark", do you have any recommendations for how I can do an evaluation?
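As a rough idea of what exploring that parameter space might look like, the sketch below measures bit density for a few lengths and bits-per-value settings. Everything in it is an assumption made for illustration: the feature values are random stand-ins rather than real ECFP output, `to_dense` and `density` are hypothetical helpers, and Python's built-in `random.Random` is used only because it is convenient. Density alone says nothing about retrieval quality, which still needs a real benchmark.

```python
import random

def to_dense(values, nbits, bits_per_value):
    """Seed a PRNG with each distinct value and set bits_per_value bits."""
    fp = 0
    for value in set(values):
        rng = random.Random(value)
        for _ in range(bits_per_value):
            fp |= 1 << rng.randrange(nbits)
    return fp

def density(fp, nbits):
    """Fraction of the nbits positions that are set."""
    return bin(fp).count("1") / float(nbits)

# Stand-in data: 40 random 32-bit values in place of one molecule's features.
source = random.Random(12345)
features = [source.getrandbits(32) for _ in range(40)]

for nbits in (512, 1024, 2048):
    for bits_per_value in (4, 6, 8):
        fp = to_dense(features, nbits, bits_per_value)
        print("%4d bits, %d per value: density %.3f"
              % (nbits, bits_per_value, density(fp, nbits)))
```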


A different option is that I can put the sparse->dense code into chemfp, where I can more easily control the parameters, label it "experimental" (which, I've found out, doesn't prevent people from using it), and get some feedback from that, which might inform future Open Babel development.


Cheers,
                                Andrew
                                [hidden email]




Re: how to use the new ECFP<n> fingerprints?

Andrew Dalke
On Mar 29, 2017, at 12:47, Andrew Dalke <[hidden email]> wrote:
> But I have no data sets which I could use in that evaluation.
  ...
> A different option is that I can put the sparse->dense code into chemfp, where I can more easily control the parameters, label it "experimental" (which, I've found out, doesn't prevent people from using it), and get some feedback from that, which might inform future Open Babel development.

Yet another option is that I can develop a stand-alone program to generate the ECFP<n> fingerprints as an FPS file, and hope that someone here is interested enough in doing an evaluation and/or interested in providing feedback on my algorithm.

(The only unusual thing about the algorithm is that I decided to use xorshift128+ as the PRNG. RDKit uses a Mersenne Twister, but that has a large initialization overhead, and the sparse->dense algorithm does a *lot* of PRNG initialization. Using a MT adds 30% overhead.)
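For readers unfamiliar with xorshift128+, here is a minimal pure-Python sketch of the generator. It is not taken from the script mentioned in this post: the shift constants (23, 18, 5) are one published variant, the SplitMix64 seeding is an assumption, and the class name is made up. It does show why seeding is cheap, since the whole state is just two 64-bit words.

```python
MASK64 = (1 << 64) - 1

def splitmix64(x):
    """One SplitMix64 step; returns (new_state, output)."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    z = x
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return x, z ^ (z >> 31)

class XorShift128Plus(object):
    def __init__(self, seed):
        # Expand the seed into the two 64-bit state words.
        x, self.s0 = splitmix64(seed & MASK64)
        x, self.s1 = splitmix64(x)
        if self.s0 == 0 and self.s1 == 0:
            self.s1 = 1  # the all-zero state would only ever produce zeros

    def next(self):
        """Return the next 64-bit output and advance the state."""
        s1, s0 = self.s0, self.s1
        result = (s0 + s1) & MASK64
        self.s0 = s0
        s1 ^= (s1 << 23) & MASK64
        self.s1 = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5)
        return result

# Pick four bit positions in a 1024-bit fingerprint for one feature value.
rng = XorShift128Plus(3405580958)
positions = [rng.next() % 1024 for _ in range(4)]
print(positions)
```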

I've placed a copy of said stand-alone program at

  http://dalkescientific.com/obecfp2fps.py

It requires Python 2.7 and Open Babel >=2.4.



                                Andrew
                                [hidden email]




Re: how to use the new ECFP<n> fingerprints?

Noel O'Boyle
Administrator
(This discussion should really be on the dev list.)

I just want to note in passing that I would consider 4096 as a bare
minimum for use of RDKit's fingerprint and then only if there is a
diskspace or memory constraint.  For my own work, I use 16384.

Also, if the API needs changing to enable reasonable functionality
now's the time to discuss it. The best place to propose and discuss
such a change is over at
https://github.com/openbabel/enhancement-proposals.

- Noel

On 30 March 2017 at 12:28, Andrew Dalke <[hidden email]> wrote:

> [...]


Re: how to use the new ECFP<n> fingerprints?

Andrew Dalke
On Mar 30, 2017, at 13:45, Noel O'Boyle <[hidden email]> wrote:
> (This discussion should really be on the dev list.)

I thought my user hat was firmly in place in the last post. I'm trying to figure out how to use the new Open Babel fingerprints for the upcoming chemfp release(s). I suspect others here might be interested in using OB's ECFP<n> fingerprints as boolean fingerprints, even without changes to Open Babel.

I hope that experience will feed into future OB development, and agree that dev is a better place for that discussion.

> https://github.com/openbabel/enhancement-proposals.

By hook or by crook the world is going to force me to learn git... or to figure out how to use http://hg-git.github.io/ . :)

                                Andrew
                                [hidden email]

