Tanimoto (Tc) calculation

Tanimoto (Tc) calculation

Rob Soe
Hi all,
 
I have two lists of molecules. The first list contains testing molecules (test.sdf) and the second contains training molecules (train.sdf).
 
I would like to compare each test molecule to all the training molecules and calculate the corresponding Tanimoto similarity scores. (I implemented this in my own code, but it is very slow because it is O(n^2).) I use the GetFingerprint() and Tanimoto() functions for this. After the comparison, I pick the k most similar training molecules for each test molecule and use them to predict something for that test molecule. I am trying to make this faster.
 
So I tried the following babel command:
 babel  test_mol.smi  train_mols.sdf -ofpt
But the problem is that I have to manually replace the SMILES of the test molecule every time I run the command. In other words, the command only compares the first molecule in 'test_mol.smi' to all the molecules in 'train_mols.sdf'. Even if 'test_mol.smi' contains a whole list of test molecules, only the first one is compared to the training molecules.
 
Is there a way or trick I can use so that I can compare all my testing molecules to the training molecules and get a list of Tc scores?
 
 
Thanks so much for your help!
Rob
 
 
 
 
 
 

------------------------------------------------------------------------------
ThinkGeek and WIRED's GeekDad team up for the Ultimate
GeekDad Father's Day Giveaway. ONE MASSIVE PRIZE to the
lucky parental unit.  See the prize list and enter to win:
http://p.sf.net/sfu/thinkgeek-promo
_______________________________________________
OpenBabel-discuss mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss

Re: Tanimoto (Tc) calculation

Noel O'Boyle
Administrator
On 24 June 2010 23:37, Rob Soe <[hidden email]> wrote:

> Hi all,
>
> I have two lists of molecules. The first list contains testing molecules
> (test.sdf) and the second contains training molecules (train.sdf).
>
> I would like to compare each test molecule to all the training molecule and
> calculate a corresponding Tanimoto similarity score. (I implemented it in my
> code and it is super slow as it is O(n^2) ). I use GetFingerprint() and
> Tanimoto() functions for such purpose. After the comparison, I picked up the
> kth most similar molecules to the test molecule and predict something for
> the test molecule. I am trying to make things a bit faster.

There is no way to speed this up - if it's O(N^2) it's O(N^2).

> So I tried the following babel command:
>
>  babel  test_mol.smi  train_mols.sdf -ofpt
>
> But the problem is that I have to manually replace each testing molecule
> smile information for every time I run the command. In other words, the
> script will only compare the first molecule in the 'test_mol.smi" to all the
> molecules in 'train_mols.sdf'. So even if I have a list of testing molecules
> in the 'test_mol.smi', the script will not compare them to the training
> molecules except the first one in the list.
>
> Is there a way or trick I can use so that I can compare all my testing
> molecules to the training molecules and get a list of Tc scores?
>
>
> Thanks so much for your help!
> Rob

Re: Tanimoto (Tc) calculation

Igor Filippov

> > I have two lists of molecules. The first list contains testing molecules
> > (test.sdf) and the second contains training molecules (train.sdf).
> >
> > I would like to compare each test molecule to all the training molecule and
> > calculate a corresponding Tanimoto similarity score. (I implemented it in my
> > code and it is super slow as it is O(n^2) ). I use GetFingerprint() and
> > Tanimoto() functions for such purpose. After the comparison, I picked up the
> > kth most similar molecules to the test molecule and predict something for
> > the test molecule. I am trying to make things a bit faster.
>
> There is no way to speed this up - if it's O(N^2) it's O(N^2).
>

There are clever ways to pre-screen your set if you are looking for
molecules above a certain Tanimoto threshold from the other set of
molecules. See for example:
An Intersection Inequality Sharper than the Tanimoto Triangle Inequality
for Efficiently Searching Large Databases
Pierre Baldi and Daniel S. Hirschberg
J. Chem. Inf. Model., 2009, 49 (8), pp 1866–1870
Publication Date (Web): July 14, 2009 (Article)
DOI: 10.1021/ci900133j


If all you want is a list of all Tanimoto coefficients between two sets
then it's indeed O(N^2), I don't see any way around it.
Why would you want such an all-encompassing list though?
It's all how you formulate the problem...
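
As a rough illustration of pre-screening, here is a sketch in Python. It uses the simpler bit-count bound, not the sharper intersection inequality from the paper cited above: the intersection of two fingerprints has at most min(a, b) bits and the union at least max(a, b) bits, so Tc <= min(a, b) / max(a, b), where a and b are the bit counts. Any pair whose bound is already below the threshold can be skipped. The sketch assumes fingerprints are held as plain Python integers, and the function names are made up for this example.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints stored as integers."""
    common = popcount(fp_a & fp_b)
    total = popcount(fp_a | fp_b)
    return common / total if total else 0.0


def search_above_threshold(query, database, threshold):
    """Return (hits, skipped): hits are (index, Tc) with Tc >= threshold.

    Entries whose bit-count bound min(a, b) / max(a, b) is already below
    the threshold are skipped without computing the full Tanimoto.
    """
    a = popcount(query)
    hits = []
    skipped = 0
    for i, fp in enumerate(database):
        b = popcount(fp)  # in practice, pre-compute once per database
        hi, lo = max(a, b), min(a, b)
        if hi == 0 or lo / hi < threshold:
            skipped += 1
            continue
        tc = tanimoto(query, fp)
        if tc >= threshold:
            hits.append((i, tc))
    return hits, skipped
```

How much this saves depends on the threshold and on how widely the bit counts in the database vary; with a low threshold almost nothing is skipped.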

Best regards,
Igor



Re: Tanimoto (Tc) calculation

Chris Morley-3
In reply to this post by Rob Soe
On 24/06/2010 23:37, Rob Soe wrote:

> Hi all,
> I have two lists of molecules. The first list contains testing molecules
> (test.sdf) and the second contains training molecules (train.sdf).
> I would like to compare each test molecule to all the training molecule
> and calculate a corresponding Tanimoto similarity score. (I implemented
> it in my code and it is super slow as it is O(n^2) ). I use
> GetFingerprint() and Tanimoto() functions for such purpose. After the
> comparison, I picked up the kth most similar molecules to the test
> molecule and predict something for the test molecule. I am trying to
> make things a bit faster.
> So I tried the following babel command:
>
>   babel  test_mol.smi  train_mols.sdf -ofpt
>
> But the problem is that I have to manually replace each testing molecule
> smile information for every time I run the command. In other words, the
> script will only compare the first molecule in the 'test_mol.smi" to all
> the molecules in 'train_mols.sdf'. So even if I have a list of testing
> molecules in the 'test_mol.smi', the script will not compare them to the
> training molecules except the first one in the list.
> Is there a way or trick I can use so that I can compare all my testing
> molecules to the training molecules and get a list of Tc scores?
> Thanks so much for your help!

I think you are going to have to use some kind of programming - Python
scripting or even C++ - to do things that the command line doesn't
provide. And working with two sets of molecules is something it does
not currently do.
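
A minimal Python sketch of the scripting route, assuming the pybel bindings are installed. It reads both files, computes fingerprints once per molecule, and keeps the k best training hits for each test molecule. The helper names are invented for this example, and it assumes molecule titles are unique. pybel's readfile(), calcfp() and the "|" operator on fingerprints (which returns the Tanimoto coefficient) are the only Open Babel calls used.

```python
import heapq


def load_fingerprints(path, fmt="sdf"):
    """Read molecules with pybel and return a list of (title, fingerprint)."""
    try:
        from openbabel import pybel  # Open Babel 3.x layout
    except ImportError:
        import pybel  # Open Babel 2.x layout
    return [(mol.title, mol.calcfp()) for mol in pybel.readfile(fmt, path)]


def pybel_tanimoto(fp_a, fp_b):
    """pybel fingerprints return the Tanimoto coefficient for `a | b`."""
    return fp_a | fp_b


def k_most_similar(query_fp, train_fps, k, similarity=pybel_tanimoto):
    """Return the k (Tc, title) pairs with the largest Tc, best first."""
    scored = ((similarity(query_fp, fp), title) for title, fp in train_fps)
    return heapq.nlargest(k, scored)


def all_neighbours(test_fps, train_fps, k, similarity=pybel_tanimoto):
    """Map each test title to its k nearest training molecules.

    Assumes titles are unique; duplicate titles would overwrite each other.
    """
    return {
        title: k_most_similar(fp, train_fps, k, similarity)
        for title, fp in test_fps
    }
```

Typical use would be something like
   all_neighbours(load_fingerprints("test.sdf"), load_fingerprints("train.sdf"), k=10)
This is still one Tanimoto per test/training pair, but the fingerprints are computed only once and only the k best scores per test molecule are kept.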

If you are interested in only the closest matches to a test molecule,
the following may be a cleaner way of doing it:
Make a fast search index from train.sdf
   babel train.sdf -ofs
For a given test molecule, find, say, the 10 training molecules with
the largest Tanimotos with it
   babel train.fs -S test_mol.sdf -aat10 result.smi
The second 'a' adds the Tanimoto to the result molecule's title.

You will still have to replace the test molecule or its file name for
each search. If there are not too many, you can split test.sdf
   babel test.sdf test_mol.sdf -m
which will put each molecule into a different file. Maybe you could then
iterate using shell or batch scripting.
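
The iteration could also be driven from Python rather than a shell loop. The sketch below only assembles and runs the fast-search command quoted above, once per split file. It assumes the index was built from the training set (train.fs here), that babel is on the PATH, and that -m wrote numbered files such as test_mol1.sdf, test_mol2.sdf. The function names and the "_hits.smi" output naming are made up for this example.

```python
import glob
import subprocess


def build_search_command(index_file, query_file, n_hits, out_file):
    """Assemble the babel fast-search command line quoted in this thread."""
    return ["babel", index_file, "-S", query_file,
            "-aat{}".format(n_hits), out_file]


def search_all(index_file, query_pattern, n_hits=10, run=subprocess.call):
    """Run one fast search per query file; return {query: output file}."""
    outputs = {}
    for query_file in sorted(glob.glob(query_pattern)):
        # Hypothetical naming scheme: test_mol1.sdf -> test_mol1_hits.smi
        out_file = query_file.rsplit(".", 1)[0] + "_hits.smi"
        run(build_search_command(index_file, query_file, n_hits, out_file))
        outputs[query_file] = out_file
    return outputs
```

For example
   search_all("train.fs", "test_mol*.sdf", n_hits=10)
would leave one result file per test molecule, each holding its closest training molecules with the Tanimoto in the title.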

This isn't any help to you, but it happens that I am currently writing
code to prepare an N by N matrix of Tanimoto coefficients (on the way
to selecting a diverse set of molecules). As others have said, in
making this matrix you can't beat O(N^2) but you can make each Tanimoto
calculation faster. Normally each Tanimoto requires two bit counts
(of the intersection and of the union), but only the intersection
count is needed if the bit count of each fingerprint is known, since
the union count is then the sum of the two counts minus the
intersection. So these are pre-calculated, which is O(N). I'm also
using a 16 bit look-up table for these bit counts, which is faster
than the way currently in OB.
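
The idea can be sketched in Python as follows. This is an illustration only, not the C++ code described above: it assumes fingerprints are held as Python integers, and the names BITS16, popcount and tanimoto_matrix are invented for the example. Per-fingerprint bit counts are computed once, so the inner loop needs a single bit count (of the intersection), done 16 bits at a time through a look-up table.

```python
# 16-bit look-up table: BITS16[v] is the number of set bits in v.
BITS16 = [0] * 65536
for v in range(1, 65536):
    BITS16[v] = BITS16[v >> 1] + (v & 1)


def popcount(x):
    """Count set bits of a non-negative integer, 16 bits at a time."""
    count = 0
    while x:
        count += BITS16[x & 0xFFFF]
        x >>= 16
    return count


def tanimoto_matrix(fingerprints):
    """N x N Tanimoto matrix for fingerprints stored as integers."""
    counts = [popcount(fp) for fp in fingerprints]  # O(N), done once
    n = len(fingerprints)
    matrix = [[1.0] * n for _ in range(n)]  # diagonal set to 1.0
    for i in range(n):
        for j in range(i + 1, n):
            # Only the intersection needs counting inside the loop.
            common = popcount(fingerprints[i] & fingerprints[j])
            union = counts[i] + counts[j] - common
            tc = common / union if union else 0.0
            matrix[i][j] = matrix[j][i] = tc
    return matrix
```

The matrix is symmetric, so only the upper triangle is computed and then mirrored.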

Chris



Re: Tanimoto (Tc) calculation

Rob Soe
In reply to this post by Igor Filippov
Dear all,
Thanks a lot for the suggestions. I am trying the FastSearch class now and hope it will be a bit faster.
Rob

On Fri, Jun 25, 2010 at 8:49 AM, Igor Filippov <[hidden email]> wrote:

> > I have two lists of molecules. The first list contains testing molecules
> > (test.sdf) and the second contains training molecules (train.sdf).
> >
> > I would like to compare each test molecule to all the training molecule and
> > calculate a corresponding Tanimoto similarity score. (I implemented it in my
> > code and it is super slow as it is O(n^2) ). I use GetFingerprint() and
> > Tanimoto() functions for such purpose. After the comparison, I picked up the
> > kth most similar molecules to the test molecule and predict something for
> > the test molecule. I am trying to make things a bit faster.
>
> There is no way to speed this up - if it's O(N^2) it's O(N^2).
>

> There are clever ways to pre-screen your set if you are looking for
> molecules above a certain Tanimoto threshold from the other set of
> molecules. See for example:
> An Intersection Inequality Sharper than the Tanimoto Triangle Inequality
> for Efficiently Searching Large Databases
> Pierre Baldi and Daniel S. Hirschberg
> J. Chem. Inf. Model., 2009, 49 (8), pp 1866–1870
> Publication Date (Web): July 14, 2009 (Article)
> DOI: 10.1021/ci900133j
>
>
> If all you want is a list of all Tanimoto coefficients between two sets
> then it's indeed O(N^2), I don't see any way around it.
> Why would you want such an all-encompassing list though?
> It's all how you formulate the problem...
>
> Best regards,
> Igor


