[Open Babel] More on docs

drc-2
Hi,

Is there a full description/definition of the fingerprints anywhere?

Chris



-------------------------------------------------------
This SF.Net email is sponsored by the JBoss Inc.  Get Certified Today
Register for a JBoss Training Course.  Free Certification Exam
for All Training Attendees Through End of 2005. For more info visit:
http://ads.osdn.com/?ad_id=7628&alloc_id=16845&op=click
_______________________________________________
OpenBabel-discuss mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss
Re: [Open Babel] More on docs

Chris Morley-3
[hidden email] wrote:
  > Is there a full description/definition of the
  > fingerprints anywhere?

I'm not sure what exactly you are after. There are general
articles on fingerprints and their use at:
http://www.daylight.com/dayhtml/doc/theory/theory.finger.html
http://www.mesaac.com/Fingerprint.htm

The original OpenBabel fingerprint was written by Fabien
Fontain and is not currently implemented - its concept has
been recoded and extended in FP2. Here is his original
description, which mostly applies to FP2 as well.

The fingerprint computation requires the execution of the
following steps:
- Obtain the fragments
- Remove duplicate fragments and get a hash number for
each fragment
- Obtain the fingerprint of the fragments

The fragments are obtained by means of a recursive
algorithm. The algorithm finds all linear fragments up to
a size of seven atoms. Cyclic fragments are also
identified by checking if there are ring closures in
the linear fragment. However, the algorithm does not
identify branched fragments if they are not part of a ring.
For example, CC(C)C will have the fragments:
C
CC
CCC
and C1C(C)C1 will have the fragments:
C
CC
CCC
C1CC1
CC1CC1
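
The linear part of this enumeration can be sketched in a few
lines. This is only an illustration, not the OpenBabel code:
ToyMolecule, LinearFragments and the plain element-string
output are invented for the example, bond types are ignored,
and ring closures are skipped rather than recorded.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy molecule: element symbols plus an adjacency list.
struct ToyMolecule {
    std::vector<std::string> symbols;
    std::vector<std::vector<std::size_t>> neighbours;
};

// Record the current path, then extend it recursively one atom at a time.
inline void ExtendPath(const ToyMolecule& mol, std::vector<std::size_t>& path,
                       std::vector<bool>& onPath, std::size_t maxAtoms,
                       std::set<std::string>& out) {
    std::string text;
    for (std::size_t atom : path) text += mol.symbols[atom];
    out.insert(text);  // std::set keeps one copy of each fragment string
    if (path.size() >= maxAtoms) return;
    for (std::size_t next : mol.neighbours[path.back()]) {
        if (onPath[next]) continue;  // ring closure: skipped in this sketch
        onPath[next] = true;
        path.push_back(next);
        ExtendPath(mol, path, onPath, maxAtoms, out);
        path.pop_back();
        onPath[next] = false;
    }
}

// Collect all unbranched paths of up to maxAtoms atoms, starting from every atom.
inline std::set<std::string> LinearFragments(const ToyMolecule& mol,
                                             std::size_t maxAtoms = 7) {
    assert(mol.symbols.size() == mol.neighbours.size());
    std::set<std::string> out;
    for (std::size_t start = 0; start < mol.symbols.size(); ++start) {
        std::vector<std::size_t> path{start};
        std::vector<bool> onPath(mol.symbols.size(), false);
        onPath[start] = true;
        ExtendPath(mol, path, onPath, maxAtoms, out);
    }
    return out;
}
```

For CC(C)C (atom 1 bonded to atoms 0, 2 and 3) the longest
unbranched path has three atoms, so the sketch yields the
same C, CC, CCC list as above.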

Duplicate fragments, i.e. fragments generated by the same
atoms, are removed from the fragment list. The only
fragments that are kept are the ones which produce the
lowest hash numbers. The hash number for a fragment is an
integer which is generated from the atoms and bonds
of the fragment. The value of this number depends on the
position, the atomic number and type (aromatic or not) of
the atoms of the fragment, the position and the type of
the bonds of the fragment, and the position and the type of
the ring closures.

The fingerprint is a bitstring of 1021 bits. It is evident
that there are many more possible fragments than 1021, and
consequently the hash number is often higher than 1021. It
is easy to obtain a number lower than 1021 from the hash
number, just by dividing the hash number by 1021 and
keeping the remainder. This remainder is then used to set a
bit to one in the fingerprint. Actually, the size of the
bitstring is 1024 bits, but we divide by 1021 because
1021 is a prime number, which produces a better hashing
than an even number such as 1024. Consequently, the last
three bits of the fingerprint are always set to zero.
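
As a rough sketch of that folding step (the names Fingerprint,
kHashModulus and SetFragmentBit are made up for illustration,
and the fragment hash is assumed to be computed elsewhere):

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stored width of the fingerprint and the prime used to fold hashes into it.
constexpr std::size_t kFingerprintBits = 1024;
constexpr std::uint32_t kHashModulus = 1021;

using Fingerprint = std::bitset<kFingerprintBits>;

// Reduce an arbitrary fragment hash to the range 0..1020 and set that bit.
inline void SetFragmentBit(Fingerprint& fp, std::uint32_t fragmentHash) {
    const std::uint32_t bit = fragmentHash % kHashModulus;
    assert(bit < kHashModulus);  // so bits 1021, 1022 and 1023 are never set
    fp.set(bit);
}
```

Different fragments can land on the same bit; that loss of
information is accepted in exchange for a fixed-size bitstring.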

In OB there are currently three fingerprint types available:

FP2  - description from finger2.cpp
Similar to Fabien Fontain's fingerprint class, with a
slightly improved algorithm, but re-written using STL
which makes it shorter.

A molecule structure is analysed to identify linear
fragments of length from one to Max_Fragment_Size = 7
atoms, but single-atom fragments of C, N, and O are ignored.
A fragment is terminated when the atoms form a ring.

For each of these fragments the atoms, bonding and whether
they constitute a complete ring is recorded and saved in a
std::set so that there is only one of each fragment type.
Chemically identical versions, i.e. ones with the atoms
listed in reverse order and rings listed starting at
different atoms, are identified and only a single
canonical fragment is retained.
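
One simple way to get that effect for linear fragments is
sketched below. It is not taken from finger2.cpp: the integer
encoding of a fragment and the names Fragment, CanonicalLinear
and AddFragment are assumptions, and the rotation of rings to
a common starting atom is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// A linear fragment as a sequence of integer codes, for example atomic
// numbers interleaved with bond orders (this encoding is an assumption).
using Fragment = std::vector<int>;

// Return whichever of the fragment and its reverse sorts first, so that
// both reading directions map to the same representative.
inline Fragment CanonicalLinear(const Fragment& fragment) {
    assert(!fragment.empty());
    Fragment reversed(fragment.rbegin(), fragment.rend());
    return std::min(fragment, reversed);
}

// Insert the canonical form; std::set silently drops repeats.
inline void AddFragment(std::set<Fragment>& fragments, const Fragment& fragment) {
    fragments.insert(CanonicalLinear(fragment));
}
```

With this, C-O-N and N-O-C end up as a single entry in the set.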

Each remaining fragment is assigned a hash number from 0
to 1020 which is used to set a bit in a 1024 bit vector.

FP3 - description from finger3.cpp
A bit is set when there is a match to one of a list
of SMARTS patterns in the file (default is patterns.txt).
Looks for this file first in the folder in the environment
variable BABEL_DATADIR, then in the folder specified by
the macro BABEL_DATADIR (probably set during compilation
in babelconfig.h), and then in the current folder.

On each line there is a SMARTS string and anything
after a space is ignored. Lines starting with # are ignored.
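
The line format just described can be read with something
like the following. This is a sketch of the stated rules only,
not the code in finger3.cpp; ReadSmartsPatterns is a made-up
name and blank lines are skipped as an extra assumption.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read SMARTS strings from a patterns.txt-style stream: one per line,
// text after the first space is dropped, lines starting with '#' are skipped.
inline std::vector<std::string> ReadSmartsPatterns(std::istream& in) {
    std::vector<std::string> patterns;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;
        // find() returns npos when there is no space, so substr keeps the whole line.
        const std::string smarts = line.substr(0, line.find(' '));
        assert(smarts.find(' ') == std::string::npos);
        if (!smarts.empty()) patterns.push_back(smarts);
    }
    return patterns;
}
```

Each string returned would then correspond to one bit of the
fingerprint, set when the pattern matches the molecule.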

Additional fingerprint types using patterns in different
files can be made by just declaring separate instances like:
\code
PatternFP myPatternFP("myFP", "myPatternFile.txt");
\endcode

The file patterns.txt currently has a very incomplete set of
SMARTS patterns based on the chemical types used by checkmol
http://merian.pch.univie.ac.at/~nhaider/cheminf/fgtable.pdf
Until this is completed (if it ever is) FP3 is not much
use. But riding to the rescue (I am very pleased to say) is...

FP4 - recently contributed by Christian Laggner from work
he had done for Inte:Ligand (http://www.inteligand.com/).
He says:
"In this list, I tried to describe functional groups from
the viewpoint of an organic chemist.
* The patterns are very specific (e.g. find amines but not
aminals, amides but no imides and urea,...) and thus
sometimes quite long and complicated. I don't actually
know whether this is a good thing for 2D fingerprint
indexing, which should be *fast*. So maybe one would need
a smaller, more general list of functional groups (maybe a
slow "big set" and a simpler "fast set")?
* The list is not fully finished (you'll find some
commentaries for further ideas in the text)."
There is more description in the data file:
data/SMARTS_InteLigand_051110.txt
The format of the file is different from patterns.txt but
the code will read both.
Other than to check that it is functional (I hope), there
has been no testing of this fingerprint in OB yet, but
Christian and I would be very pleased if anybody
could give it a try on their type of problem and share
their experiences. The recent snapshot of OpenBabel
v2.0.0rc1 is ready to be built in UNIX (or there is a
ready-compiled version for Windows). The fastsearch format
does substructure and similarity searching and is easy to
use. The fingerprint format does lightweight display of
fingerprints and Tanimoto coefficients (see recent posts).
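
For reference, the Tanimoto coefficient of two bit
fingerprints is the number of bits set in both divided by the
number of bits set in either. A minimal sketch, assuming
1024-bit fingerprints held in std::bitset (the function name
and the choice of 0 for two empty fingerprints are
illustrative, not OpenBabel's API):

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Tanimoto coefficient: |A and B| / |A or B| over the set bits.
inline double Tanimoto(const std::bitset<1024>& a, const std::bitset<1024>& b) {
    const std::size_t common = (a & b).count();
    const std::size_t either = (a | b).count();
    assert(common <= either);
    if (either == 0) return 0.0;  // convention chosen here for two empty fingerprints
    return static_cast<double>(common) / static_cast<double>(either);
}
```

Two fingerprints with bits {1, 2, 3} and {2, 3, 4} share two
of four distinct bits, giving a coefficient of 0.5.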

Fingerprints aren't necessarily the only way that these
SMARTS lists could be exploited in OpenBabel. I don't know
much about the subject, but it seems to me that there
might be other interfaces which would make better use
than fingerprints do of the intelligence put into the
construction of these lists. Any suggestions?

Chris



