- I am using the machine learning algorithms implemented in the WEKA data mining toolkit for most of my experiments.
- Whenever I require parameter estimation for the learning algorithm, I split the dataset into training and test sets (typically 10 folds, in the case of 10-fold cross-validation) and use the training set for parameter estimation. The performance of the algorithm is computed by averaging the predictive measures obtained on the test sets of the 10 folds. This sometimes requires splitting the dataset manually, which I do with a small script I found somewhere on the internet (unfortunately I have lost track of the source, but I am thankful to the original coder).
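The original script is lost, but a minimal sketch of the same idea (shuffle once, partition into k folds, and yield each fold as the test set with the rest as training data) could look like this in Python. This is only an illustration, not the script I actually used:

```python
import random

def kfold_split(instances, k=10, seed=42):
    """Yield (train, test) pairs for k-fold cross-validation.

    Shuffles a copy of `instances` once, partitions it into k folds,
    and for each fold i uses fold i as the test set and the union of
    the remaining folds as the training set.
    """
    rng = random.Random(seed)          # fixed seed so splits are reproducible
    shuffled = list(instances)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]   # round-robin partition
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Each instance lands in exactly one test fold, so averaging a predictive measure over the k test sets gives the usual cross-validated estimate.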
- Finding data sets that fit our purpose exactly is a challenge when designing experiments. I have been using some of the graph data sets from the following repositories:
These medicinal chemistry datasets, which I have used most in my experiments, contain chemical graphs of molecules labelled as active or inactive against a pharmacological property.
This dataset from the NCI repository contains about 70,000 compounds, categorized into 72 (overlapping) datasets of compounds screened for inhibition of the growth of different human tumor cell lines.
More chemoinformatics datasets
- Other data repositories containing data sets that can be used for various types of machine learning tasks.
UCI machine learning repository
http://cheminformatics.org/datasets/index.shtml
http://pele.farmbio.uu.se/qsar-ml/qsarml-datasets.html
http://www.infochimps.com/tags/machine-learning
- Other software and methods which I have either used in my experiments or which belong to the related field (this is more or less for my own follow-up).
SUBDUE and MoFa were among my favorite methods, as well as GraphSig.
Frequent graph mining with gSpan, FSG, and GASTON
Frequent and maximal/closed frequent itemset mining with MAFIA
Constraint programming methods for itemset mining
GraphM is an efficient graph matching tool
GraphGen is a synthetic graph generator