Resources

I use the machine learning algorithms implemented in the WEKA data mining toolkit for most of my experiments.

Whenever a learning algorithm requires parameter estimation, I split the dataset into training and test sets (typically 10 folds, as in 10-fold cross-validation) and use only the training set for parameter estimation. The performance of the algorithm is computed by averaging the predictive measures obtained on the test sets of the 10 folds. This scenario may require splitting the dataset manually, which I do with a small script that I found somewhere on the internet (unfortunately I have lost track of the source, but I am thankful to the original coder).
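For readers without such a script, the splitting step can be sketched roughly as follows. This is not the original script mentioned above, only a minimal illustration in Python; the function name k_fold_splits and its parameters are hypothetical, and it assumes the instances fit in memory as a list. It does not stratify by class label, so class proportions may differ between folds.

```python
import random


def k_fold_splits(instances, k=10, seed=0):
    """Yield (train, test) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; not part of WEKA.
    """
    if not 2 <= k <= len(instances):
        raise ValueError("k must be between 2 and the number of instances")

    # Shuffle a copy with a fixed seed so the split is reproducible.
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)

    # Fold i takes every k-th instance starting at position i,
    # so fold sizes differ by at most one.
    folds = [shuffled[i::k] for i in range(k)]

    for i, test in enumerate(folds):
        # The training set is everything outside the current test fold.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


if __name__ == "__main__":
    data = list(range(25))  # stand-in for real instances
    for fold, (train, test) in enumerate(k_fold_splits(data, k=10)):
        print(f"fold {fold}: {len(train)} training, {len(test)} test instances")
```

Each training list would then be used for parameter estimation and the matching test list for evaluation, with the measures averaged over the k folds.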

Finding data sets that exactly fit the purpose is one of the challenges in designing the experiments. I have been using some of the graph data sets from the following repositories:

These medicinal chemistry datasets, which I have used most in my experiments, contain chemical graphs of molecules that are labelled active or inactive against a pharmacological property.

This dataset from the NCI repository contains about 70,000 compounds that inhibit the growth of different human tumor cells, categorized into 72 overlapping datasets.

More chemoinformatics datasets