• I use the machine learning algorithms implemented in the WEKA data mining toolkit for most of my experiments.

 

  • Whenever a learning algorithm requires parameter estimation, I split the dataset into training and test sets (typically 10 folds, as in 10-fold cross-validation) and use only the training set of each fold for parameter estimation. The performance of the algorithm is then computed by averaging the predictive measures obtained on the test sets of the 10 folds. This scenario may require splitting the dataset manually, which I do with a small script that I found somewhere on the internet (unfortunately I have lost track of the source, but I am thankful to the original coder).
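
The manual split described above can be sketched roughly as follows. This is a minimal illustration in Python, not the original script: the function names `k_fold_indices` and `cross_validate`, the fixed seed, and the `evaluate` callback are all hypothetical, and the split is a plain shuffled one (not stratified by class).

```python
import random
from statistics import mean


def k_fold_indices(n_instances, k=10, seed=42):
    """Return k (train, test) pairs of instance indices (hypothetical helper)."""
    indices = list(range(n_instances))
    # Shuffle with a fixed seed so the same split can be reproduced later.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        # The training set is everything outside the current test fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def cross_validate(n_instances, evaluate, k=10, seed=42):
    """Average a per-fold score over the k test folds.

    `evaluate(train, test)` is an assumed callback: it should estimate
    parameters on `train` only and return a score measured on `test`.
    """
    splits = k_fold_indices(n_instances, k, seed)
    return mean(evaluate(train, test) for train, test in splits)
```

Each `(train, test)` pair could be written out as a pair of files for the learner, with parameters tuned on the training part only and the reported figure being the mean over the 10 test folds.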

 

  • Finding data sets that exactly fit the purpose is one of the challenges in designing experiments. I have been using some of the graph data sets from the following repositories:

 

These medicinal chemistry datasets, which I have used most in my experiments, contain chemical graphs of molecules that are labelled active or inactive with respect to a pharmacological property.

This dataset from the NCI repository contains about 70,000 compounds, categorized into 72 (overlapping) datasets of compounds that inhibit the growth of different human tumor cells.

More chemoinformatics datasets

 

  • Other data repositories containing data sets that could be used for different types and forms of machine learning tasks:

 

UCI machine learning repository

ChemDB

http://cheminformatics.org/datasets/index.shtml

UCI KDD archive

http://pele.farmbio.uu.se/qsar-ml/qsarml-datasets.html

QSAR world

http://www.infochimps.com/tags/machine-learning

More Datasets

 

  • Other software and methods that I have either used in my experiments or come across in the related field (this list is more or less for my own follow-up).

SUBDUE and MoFa were among my favorite methods, as was GraphSig

Frequent graph mining with gSpan, FSG, GASTON

Frequent and maximal/closed frequent itemset mining with MAFIA

Constraint programming methods for itemset mining

GraphM is an efficient graph matching tool

GraphGen is a synthetic graph generator