Datasets of CogDL

CogDL now supports the following datasets for different tasks:

  • Network Embedding (Unsupervised node classification): PPI, Blogcatalog, Wikipedia, Youtube, DBLP, Flickr
  • Semi/Un-supervised node classification: Cora, Citeseer, Pubmed, Reddit, PPI, PPI-large, Yelp, Flickr, Amazon
  • Heterogeneous node classification: DBLP, ACM, IMDB
  • Link prediction: PPI, Wikipedia, Blogcatalog
  • Multiplex link prediction: Amazon, YouTube, Twitter
  • Graph classification: MUTAG, IMDB-B, IMDB-M, PROTEINS, COLLAB, NCI1, NCI109, REDDIT-BINARY

Node classification

| Dataset | #Nodes | #Edges | #Features | #Classes | #Train/Val/Test | Degree | Name in CogDL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Transductive | | | | | | | |
| Cora | 2,708 | 5,429 | 1,433 | 7(s) | 140 / 500 / 1000 | 2 | cora |
| Citeseer | 3,327 | 4,732 | 3,703 | 6(s) | 120 / 500 / 1000 | 1 | citeseer |
| PubMed | 19,717 | 44,338 | 500 | 3(s) | 60 / 500 / 1999 | 2 | pubmed |
| Chameleon | 2,277 | 36,101 | 2,325 | 5 | 0.48 / 0.32 / 0.20 | 16 | chameleon |
| Cornell | 183 | 298 | 1,703 | 5 | 0.48 / 0.32 / 0.20 | 1.6 | cornell |
| Film | 7,600 | 30,019 | 932 | 5 | 0.48 / 0.32 / 0.20 | 4 | film |
| Squirrel | 5,201 | 217,073 | 2,089 | 5 | 0.48 / 0.32 / 0.20 | 41.7 | squirrel |
| Texas | 182 | 325 | 1,703 | 5 | 0.48 / 0.32 / 0.20 | 1.8 | texas |
| Wisconsin | 251 | 515 | 1,703 | 5 | 0.48 / 0.32 / 0.20 | 2 | wisconsin |
| Inductive | | | | | | | |
| PPI | 14,755 | 225,270 | 50 | 121(m) | 0.66 / 0.12 / 0.22 | 15 | ppi |
| PPI-large | 56,944 | 818,736 | 50 | 121(m) | 0.79 / 0.11 / 0.10 | 14 | ppi-large |
| Reddit | 232,965 | 11,606,919 | 602 | 41(s) | 0.66 / 0.10 / 0.24 | 50 | reddit |
| Flickr | 89,250 | 899,756 | 500 | 7(s) | 0.50 / 0.25 / 0.25 | 10 | flickr |
| Yelp | 716,847 | 6,977,410 | 300 | 100(m) | 0.75 / 0.10 / 0.15 | 10 | yelp |
| Amazon-SAINT | 1,598,960 | 132,169,734 | 200 | 107(m) | 0.85 / 0.05 / 0.10 | 83 | amazon-s |
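
Several rows above give the train/validation/test split as fractions of the node count rather than as absolute counts. As a rough illustration, the sketch below turns such fractions into approximate node counts. The helper name `split_counts` and its rounding rule (floor train and validation, give the remainder to test) are assumptions made for this example and are not taken from CogDL's own splitting code.

```python
def split_counts(num_nodes, fractions):
    """Convert (train, val, test) fractions into approximate node counts.

    Hypothetical helper: train and val are floored, and test takes the
    remainder so that the three counts always sum to num_nodes.
    """
    train_frac, val_frac, test_frac = fractions
    if abs(train_frac + val_frac + test_frac - 1.0) > 1e-6:
        raise ValueError("fractions must sum to 1")
    train = int(num_nodes * train_frac)
    val = int(num_nodes * val_frac)
    test = num_nodes - train - val
    return train, val, test


# Chameleon row: 2,277 nodes split 0.48 / 0.32 / 0.20
print(split_counts(2277, (0.48, 0.32, 0.20)))
```

The exact counts produced by a real loader may differ by a node or two depending on how it rounds.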

Network Embedding (Unsupervised Node Classification)

| Dataset | #Nodes | #Edges | #Classes | Degree | Name in CogDL |
| --- | --- | --- | --- | --- | --- |
| PPI | 3,890 | 76,584 | 50(m) | 20 | ppi-ne |
| BlogCatalog | 10,312 | 333,983 | 40(m) | 32 | blogcatalog |
| Wikipedia | 4,777 | 184,812 | 39(m) | 39 | wikipedia |
| Flickr | 80,513 | 5,899,882 | 195(m) | 73 | flickr-ne |
| DBLP | 51,264 | 2,990,443 | 60(m) | 2 | dblp-ne |
| Youtube | 1,138,499 | 2,990,443 | 47(m) | 3 | youtube-ne |
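
The degree column is not defined in this document. One plausible reading, assumed here, is the edge count divided by the node count, rounded to the nearest integer. The sketch below recomputes it from the rows above and lists any row where the result disagrees with the listed value. The function names are illustrative only.

```python
# (name, nodes, edges, listed degree) copied from the table above.
# The Wikipedia node count is read as 4,777.
ROWS = [
    ("PPI", 3890, 76584, 20),
    ("BlogCatalog", 10312, 333983, 32),
    ("Wikipedia", 4777, 184812, 39),
    ("Flickr", 80513, 5899882, 73),
    ("DBLP", 51264, 2990443, 2),
    ("Youtube", 1138499, 2990443, 3),
]


def average_degree(nodes, edges):
    # Assumed definition: edges per node (not 2 * edges / nodes).
    return edges / nodes


def inconsistent_rows(rows):
    """Return names whose listed degree differs from round(edges / nodes)."""
    return [name for name, n, e, deg in rows
            if round(average_degree(n, e)) != deg]


for name, n, e, deg in ROWS:
    print(f"{name}: computed {average_degree(n, e):.1f}, listed {deg}")
print("rows to double-check:", inconsistent_rows(ROWS))
```

Under this assumption every row agrees except DBLP, whose listed edge count is identical to the Youtube row's. That value is worth checking against the upstream dataset.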

Heterogeneous Graph

| Dataset | #Nodes | #Edges | #Features | #Classes | #Train/Val/Test | Degree | #Edge Types | Name in CogDL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DBLP | 18,405 | 67,946 | 334 | 4 | 800 / 400 / 2857 | 4 | 4 | gtn-dblp(han-dblp) |
| ACM | 8,994 | 25,922 | 1,902 | 3 | 600 / 300 / 2125 | 3 | 4 | gtn-acm(han-acm) |
| IMDB | 12,772 | 37,288 | 1,256 | 3 | 300 / 300 / 2339 | 3 | 4 | gtn-imdb(han-imdb) |
| Amazon-GATNE | 10,166 | 148,863 | - | - | - | 15 | 2 | amazon |
| Youtube-GATNE | 2,000 | 1,310,617 | - | - | - | 655 | 5 | youtube |
| Twitter | 10,000 | 331,899 | - | - | - | 33 | 4 | twitter |

Knowledge Graph Link Prediction

| Dataset | #Nodes | #Edges | #Train/Val/Test | #Relation Types | Degree | Name in CogDL |
| --- | --- | --- | --- | --- | --- | --- |
| FB13 | 75,043 | 345,872 | 316,232 / 5,908 / 23,733 | 12 | 5 | fb13 |
| FB15k | 14,951 | 592,213 | 483,142 / 50,000 / 59,071 | 1345 | 40 | fb15k |
| FB15k-237 | 14,541 | 310,116 | 272,115 / 17,535 / 20,466 | 237 | 21 | fb15k237 |
| WN18 | 40,943 | 151,442 | 141,442 / 5,000 / 5,000 | 18 | 4 | wn18 |
| WN18RR | 86,835 | 93,003 | 86,835 / 3,034 / 3,134 | 11 | 1 | wn18rr |
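
For the knowledge graph datasets, the train/validation/test figures count triples, so they would be expected to add up to the edge count. The sketch below is a small sanity check on the numbers listed above. It is an illustration written for this document, not part of CogDL, and the name `split_gap` is hypothetical.

```python
# (name, edges, (train, val, test)) copied from the table above.
KG_ROWS = [
    ("FB13", 345872, (316232, 5908, 23733)),
    ("FB15k", 592213, (483142, 50000, 59071)),
    ("FB15k-237", 310116, (272115, 17535, 20466)),
    ("WN18", 151442, (141442, 5000, 5000)),
    ("WN18RR", 93003, (86835, 3034, 3134)),
]


def split_gap(edges, splits):
    """Difference between the listed edge count and the sum of the splits."""
    return edges - sum(splits)


gaps = {name: split_gap(edges, splits) for name, edges, splits in KG_ROWS}
for name, gap in gaps.items():
    print(f"{name}: #Edges - (train + val + test) = {gap}")
```

With the figures as listed, four rows sum exactly and FB13 differs by one triple.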

Graph Classification

TUDataset from https://www.chrsmrrs.com/graphkerneldatasets

| Dataset | #Graphs | #Classes | Avg. Size | Name in CogDL |
| --- | --- | --- | --- | --- |
| MUTAG | 188 | 2 | 17.9 | mutag |
| IMDB-B | 1,000 | 2 | 19.8 | imdb-b |
| IMDB-M | 1,500 | 3 | 13 | imdb-m |
| PROTEINS | 1,113 | 2 | 39.1 | proteins |
| COLLAB | 5,000 | 5 | 508.5 | collab |
| NCI1 | 4,110 | 2 | 29.8 | nci1 |
| NCI109 | 4,127 | 2 | 39.7 | nci109 |
| PTC-MR | 344 | 2 | 14.3 | ptc-mr |
| REDDIT-BINARY | 2,000 | 2 | 429.7 | reddit-b |
| REDDIT-MULTI-5k | 4,999 | 5 | 508.5 | reddit-multi-5k |
| REDDIT-MULTI-12k | 11,929 | 11 | 391.5 | reddit-multi-12k |
| BBBP | 2,039 | 2 | 24 | bbbp |
| BACE | 1,513 | 2 | 34.1 | bace |
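
The last column of each table above gives the string used to select a dataset in CogDL. A minimal sketch of loading by name follows. It assumes CogDL is installed and that the installed version exposes `build_dataset_from_name` in `cogdl.datasets`, so check this against your version. The `EXAMPLE_NAMES` mapping and the `load_dataset` wrapper are illustrative and not part of CogDL.

```python
# A few names copied from the tables above, grouped by task.
EXAMPLE_NAMES = {
    "node_classification": ["cora", "citeseer", "pubmed"],
    "network_embedding": ["ppi-ne", "blogcatalog", "wikipedia"],
    "graph_classification": ["mutag", "imdb-b", "proteins"],
}


def load_dataset(name):
    """Load a dataset by its CogDL name.

    Assumes CogDL is installed and that cogdl.datasets provides
    build_dataset_from_name; the first call may download data.
    """
    # Imported lazily so this module can be read without CogDL installed.
    from cogdl.datasets import build_dataset_from_name
    return build_dataset_from_name(name)


for task, names in EXAMPLE_NAMES.items():
    print(task, names)

# Example call (requires CogDL and network access on first use):
# dataset = load_dataset("cora")
```

Keeping the import inside the function means the name lists can be inspected even where CogDL is unavailable.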