DSI and DELIS

The DELIS (Dynamically Evolving Large-scale Information Systems) European project deals with methods, techniques, tools, and prototype implementations for coping with the challenges posed by the size and dynamics of today's and, especially, future information systems. Special emphasis is placed on exploiting synergies between the groups that take part in the project. Within DELIS, networks and systems are studied with two questions in mind: how the behavior of individual (user) nodes affects the network as a whole, and whether, and how, a useful networking infrastructure can be attained given individual, self-centered behavior. Thus, the key technical challenge in DELIS is to create a self-organizing, self-repairing, self-monitoring network infrastructure out of largely autonomous, perhaps egoistic and selfish peers coexisting with altruists. The DELIS subproject structure exemplifies this.

The DSI (Computer Science Department) of the Università di Milano takes part in Subproject 1 (Monitoring, Visualizing, and Analyzing Large Dynamically Evolving Information Systems), which is aimed, in particular, at analyzing the structure of large dynamic networks, and specifically of the web graph. Given a portion of the web, its web graph is a directed graph whose nodes are the web pages, with an arc from node A to node B if and only if the text of page A contains a hypertext link to page B. Web graphs are of utmost importance in various fields: for example, in the way search engines organize their results, in the search for virtual communities, and in the analysis of web spam.
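As an illustration of this definition, the following minimal Python sketch builds the web graph of a tiny collection of pages. The function name `build_web_graph`, the toy URLs, and the choice of dropping links that point outside the collection are assumptions made for the example; they do not describe the tools used to build the actual data sets.

```python
def build_web_graph(pages):
    """Build a web graph from a mapping: page URL -> URLs linked in its text.

    Nodes are the pages of the collection; there is an arc A -> B iff
    page A links to page B. Links to pages outside the collection are
    dropped (an assumption of this sketch).
    """
    nodes = sorted(pages)  # deterministic node numbering
    index = {url: i for i, url in enumerate(nodes)}
    successors = [
        sorted({index[target] for target in pages[url] if target in index})
        for url in nodes
    ]
    return nodes, successors


# Hypothetical three-page portion of the web.
toy_pages = {
    "http://a.example.uk/": ["http://b.example.uk/", "http://c.example.uk/"],
    "http://b.example.uk/": ["http://c.example.uk/", "http://elsewhere.example/"],
    "http://c.example.uk/": ["http://a.example.uk/"],
}

nodes, successors = build_web_graph(toy_pages)
num_arcs = sum(len(s) for s in successors)
print(len(nodes), "nodes,", num_arcs, "arcs")
```

Nodes are numbered in sorted URL order, and each node stores the sorted list of its successors; duplicate links from the same page count as a single arc.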

With the aim of studying real-world web graphs and their evolution over time, we started systematically collecting large portions of the Web. These collections are a starting point that allows other partners, for example, to test their statements and models about the temporal evolution of the statistical and topological properties of the web graph, and about the possible impact of different crawling strategies.

How/when/where the snapshots were collected

Together with our DELIS partners, we decided to take monthly snapshots of the .uk domain. The choice of domain was the most obvious one, given the European nature of the project and the linguistic and social centrality of the UK within Europe; the monthly frequency is the highest that does not raise politeness issues. The first snapshot was collected in May 2006. All snapshots were collected at the DSI, using hardware that was partly funded by the DELIS project.

All the data sets were collected using UbiCrawler, a scalable, fault-tolerant and fully distributed web crawler developed by the Laboratory for Web Algorithmics (LAW) at the DSI. The corresponding web graphs are made available on this page in the WebGraph compressed format; please read the tutorial to learn how to use this format.

How to get the data

At present, the data is available only to DELIS project members; six months after the end of the project, all the data will be made publicly available.

If you participate in the DELIS project and need access to the data, please join the LAW-DELIS mailing list.

The snapshots

Full-texts

Dataset     Pages      Size (GB)  GZip Size (GB)
uk-2006-06  112386763  1893.11    402.45
uk-2006-07  136956559  2287.36    477.03
uk-2006-08  141395895  2424.82    507.59
uk-2006-09  148965298  2756.61    546.70
uk-2006-10  129558491  2336.19    478.31
uk-2006-11  150146132  2637.70    546.81
uk-2006-12  144489446  2552.80    525.77
uk-2007-01  151578113  2651.65    553.97
uk-2007-02  153966540  2692.88    564.98
uk-2007-03  151427461  2568.80    545.80
uk-2007-04  150606689  2700.06    559.84
uk-2007-05  150054551  2658.18    556.46

Graphs

Dataset     Nodes      Arcs        Size (GB)  bit/arc
uk-2006-06  80644902   2481281617  0.99       3.078
uk-2006-07  96395298   3030665444  1.28       3.303
uk-2006-08  100751978  3250153746  1.35       3.256
uk-2006-09  106288541  3871625613  1.45       2.929
uk-2006-10  93463772   3130910405  1.14       2.829
uk-2006-11  106783458  3479400938  1.29       2.860
uk-2006-12  103098631  3768836665  1.34       2.778
uk-2007-01  108563230  3929837236  1.38       2.723
uk-2007-02  110123614  3944932566  1.39       2.744
uk-2007-03  107565084  3642701825  1.34       2.848
uk-2007-04  106867191  3790305474  1.36       2.792
uk-2007-05  105896555  3738733648  1.30       2.695
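
The bit/arc column can be related to the arc count with a little arithmetic. The sketch below assumes that bit/arc is the average number of bits the compressed graph spends per arc, so that arcs × bit/arc estimates the size of the compressed arc data. The function name and this interpretation are assumptions, and the Size column need not match the estimate exactly, since it may account for more than the arc data.

```python
def estimated_graph_bytes(arcs, bits_per_arc):
    """Estimate the compressed arc data in bytes as arcs * bits_per_arc / 8."""
    return arcs * bits_per_arc / 8


# Figures copied from the uk-2006-06 row of the table above.
arcs = 2481281617
bits_per_arc = 3.078

size_bytes = estimated_graph_bytes(arcs, bits_per_arc)
print(f"about {size_bytes / 1e9:.2f} * 10^9 bytes")
```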

Overlap (host count)

            uk-2006-06  uk-2006-07  uk-2006-08  uk-2006-09  uk-2006-10  uk-2006-11  uk-2006-12  uk-2007-01  uk-2007-02  uk-2007-03  uk-2007-04  uk-2007-05
uk-2006-06  94967       73304       71686       69899       65516       64501       59478       62459       62447       58953       57671       57747
uk-2006-07              130778      102250      99489       89951       90909       81491       86741       88143       84082       82731       82138
uk-2006-08                          128505      102873      84999       90378       81023       86489       86762       81908       80066       79637
uk-2006-09                                      136605      88006       94655       84335       90887       89620       84993       82097       81156
uk-2006-10                                                  109918      86175       75831       81130       81614       76616       75660       75128
uk-2006-11                                                              121208      86714       91461       91549       84125       82322       81664
uk-2006-12                                                                          113471      88852       84335       79298       76254       75850
uk-2007-01                                                                                      125134      94259       86402       84474       83127
uk-2007-02                                                                                                  122956      91094       87864       86708
uk-2007-03                                                                                                              122506      84971       83839
uk-2007-04                                                                                                                          113157      91636
uk-2007-05                                                                                                                                      114529
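
One way to read the table is to turn overlaps into fractions. The sketch below assumes that a diagonal entry is the number of hosts in a snapshot and that an off-diagonal entry is the number of hosts two snapshots share. Under that assumption it computes the fraction of hosts of an earlier snapshot that are still present in a later one, and the Jaccard coefficient of the two host sets. The helper names are hypothetical, and only the first three snapshots are included to keep the example short.

```python
# Upper-triangular overlap counts copied from the table above
# (first three snapshots only).
names = ["uk-2006-06", "uk-2006-07", "uk-2006-08"]
rows = [
    [94967, 73304, 71686],
    [130778, 102250],
    [128505],
]

# Store each count under an order-independent key, so that
# (a, b) and (b, a) refer to the same entry.
overlap = {}
for i, row in enumerate(rows):
    for offset, count in enumerate(row):
        overlap[frozenset((names[i], names[i + offset]))] = count


def common(a, b):
    """Number of hosts shared by snapshots a and b (a == b gives the total)."""
    return overlap[frozenset((a, b))]


def retained_fraction(earlier, later):
    """Fraction of the hosts of `earlier` that also appear in `later`."""
    return common(earlier, later) / common(earlier, earlier)


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of the two host sets."""
    shared = common(a, b)
    return shared / (common(a, a) + common(b, b) - shared)


print(f"{retained_fraction('uk-2006-06', 'uk-2006-07'):.3f}")
print(f"{jaccard('uk-2006-06', 'uk-2006-07'):.3f}")
```

The same helpers apply unchanged to any other pair of rows once their counts are added to `rows`.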

Overlap (static URLs)

            uk-2006-06  uk-2006-07  uk-2006-08  uk-2006-09  uk-2006-10  uk-2006-11  uk-2006-12  uk-2007-01  uk-2007-02  uk-2007-03  uk-2007-04  uk-2007-05
uk-2006-06  31316403    19034355    18260762    17169965    15264484    14997442    13833225    13675126    13211566    12321657    11912703    11142177
uk-2006-07              35160319    23301313    21706032    18531813    18266515    16195046    16407109    15968929    15167845    14545243    13577199
uk-2006-08                          37263278    24265507    19372509    19379724    17130692    17336163    16709686    15507101    15024064    13802044
uk-2006-09                                      39946097    21240955    21246302    18743749    19047735    18089154    16877628    16304746    14740879
uk-2006-10                                                  33812043    22246367    19041233    19059267    18264341    16626253    16355457    15007365
uk-2006-11                                                              37337242    22297448    21882830    21279911    18826436    18264553    16568057
uk-2006-12                                                                          36641056    23526467    20984358    18999386    17903952    16621065
uk-2007-01                                                                                      39042257    23702058    20773373    19932717    18394875
uk-2007-02                                                                                                  37693732    23076728    22180337    19977866
uk-2007-03                                                                                                              38109722    22364126    20204640
uk-2007-04                                                                                                                          36896849    24202971
uk-2007-05                                                                                                                                      36864749