Block-based Migration from HTML4 Standard to HTML5 Standard in the Context of Web Archives

Andrés Sanoja (1), Stéphane Gançarski (2)
andres.sanoja@ciens.ucv.ve, stephane.gancarski@lip6.fr
(1) Escuela de Computación, Universidad Central de Venezuela, Caracas, Venezuela
(2) Laboratoire d'Informatique de Paris 6, Université Pierre et Marie Curie, Paris, France

Abstract: Web archives are not exempt from format obsolescence. In the near future, Web pages written in HTML4 could become obsolete, and we will have to choose between two preservation strategies: emulation and migration. The first is the most evident; however, due to the size of the Web and the amount of information that Web archives handle, it is not practical. Migration to the HTML5 format, on the other hand, seems plausible. It is challenging because we must modify a page (in HTML4 format) and include elements that do not even exist in that format (such as the HTML5 semantic elements). Using Web page segmentation, we show that, with the appropriate granularity, blocks resemble these semantic elements. We present the use of our segmentation tool, BoM (Block-o-Matic), to help achieve the migration of Web pages from the HTML4 format to the HTML5 format in the context of Web archives. We also present an evaluation framework for Web page segmentation that produces the metrics needed to compare the original and migrated versions: if both versions are similar, the migration has been successful. We describe the experiments and the results obtained on a sample of 40 pages, for which we produced manual segmentations with our MoB tool. The results show that there is no data loss in the migration process, but that in the migrated version (after adding the semantic elements) margins change: the added whitespace shifts elements slightly on the page. While this is imperceptible to the human eye, it is difficult for systems to handle without prior knowledge of this situation.

Keywords: Migration; Web; Segmentation; Blocks; HTML5; Web Archive; Format Obsolescence.

1. INTRODUCTION

Obsolescence, adjustment, and renewal are necessary parts of the development cycle. Improvements usually require changes, in technologies, products and processes, and in people as well. In July 2012, the WWW Consortium introduced a recommendation for HTML5 (the proposed recommendation came out in September 2014). It represents an important change with respect to the preceding version of HTML and to the XHTML specification. For instance, it introduces the semantic tags, allowing browsers to easily access contents, audio and video, among others.

The first question raised by HTML5 is: why use it? Laws [6] discusses this from the competition point of view and concludes that organizations and publishers need to be ready for this technological change if they want to outperform their competitors and stay in the technological race. This raises another question: once publishers switch to HTML5, what happens to the existing HTML4 content? The W3C and the WHAT working group are figuring out how Web browsers can remain compatible with older versions of the specification; they state that it is necessary to evolve HTML incrementally (see http://diveintohtml5.info). The strategy is to process these pages differently. So far, Web browsers have been very permissive with malformed documents.

In general, Web archives store pages along with all their dependencies. We agree with Rosenthal [14] that, eventually, modern browsers will no longer be able to render documents in the HTML4 or XHTML formats in a proper way (i.e., they will not be so permissive). Thus, a strategy for their preservation must be adopted: archivists must decide to perform either emulation or migration. In the context of digital preservation, emulation is "the replicating of functionality of an obsolete system, but on the hard- and software environment in which the object is rendered" [20]. In other words, emulation consists in recreating the environment in which a Web page was originally created, which implies keeping old versions of tools, or old tools. Migration refers to transferring data to newer system environments [3]. This includes converting a Web page from one file format to another, so that the resource, including its functionalities, remains fully accessible.

Rosenthal also describes the difficulties of relying only on emulation: its cost is very high in terms of storage and operation. Conversely, migrating Web content from an obsolete format to a current one seems to be a good strategy to minimize emulation, although it increases data duplication and carries the risk of losing document information in the process.

The obsolescence of Web content is usually associated with its presentation, that is, its rendering and visual aesthetics. However, the document semantics should also be taken into account. The main goal of HTML5 is to improve the language, keeping it readable by humans and by computers, useful, and able to enrich the semantic content of documents.

In this article we present how we use Web page segmentation to perform the migration of HTML4 pages to the HTML5 format. We think that a block-based solution is more effective than a tag-by-tag approach, since we must differentiate between "regular" tags and "semantic" tags. Semantic tags (in theory) have no impact on the rendering of the page, but they help to organize the content into coherent regions. Thus, segmentation is relevant for the migration, which can be performed by segmenting HTML4 pages and incorporating semantic tags into the result.

To measure the correctness of a migration we perform a Web page segmentation evaluation with a set of predefined manual segmentations and the corresponding migrated versions. To this end, we present Block-o-Matic (BoM), our segmentation approach, and a model for evaluating segmentation algorithms; we apply both in this work to measure the correctness of the migration. The manual segmentation is made using the Manual-design-Of-Blocks tool (MoB), and the computed one with BoM. In this process we give a score based on the geometry of both segmentations; the labels of each block are also compared.

The document is organized as follows. Section 2 discusses related work and Section 3 gives an overview of the migration process. In Section 4 we present the Web page segmentation concepts and notation. In Section 5 we present BoM, our approach to Web page segmentation. In Section 6 we present our evaluation framework and in Section 7 our solution, while Section 8 describes the experiments and Section 9 the results. We conclude in Section 10 with perspectives and outlook.

2. RELATED WORK

Several efforts have been made to make the migration from one format to another uniform [13]. Existing methods usually perform a tag-by-tag migration; in other words, they translate tags. However, it is difficult to define an appropriate translation to the HTML5 semantic tags (which define the layout of the Web page) from HTML4 pages, where such tags do not exist. There are many online references for performing a tag-by-tag migration (googling the terms "translating html4 tags to html5" will return them); however, as far as our knowledge goes, there are very few systematic and automatic approaches to the problem described above.

As an example, Park et al. [11] present their experience in the migration of ETDs (Electronic Theses and Dissertations) from the PDF format to the HTML5 format. Most ETDs have linked multimedia documents connected by hyperlinks (in PDF format); storing them in this format requires keeping the corresponding multimedia readers, libraries and plug-ins as well. HTML5 is a convenient migration format because it makes it possible to have a single file with all of the content linked together, including all of the multimedia information in the ETD, and with metadata available for Web search indexing and other general tasks.

Rosenthal et al. [14] present and describe the design and implementation of a transparent, on-access format migration capability for the LOCKSS system for preserving Web content. Their implementation is capable of transparently presenting content collected in one Web format to readers in another Web format, with no changes needed to browsers. They present a use case of this type of migration, in which images in the GIF format are migrated to the PNG format. They identify the practical difficulties that face any implementation of emulation; these led them to choose the migration strategy.

Conversely, Jackson [4] describes a method to identify how the HTML and PDF formats change in Web archives through time. He concludes that software obsolescence is rare on the Web and uncovers evidence indicating that network effects act to stabilise formats against obsolescence. However, we think that obsolescence can occur in Web environments: we observe this behaviour with old plugins (e.g., Macromedia Shockwave content) in old Web pages. We agree with Rosenthal that any format is susceptible of becoming obsolete, and HTML4 and earlier formats are not the exception.

In the following sections we present our approach to Web page segmentation and its evaluation, as a preliminary to describing our migration approach.

3. OVERVIEW OF THE MIGRATION PROCESS

In this section we present an overview of the migration process. The idea is to take a Web page in HTML4 format and produce a version of the same page that conforms to the HTML5 format. The main goal of the process described in this paper is to measure to what extent this process is correct, and how reliable it is. The process is illustrated in Figure 1 and can be divided into five steps, described as follows (a sketch wiring the steps together is given after the list):

1) Segmentation of the input page: a Web page in HTML4 format is segmented using the BoM segmenter (cf. Section 5).
2) Automatic label assignment: based on the properties and characteristics of the blocks found in the segmentation, we assign a label (i.e., an HTML5 semantic element tag) to each block (cf. Section 8.4).
3) Manual segmentation and label assignment: using the MoB tool (cf. Section 8.2) we produce an ideal segmentation of the input page. In the same process, the user assigns a label to each block.
4) Measure of labels: from both segmentations (i.e., the manual and the automatic one) we apply measures to determine how different the two label assignments are. The metrics are described in detail in Section 8.5.
5) Measure of rendering errors: using the Web page segmentation evaluation framework (cf. Section 6) we measure the differences between the renderings, based on both the automatic and the manual segmentation.
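Read as a pipeline, the five steps compose naturally. The following is a minimal sketch in Python; every callable it receives (the segmenter, the labeler, the ground-truth loader and the two comparators) is an illustrative placeholder for the corresponding component described in this paper, not the actual BoM or MoB API.

    # Hypothetical driver for the five-step process. Each step is passed
    # in as a callable; the names are placeholders, not a real API.
    def migrate_and_evaluate(page, segment, assign_labels, load_ground_truth,
                             compare_labels, compare_rendering):
        computed = segment(page)                            # step 1: BoM segmentation
        labeled = assign_labels(computed)                   # step 2: HTML5 semantic labels
        truth = load_ground_truth(page)                     # step 3: MoB manual segmentation
        label_metrics = compare_labels(truth, labeled)      # step 4: label measures
        render_metrics = compare_rendering(truth, labeled)  # step 5: rendering measures
        return label_metrics, render_metrics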
Figure 1: Migration Overview

From the automatic segmentation outcome it is possible to produce the HTML representation, that is, the migrated Web page. This detail is not included in this paper; technically, it is the transformation of an XML document into an HTML Web page.

4. WEB PAGE SEGMENTATION

Web page segmentation refers to the process of dividing a Web page into visually and semantically coherent segments called blocks. To determine the coherence of each segment we rely on the content category classification made by the W3C for the HTML5 specification (e.g., sectioning content). Detecting these different blocks is a crucial step for many applications, such as mobile devices [23], information retrieval [1], Web archiving [15], Web accessibility [8], and evaluating the visual quality (aesthetics) of pages [22], among others.

In the context of Web archiving, segmentation can be used to extract the interesting parts of a page to be stored. By giving relative weights to blocks according to their importance, it also allows for detecting important changes (changes in important blocks) between page versions [12]. This is useful for crawling optimization, as it permits tuning crawlers so that they revisit pages with important changes more often [15]. It also helps to control preservation actions, by comparing the page version before and after the action.

It is crucial for Web page segmentation to know which elements of the page are considered. For a Web page we extract the visual and structural aspects found in its rendered DOM (W). From the structure we extract the elements in the form of a hierarchy (the DOM tree) and the root element (W.root). We obtain the text of the page (W.text) by recursively concatenating the text of all elements. Each element corresponds to an HTML5 content category. From the visual rendering we get the visual cues (lines, blank areas, colors, pictures, fonts, etc.) and the box of each element (a rectangle). A special box, called the viewport, represents the body element.

4.1. Concepts

Inspired by the concepts presented by Tang [19] and Nie [10], we describe Web page segmentation with the following abstractions (a minimal sketch of the rectangle geometry follows the list):

• Page is a special block that represents the whole Web page and covers the whole viewport.
• Simple block is an element or a group of elements; it is also denoted simply as block. It is represented as the rectangular area resulting from merging the boxes of its elements. Each block has a label related to those of the underlying elements, and it is also associated with the text of those elements.
• Composite block is a special block that can contain other blocks. Usually such blocks correspond to template elements.
• Block graph is a connected planar graph representing the blocks and their relationships (e.g., parent/child). It can be an edge-weighted graph (each edge has a weight) or a vertex-weighted graph (each vertex has a weight). A weight associated with a vertex usually represents how coherent a block is, while a weight associated with an edge usually represents the cost of merging two blocks, or the distance or similarity between them.
• Geometric model represents the set of blocks as a set of rectangles in the plane, obtained from the layout of the Web page. All rectangles are modeled as quadruples (x, y, w, h), where x and y are the coordinates of the origin point and w and h are the width and height of the rectangle. Blocks can be represented in the plane as a hierarchy or as a set of non-overlapping rectangles, called a Manhattan layout [19]. It can be hierarchical [1] or non-hierarchical [2], [5]; the latter can be obtained from the former by considering only the leaves.
• Stop condition is a predefined value (a real number) used by algorithms to indicate when a segmentation is achieved. It is based on the edge/vertex weights of the block graph. An algorithm may have one or more stop conditions.
• Label is the role that a block plays in the Web page, such as navigation, content, header, footer, etc.
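The geometric model can be sketched with a few lines of Python. The Rect class and its method names are ours, introduced only to make the (x, y, w, h) quadruples and the containment relation concrete; they are not part of BoM.

    from dataclasses import dataclass

    @dataclass
    class Rect:
        x: float
        y: float
        w: float
        h: float

        def contains(self, other: "Rect") -> bool:
            # True if `other` lies entirely inside this rectangle.
            return (self.x <= other.x and self.y <= other.y and
                    other.x + other.w <= self.x + self.w and
                    other.y + other.h <= self.y + self.h)

        def area(self) -> float:
            return self.w * self.h

    page = Rect(0, 0, 1024, 3000)   # the Page/viewport rectangle
    block = Rect(0, 0, 1024, 120)   # e.g. a header block
    assert page.contains(block)     # every block fits inside Page.rect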
4.2. Notation

We present in this section several definitions, in order to give a uniform presentation of Web page segmentation algorithms.

a) The Segmentation Function: The segmentation function Φ is described as follows:

    Φ_A(W, SC) → (W'_A, GM_A)    (1)

where A is a Web page segmentation algorithm, W is the rendered DOM of a Web page and SC is a set of stop conditions. W'_A is the block graph defined just below, and GM_A is a set of rectangles representing the geometric model of the segmentation.

b) The Block Graph: The block graph is defined as a planar graph W'_A = (Blocks, Edges). Each vertex B in Blocks corresponds to a rectangle in GM_A (denoted B.rect) and a label (denoted B.label). The graph is associated with a weight function on the edges and vertices, and with two subsets of vertices: SimpleBlocks ⊂ Blocks (also called terminals) and CompositeBlocks ⊂ Blocks, which includes a special vertex Page, labeled as the root of the graph. The rectangle of the vertex Page covers the whole viewport of the Web page W and all the blocks fit inside it. Thus,

    ∀ B ∈ Blocks, B.rect ⊆ Page.rect

The weight of a vertex B is noted B.weight; the weight of an edge E is noted E.weight. Usually the block graph is a tree. However, some algorithms, such as the Gomory-Hu-tree-based approach [7] and GraphBased [2], define it as a general planar graph.

5. BLOCK-O-MATIC (BoM): A NEW WEB PAGE SEGMENTER

In this section we present BoM, our Web page segmentation approach. One of the main features of BoM is that it segments a Web page without previous knowledge of its content, using only heuristic rules derived from the W3C Web standards. For instance, we detect blocks using HTML5 content categories instead of tag names or text features. This gives genericity to BoM and allows it (in theory) to segment all types of Web pages. Another feature of our approach is the introduction of methods and techniques from document processing systems: we leverage existing techniques from the field of computer vision for segmenting scanned documents, and adapt them to Web pages. This produces more useful results for the applications that depend on the segmentation, such as block labels.

Let W be the rendered DOM of a Web page. A segmentation Φ_BoM of W is defined as follows:

    Φ_BoM(W, pA, pD, pND) = (W'_BoM, GM_BoM)

where W'_BoM is the block graph (a tree) of the segmentation, GM_BoM is the geometric model and pA is the stop condition. In BoM, the stop condition is the normalized area parameter, which is the size of a block in proportion to the page. Two other parameters are used in the algorithm: pD, the distance parameter used for merging blocks, and pND, which is used to compute the normalized area and the weights of blocks. The pA and pD parameters are described in detail in Sections 5.3 and 5.4; pND is described at the end of this section, where the weight of a block is computed.

Each block B is associated with its rectangle (B.rect), its label (B.label), its weight (B.weight) as defined in Section 4.1, and a set of DOM elements (B.elements).

Consider W'_BoM as a rooted, planar and vertex-weighted tree. The root vertex is the Page block, the inner vertices are the composite blocks, and the terminal vertices are the simple blocks. The edges between blocks represent a hierarchical relationship of geometric containment. In other words, considering Page, Bc and Bp ∈ Blocks, the following constraints apply:

1) For every pair of blocks (Bc, Bp), where Bp is the parent of Bc in the W'_BoM tree, we write Bc child of Bp and Bp parent of Bc.
2) For every block Bc, child of Bp, Bc.rect is contained in Bp.rect:

    ∀ Bc, Bp, Bc child of Bp ⇒ Bc.rect ⊂ Bp.rect

3) The Page rectangle covers the whole page and all blocks fit inside it:

    ∀ b ∈ Blocks, b.rect ⊆ Page.rect

Only simple blocks are associated with DOM elements; for the Page and the composite blocks, B.elements is the empty set.

The weight of a block is the normalized area of its rectangle. It is used to check the stop condition (cf. Section 4.1). Thus, the weight of a block B is:

    B.weight = 0.1 × pND × (B.rect.w × B.rect.h) / (Page.rect.w × Page.rect.h)

where pND is a predefined constant. In this work we fix this value to pND = 100, so that both B.weight and pA belong to the interval [0, 10].
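As a quick numeric illustration of this formula, consider the following sketch (Python; the rectangle layout and function name are ours, not BoM's API):

    # Block weight as normalized area: 0.1 * pND * area(B) / area(Page).
    # With pND = 100, weights fall in [0, 10], the same range as pA.
    def block_weight(block_rect, page_rect, pND=100):
        bx, by, bw, bh = block_rect   # rectangles as (x, y, w, h)
        px, py, pw, ph = page_rect
        return 0.1 * pND * (bw * bh) / (pw * ph)

    # A block covering 5% of the page area gets weight 0.5:
    print(block_weight((0, 0, 256, 600), (0, 0, 1024, 3000)))  # 0.5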
5.1. Model

In this section we present the Web page segmentation model. It is a hybrid approach that follows a bottom-up strategy [21]. First we describe the segmentation as a black box, indicating its input and output; a more detailed explanation follows, describing the three sub-processes that achieve the final segmentation.

We define Web page segmentation as the process of finding coherent regions of content (blocks) in the rendered DOM (W) of a Web page. As a result, the block graph W'_BoM and the geometric model GM_BoM are produced. The block graph is a tree structure as defined in Section 4.1. Figure 2 shows how a rendered Web page W is segmented: the output is the block graph W'_BoM, shown on the right side of the figure, and the geometric model, shown in the center.

Figure 2: Segmentation Model Example

The sub-processes of the segmentation are:

1) Fine-grained segmentation construction: builds the fine-grained segmentation of W, producing W'_BoM and GM_BoM.
2) Composite block detection: detects the composite blocks. This sub-process updates W'_BoM and GM_BoM.
3) Merging of blocks: merges blocks according to their area, distance, alignment, labels and content categories. This sub-process produces the final version of W'_BoM and GM_BoM.

5.2. Fine-grained Segmentation Construction

The idea of the fine-grained segmentation is to find coherent blocks as small as possible. It serves as a starting point for the whole process by creating a first version of the block graph W'_BoM and of the geometric model GM_BoM.

The condition C that a DOM element must satisfy to be considered a block is that it does not belong to the following content categories: text, phrasing, embedded, interactive or form-associated elements. The value of the label (B.label) is the most inclusive content category of the block's elements (B.elements). For instance, if the block has one element whose content category is flow, the label of the block is flow. If the block is associated with two elements, one in the embedded category and the other in the heading category, the most inclusive category is flow. Figure 4 shows which content categories include other content categories.

Figure 4: HTML5 Content Models (source: http://www.w3.org)

The process begins at the leaves of the DOM tree and moves towards W.root. If an element is found that meets the condition C defined above, the process stops for this branch. Figure 3 shows how an element is selected as a block: element li is the first element that does not belong to the categories listed above, so it is marked as a block and the label flow is assigned. From the information obtained during this sub-process, the geometric model (cf. Section 4.1) and a first version of the block graph are built.

Figure 3: Block Detection Based on Content Categories

Algorithm 1 shows the steps to build the fine-grained segmentation. First, the rendered DOM tree W is traversed and the leaf elements are selected (line 5). If a selected element does not match the condition C, its parent becomes the current element (lines 7-8). The same process continues until either the W.root element (i.e., the body element) is reached or the current element meets the condition C. If the condition C is met, a new block is created (lines 10-11). The element becomes the block's element (line 12), the block label is the element's category (line 13), a new rectangle is created from the box of the element (line 14), the geometric model is updated (line 15), the rectangle is associated with the block (line 16) and the weight is computed (line 17). The block graph is updated with the new block b, adding an edge between the Page block and block b (lines 18-19).

    Data: rendered DOM W
    Result: block graph W'_BoM, geometric model GM_BoM
     1  Blocks = {Page};
     2  E = {};
     3  W'_BoM = (Blocks, E);
     4  GM_BoM = {};
     5  Terminal <- getTerminalElements(W);
     6  foreach element in Terminal do
     7      while element != W.root and not C(element) do
     8          element <- element.parentElement;
     9      end
    10      if element != W.root then
    11          create block b;
    12          b.elements <- element;
    13          b.label <- element.category;
    14          rect <- createRectangleFromElement(element);
    15          add rectangle rect to GM_BoM;
    16          b.rect <- rect;
    17          b.weight <- normalized_area(b);
    18          add vertex b to W'_BoM;
    19          add edge (Page, b) to E;
    20      end
    21  end
    Algorithm 1: Fine-grained Segmentation Construction

The fine-grained segmentation forms a flat segmentation, that is, height(Page) = 1.
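The leaf-climbing loop of Algorithm 1 can be sketched in Python over a simplified DOM. The Element class and its fields stand in for the rendered DOM and are not a browser API; only the climb-until-C logic is illustrated.

    from dataclasses import dataclass, field
    from typing import List, Optional

    NON_BLOCK = {"text", "phrasing", "embedded", "interactive", "form-associated"}

    @dataclass(eq=False)   # identity-based equality; the tree has parent cycles
    class Element:
        category: str
        children: List["Element"] = field(default_factory=list)
        parent: Optional["Element"] = None

    def condition_C(el: Element) -> bool:
        # Condition C: the element lies outside the excluded content categories.
        return el.category not in NON_BLOCK

    def leaves(el: Element):
        if not el.children:
            yield el
        for child in el.children:
            yield from leaves(child)

    def fine_grained(root: Element) -> List[Element]:
        """Climb from every leaf towards the root until condition C holds."""
        blocks: List[Element] = []
        for el in leaves(root):
            while el is not root and not condition_C(el):
                el = el.parent
            if el is not root and el not in blocks:
                blocks.append(el)   # becomes a simple block, child of Page
        return blocks

    # body > ul > li > a: the interactive leaf a climbs to its li parent,
    # which satisfies C and is kept as a block (labeled flow, as in Figure 3).
    body = Element("sectioning")
    ul = Element("flow", parent=body); body.children.append(ul)
    li = Element("flow", parent=ul); ul.children.append(li)
    a = Element("interactive", parent=li); li.children.append(a)
    print(len(fine_grained(body)))  # 1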
5.3. Composite Block

Composite blocks are usually Web page regions that lie along separation lines. A separation line is a space that goes from one limit of the page to the other without crossing any block. A separation line S in a block is represented by the line formed by the points (x1, y1) and (x2, y2), where y1 = y2 if the line is horizontal and x1 = x2 if it is vertical. The spaces found at the beginning or at the end of the document are omitted.

Algorithm 2 shows the CompositeBlockDetection function, which finds the composite blocks and the flow of a segmentation; a sketch of the rectangle split at a separator follows the listing. It accepts a composite block as input and outputs the W'_BoM graph and the geometric model GM_BoM updated with the new blocks (if any) and including the computed order. We start by finding the composite blocks in the Page block itself, considered as a composite. Two composite blocks are formed on both sides of each separation line (line 12). All simple blocks that are covered by these new blocks are aggregated accordingly and become their children (line 22). The recursion stops when one of two conditions is met: the weight of the block is below the predefined stop condition parameter (pA), or the horizontal or vertical limits of the block are not those of the Page (line 1), i.e., if B.rect.x > Page.rect.x and B.rect.w < Page.rect.w (respectively B.rect.y > Page.rect.y and B.rect.h < Page.rect.h).

Figure 2 shows the separation lines S1_page and S2_page found in the Page block, generating blocks 1, 2 and 3. In the same figure, blocks 1 and 3 are not processed further because their weights are higher than pA, but the same process is applied to block 2. First the horizontal separator S1_2 is discovered, generating the composite blocks 2.1 and 2.2. We assume that the weight of block 2.2 is below the predefined stop condition parameter, so no further processing is needed. In block 2.1, however, two vertical separators S1_2.1 and S2_2.1 are found.

    Data: block b
    Result: W'_BoM and GM_BoM updated
     1  if b limits equal those of Page and b.weight > pA then
     2      Separators <- findSeparatorsIn(b);
     3      foreach s in Separators do
     4          if s is horizontal then
     5              rect1 = {b.rect.x, b.rect.y, b.rect.w, s.y1};
     6              rect2 = {b.rect.x, s.y1, b.rect.w, b.rect.h};
     7          else
     8              rect1 = {b.rect.x, b.rect.y, s.x1, b.rect.h};
     9              rect2 = {s.x1, b.rect.y, b.rect.w, b.rect.h};
    10          end
    11          add rectangles rect1, rect2 to GM_BoM;
    12          create blocks b1, b2;
    13          b1.rect = rect1;
    14          b2.rect = rect2;
    15          add vertices b1, b2 to W'_BoM;
    16          add edge (b, b1) to E;
    17          CompositeBlockDetection(b1);
    18          add edge (b, b2) to E;
    19          CompositeBlockDetection(b2);
    20      end
    21  else
    22      update W'_BoM and GM_BoM to associate the blocks covered by b;
    23  end
    Algorithm 2: Composite Block Detection
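The split at a horizontal separator (lines 4-6 of Algorithm 2) can be made concrete as follows. The pseudocode leaves the height bookkeeping implicit; this sketch makes it explicit, with rectangles as (x, y, w, h) tuples.

    # Split a composite block's rectangle at a horizontal separation line;
    # s_y is the y coordinate shared by both endpoints of the line (y1 = y2).
    def split_at_horizontal(rect, s_y):
        x, y, w, h = rect
        top = (x, y, w, s_y - y)             # from the top edge down to the line
        bottom = (x, s_y, w, h - (s_y - y))  # from the line down to the bottom edge
        return top, bottom

    print(split_at_horizontal((0, 0, 1024, 900), 300))
    # ((0, 0, 1024, 300), (0, 300, 1024, 600))

The vertical case is symmetric, splitting at the shared x coordinate of the separator.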
5.4. Merging Blocks

Once the composite blocks are created, the merging process starts. This process allows obtaining simple blocks whose weight is greater than the predefined stop condition parameter (pA). Two blocks are merged if the following heuristic rules are satisfied (a sketch of this test follows Algorithm 3):

1) Their weights are less than the predefined stop condition parameter.
2) The distance between them is below a predefined distance parameter pD.
3) The blocks are horizontally or vertically aligned, with a tolerance of no more than pD pixels.
4) Alternatively, if they are not aligned, one block's rectangle completely covers the other's.
5) Their labels are not sectioning.

The rules are checked in the given order for efficiency purposes: the first rules are the most discriminating. This process is repeated until no more merges are possible. Then we check whether the proportion of blocks with a weight less than pA is greater than a constant (for instance 75%). If it is the case, all the children of the composite block are removed. If the composite block has only one child, that child is removed.

To illustrate the merging process, let pA = 4, pD = 50 and pND = 100. Figure 5 shows the merging process for block 2.1.2 of an example page; each block has its weight and its label. In Figure 5a, blocks a, b and c are merged because they are aligned and the distance between them is less than pD; the label flow is assigned. The same applies to blocks e and h. However, blocks d and f are too far apart, and blocks f and g are not aligned. Figure 5b shows the result of merging those blocks; in a second round, blocks d and e are merged because their distance is below the parameter pD and they are aligned within the tolerance. Figure 5c shows the merged blocks: block f is contained in block d, so they are merged and the label flow is assigned. Figure 5c shows the final merging; the process stops because the weights of both blocks a and d are greater than the predefined stop condition pA = 4.

Figure 5: Merging Blocks and Labeling

Algorithm 3 presents the details of the merging algorithm. We only consider the composite blocks that have simple blocks as children and whose weight is greater than the predefined stop condition parameter (pA). If this is the case, we try to merge the children.

    Data: composite block b
    Result: W'_BoM and GM_BoM updated
    if b.weight > pA then
        Children <- getChildren(b);
        foreach child in Children do
            if child.weight < pA then
                Siblings <- getSiblings(child);
                foreach sibling in Siblings do
                    if child and sibling are aligned then
                        if distance between child and sibling is less than pD then
                            if the labels of child and sibling are not sectioning then
                                merge sibling with child as child;
                                label child from both labels;
                            end
                        end
                    else if child covers sibling then
                        merge sibling with child as child;
                    end
                end
            end
        end
        if |getChildren(b)| = 1 then
            remove the child of b;
        end
        if the proportion of non-merged small children is greater than 75% then
            remove the children of b;
        end
    else
        remove the children of b;
    end
    Algorithm 3: Merging
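To make the merging test concrete, here is a minimal sketch in Python of rules 1-5, assuming blocks carry a rect (x, y, w, h), a weight and a label. The alignment and distance readings (alignment as similar x or y within the tolerance, distance as the larger of the horizontal and vertical gaps) are one plausible interpretation, not BoM's exact implementation.

    def aligned(r1, r2, tol):
        horiz = abs(r1[1] - r2[1]) <= tol   # similar y: horizontal alignment
        vert = abs(r1[0] - r2[0]) <= tol    # similar x: vertical alignment
        return horiz or vert

    def distance(r1, r2):
        # Gap between the rectangles along x and y (0 if they touch/overlap).
        dx = max(r2[0] - (r1[0] + r1[2]), r1[0] - (r2[0] + r2[2]), 0)
        dy = max(r2[1] - (r1[1] + r1[3]), r1[1] - (r2[1] + r2[3]), 0)
        return max(dx, dy)

    def covers(r1, r2):
        return (r1[0] <= r2[0] and r1[1] <= r2[1] and
                r2[0] + r2[2] <= r1[0] + r1[2] and
                r2[1] + r2[3] <= r1[1] + r1[3])

    def can_merge(b1, b2, pA, pD):
        if b1["weight"] >= pA or b2["weight"] >= pA:        # rule 1
            return False
        if "sectioning" in (b1["label"], b2["label"]):      # rule 5
            return False
        if aligned(b1["rect"], b2["rect"], pD) and \
           distance(b1["rect"], b2["rect"]) < pD:           # rules 2-3
            return True
        return covers(b1["rect"], b2["rect"]) or \
               covers(b2["rect"], b1["rect"])               # rule 4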
Our approach aims at segmenting a Web page without previous knowledge of its content, which allows segmenting different types of Web pages; the heuristic rules are based solely on rules defined in the Web standards, such as content categories. We make no assumption about the text. This can be a weakness, because in some cases analyzing the text is relevant: for instance, two consecutive blocks that talk about different subjects should not be merged. Solving this issue would imply studying the semantics of the block content and is out of the scope of this work.

There are three different implementations of the BoM algorithm: a Ruby application, a Java application and a JavaScript library. The Ruby version is intended as a functional prototype, the Java version targets production environments for the European project SCAPE (http://www.openplanetsfoundation.org/blogs/2014-02-12-scape-qa-tool-technologies-behind-pagelyzer-ii-web-page-segmentation), and the JavaScript version is aimed at the open source community (https://github.com/openplanets/pagelyzer/tree/master/SettingsFiles/js). Introducing concepts and techniques from the computer-vision field of scanned document image segmentation allows for a more complete segmentation, as it contains more useful information for applications than most other segmenters.

6. SEGMENTATION EVALUATION MODEL

Evaluating Web page segmentation algorithms is not an easy task. Usually, each algorithm proposes its own ad-hoc validation mechanism that cannot really be applied to other approaches. This section attempts to close this gap by proposing a number of evaluation metrics that essentially measure how well a generated segmentation maps to a ground-truth segmentation. This can be formulated as a graph matching problem, and we propose a number of metrics based on the generated matching to assess the quality of the generated blocks.

We present our evaluation model in order to measure the quality of a segmentation according to a discrepancy criterion (i.e., to determine how far two segmentations are from one another). The goal of the evaluation model is to compare an automated segmentation of a Web page W with the corresponding ground truth, in order to determine its quality. Both segmentations are organized as non-hierarchical Manhattan layouts; in other words, they are flat segmentations. Our evaluation model is an adaptation to Web pages of the model presented by Shafait et al. [16] for scanned page segmentation evaluation (see Section 6.1). The quality of a segmentation is evaluated by using the block correspondence, which measures to what extent the generated blocks match those of the ground truth. We present the evaluation model adaptation (Section 6.1), the representation of a segmentation (Section 6.2) and the representation of the evaluation (Section 6.4).

6.1. Model Adaptation

In order to adapt to Web pages the model presented by Shafait et al. [16] for scanned page segmentation evaluation, we need to identify the different aspects of both types of documents. Shafait represents a segmentation of scanned document images using a pixel-based representation: each foreground pixel belongs to a zone or region. The evaluated documents (and the ground truth) must have the same dimensions. Their evaluation model defines several performance metrics to evaluate different aspects of the behaviour of a scanned page segmentation in image form; these metrics measure the correspondence of each pair of rectangles in the segmentation and the ground truth. A region (or block) is significant if the amount of foreground pixels associated with it is greater than a parameter. By analogy, Web pages consist of elements and text; in our adaptation, a block is significant if its amount of elements and text is greater than a parameter (see the sketch below). Other features of our model, such as block importance, are intrinsic to Web pages.
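The significance test of our adaptation amounts to a small predicate; the threshold names below are illustrative stand-ins for the unspecified parameter.

    # A block is significant if it covers enough elements and text, by
    # analogy with Shafait et al.'s foreground-pixel threshold.
    def is_significant(block, min_elements=1, min_text_chars=1):
        return (block["ec"] >= min_elements and
                len(block["text"]) >= min_text_chars)

    print(is_significant({"ec": 3, "text": "Latest news"}))  # True
    print(is_significant({"ec": 0, "text": ""}))             # False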
6.2. Representation of a Segmentation

In this section we model a segmentation in order to describe its evaluation. We describe the absolute and normalized representations of a segmentation, as well as the importance of blocks and how it is computed (Section 6.3). We use the notation described in Section 4.2, together with the concepts of page, block and block graph described in Section 4.1.

a) Absolute Representation of a Segmentation: Each block B is associated with its rectangle (B.rect), its label (B.label) and its weight (B.weight). To each B we add three values: the amount of elements it covers (B.ec), the text associated with the block (B.text) in the original page W, and its importance (B.importance). Note that B.ec = |B.elements|. The importance of a block depends on the area covered by its rectangle; Section 6.3 explains how it is computed. An absolute segmentation of the rendered DOM W, using an algorithm A and a set of stop conditions SC, is defined by the following function Φ:

    Φ_A(W, SC) → (W'_A, GM_A)

where W'_A is the block graph and GM_A is a set of rectangles representing the geometric model of the segmentation. Consider W'_A as a rooted, planar and vertex-weighted tree. The root vertex is the Page block and the terminal vertices are the simple blocks. We consider the segmentation as flat, that is, height(Page) ≤ 1. GM_A is the geometric model of the segmentation, consisting of a set of rectangles.

b) Normalized Segmentation Representation: In order to compare two segmentations, we need to normalize the rectangles. Given an absolute segmentation Φ_A, the geometric model of its normalized version NΦ_A fits in an ND × ND square, where ND is a fixed value called the Normalized Document Size. In our experimentation we fixed this value to 100. Thus, if NΦ_A is the normalized segmentation of Φ_A:

    NΦ_A(W, SC) → (NW'_A, NGM_A)    (2)

where NW'_A is the block graph of the normalized segmentation and NGM_A is the normalized geometric model. All the segmentation rectangles are normalized. The Page block rectangle is normalized as:

    NW'_A.Page.rect = {0, 0, ND, ND}

Each block rectangle is then normalized according to the stretch ratio of the page, i.e.

    ∀ b ∈ NW'_A, b.rect.x = ND × b.rect.x / W'_A.Page.rect.w

The other values of the block rectangle (y, w and h) are normalized in the same way.
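The normalization can be sketched in a few lines of Python, assuming that x and w scale by the page width while y and h scale by its height:

    # Rescale a rectangle so that the Page rectangle becomes (0, 0, ND, ND).
    def normalize(rect, page_rect, nd=100):
        x, y, w, h = rect
        _, _, pw, ph = page_rect
        return (nd * x / pw, nd * y / ph, nd * w / pw, nd * h / ph)

    page = (0, 0, 1024, 3000)
    print(normalize(page, page))                  # (0.0, 0.0, 100.0, 100.0)
    print(normalize((512, 300, 256, 600), page))  # (50.0, 10.0, 25.0, 20.0)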
6.3. Block Importance

The regions of a Web page are not all equally important: a block is more important than another if it contains more important information. Usually, important blocks are located in the most visible part of the page, and a good segmentation algorithm must mostly find important blocks. The block importance is obtained from the geometric model of the segmentation, that is, from its spatial features. A segmentation is mapped onto a grid of NP × NP cells, where NP is the Normalized Partition Size. This grid is represented as a matrix IM(NP, NP). Each cell of the matrix (im_ij) is assigned a value representing the importance that a block has if it lies within this area. For instance, with the window spatial features defined by Song et al. [18], the highest importance is assigned to blocks found in the middle of the visible part of a Web page, and a lower importance to blocks found outside this area.

The computed importance of a block is the sum of the cell values obtained by mapping the block rectangle over the grid. The rectangle coordinates are divided by the constant NP; this defines two intervals, one for each dimension. If i and j respectively belong to those intervals, then the cell value im_ij is taken into account. Thus the computed importance of a block B ∈ W'_A.Blocks is:

    computed_importance(B) = Σ_{i,j} im_ij    (3)

where

• i ∈ [round(B.rect.x / NP), round(B.rect.w / NP)] and
• j ∈ [round(B.rect.y / NP), round(B.rect.h / NP)]

In order to uniformize the importance, we define B.importance as the relative importance of a block within its segmentation: the computed importance of each block is divided by the sum of the computed importances of all blocks in the segmentation. Thus the importance of a block B ∈ W'_A.Blocks is:

    B.importance = computed_importance(B) / Σ_{b ∈ W'_A.Blocks} computed_importance(b)    (4)
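A sketch of equations (3) and (4) follows. The grid IM is NP × NP; the upper bounds of the intervals are read here as running to round((x + w) / NP) and round((y + h) / NP), which is one plausible reading of the interval definitions above.

    def computed_importance(rect, im, np_size):
        x, y, w, h = rect
        total = 0.0
        for i in range(round(x / np_size),
                       min(round((x + w) / np_size) + 1, len(im))):
            for j in range(round(y / np_size),
                           min(round((y + h) / np_size) + 1, len(im[0]))):
                total += im[i][j]   # sum the importance cells the block covers
        return total

    def block_importances(rects, im, np_size=10):
        raw = [computed_importance(r, im, np_size) for r in rects]
        s = sum(raw) or 1.0
        return [v / s for v in raw]  # normalized importances sum to 1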
6.4. Representation of the Evaluation

In this section we model the evaluation itself, described in terms of its input and output; the metrics used for measuring the block correspondence are described in Section 6.5. The evaluation is a function that takes two segmentations and four constants as parameters. The two segmentations Φ_G and Φ_P are absolute segmentations, as described in Section 6.2, producing the block graphs W'_G and W'_P. The four parameters are the relative tolerance (t_r), the importance tolerance (t_i), the Normalized Document Size (ND) and the Normalized Partition Size (NP), as defined in Sections 6.2 and 6.3. The evaluation function returns a vector of metrics representing the quality of Φ_P with respect to Φ_G. Equation (5) shows the function:

    evaluate(Φ_G, Φ_P, t_r, t_i, ND, NP) = (text coverage metric, correspondence metrics)    (5)

The quality of a segmentation is measured by the block correspondence, which measures how well the blocks of W'_P match those of W'_G. The block correspondence takes into account the location and geometry of blocks. It allows for detecting which blocks were correctly discovered and which ones raised issues.

6.5. Measuring Block Correspondence

The block correspondence indicates whether the block rectangles of a segmentation match those of the ground truth. Consider two normalized segmentations of a page W: a computed one, NΦ_P, and the ground truth, NΦ_G. The associated normalized block graphs are NW'_P (denoted P in the rest of the section) and NW'_G (denoted G). Figures 6(a) and (b) respectively give an example of G and P.

Figure 6: (a) Ground-truth Segmentation. (b) Computed Segmentation. (c) BCG.

To compute the block correspondence, we build a weighted bipartite graph called the block correspondence graph (BCG). We start with an example and then give the algorithm. As seen in Figure 6(c), the nodes of the BCG are the blocks of P and of G. An edge is added between each couple of nodes n_i and n_j such that the weight w(n_i, n_j) of the edge is equal to the number of underlying HTML elements and text in the intersection of the regions covered by the rectangles of the blocks corresponding to the two nodes. If the block rectangles do not overlap in P and G, no edge is added. Algorithm 4 shows how the BCG is built.

    Data: nodes n_i ∈ G, n_j ∈ P
    Result: edge (n_i, n_j) and its weight (if it applies)
    if n_i.rect is contained in n_j.rect then
        create edge (n_i, n_j);
        w(n_i, n_j) = n_i.htmlcover + n_i.textcover;
    else if n_i.rect contains n_j.rect then
        create edge (n_j, n_i);
        w(n_i, n_j) = n_j.htmlcover + n_j.textcover;
    else
        /* no edge is created */
        w(n_i, n_j) = 0;
    end
    Algorithm 4: Building the BCG

If the blocks in P fit perfectly with the ground-truth blocks in G, then the BCG is a perfect matching: each node in the two components of the graph has exactly one incident edge. If there are differences between the two segmentations, nodes of P or G may have multiple edges. If there is more than one edge incident to a node n in G (resp. in P), n is considered oversegmented (resp. undersegmented).

Using these definitions, we can introduce several measures for evaluating the correspondence of a Web page segmentation algorithm. Intuitively, if all blocks of G are in P, the algorithm has a good quality. If a set of blocks of G is grouped into one block of P, or if one block of G is divided into several blocks of P, then there is an issue with respect to the granularity, but no error. We consider that there is a segmentation error if a block of the ground truth is not found in the computed segmentation, or if blocks were "invented" by the algorithm. The metrics for block correspondence are defined as follows (a sketch of the corresponding counts is given after the list):

1) Correct segmentation, C_c(Φ_A), C_c for short: the number of one-to-one matches between P and G. A one-to-one match is defined by a couple of nodes (n_i, n_j), n_i in P and n_j in G, such that w(n_i, n_j) ≥ t_r, where t_r is a threshold that defines how well a detected block must match to be considered correct. For instance, in Figure 6 there is an edge between node 2 and node B and another one between node 2 and node C. However, as the weight w(2, C) is less than t_r and the weight w(2, B) is greater than t_r, B is considered a correct block. The metric value for the example is C_c = 2. C_c is the main metric for measuring the quality of a segmentation.

2) Oversegmented blocks, C_o(Φ_A), C_o for short: the number of G nodes having more than one edge. This metric measures how much the segmentation produced too-small blocks; those small blocks nevertheless fit inside a block of the ground truth. In the example of Figure 6, node 6 of the ground truth is oversegmented in the proposed segmentation; the metric value is C_o = 2 because nodes 6 and 2 are both oversegmented.

3) Undersegmented blocks, C_u(Φ_A), C_u for short: the number of P nodes having more than one edge. The same as above, but for big blocks, in which blocks of the ground truth fit. For instance, in Figure 6, node D of the proposed segmentation is undersegmented with respect to the ground truth, and the value of the metric is C_u = 1.

4) Missed blocks, C_m(Φ_A), C_m for short: the number of G nodes that match no node of P. This metric measures how many blocks of the ground truth were not detected by the segmentation. One example is node 3 in Figure 6; the value of the metric is C_m = 1.

5) False alarms, C_f(Φ_A), C_f for short: the number of P nodes that match no node of G. This metric measures how many blocks were "invented" by the segmentation. For instance, in Figure 6 node I has no correspondent in the ground truth, making the metric value C_f = 1.

Each metric C_x has a version, noted IC_x, that takes the importance of the blocks into account; in other words, C_x can be seen as the metric when all blocks have the same importance. C_c is a positive measure, while C_m and C_f are negative measures. C_o and C_u are something in the middle, as they count "not too serious" errors: the found blocks could match the ground truth if they were aggregated or split. Note that the defined measures cover all the possible cases when considering the matching between G and P. Thus, the evaluate function returns a vector made of all the computed metrics, i.e.

    evaluate(Φ_G, Φ_P, t_r, t_i, ND, NP) = (TC, C_x, IC_x)    (6)

To evaluate the quality of the segmentation we define a score C_q as the total number of acceptable blocks discovered, i.e. C_q = C_c + C_o + C_u and IC_q = IC_c + IC_o + IC_u. Note that C_m is the complement of C_q, with C_q + C_m = |G|.
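The counts above can be sketched as follows, assuming the BCG is given as a list of (g_node, p_node, weight) edges. For simplicity, C_c here counts every edge whose weight reaches t_r, a simplification of the one-to-one matching described in the definition.

    from collections import defaultdict

    def correspondence(edges, g_nodes, p_nodes, t_r):
        deg_g = defaultdict(int)
        deg_p = defaultdict(int)
        for g, p, w in edges:
            deg_g[g] += 1
            deg_p[p] += 1
        cc = sum(1 for g, p, w in edges if w >= t_r)   # correct matches
        co = sum(1 for g in g_nodes if deg_g[g] > 1)   # oversegmented
        cu = sum(1 for p in p_nodes if deg_p[p] > 1)   # undersegmented
        cm = sum(1 for g in g_nodes if deg_g[g] == 0)  # missed blocks
        cf = sum(1 for p in p_nodes if deg_p[p] == 0)  # false alarms
        return {"Cc": cc, "Co": co, "Cu": cu, "Cm": cm, "Cf": cf,
                "Cq": cc + co + cu}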
In the example, the metric value is Co = 2 because nodes 6 and 2 are both over-segmented. 3) Undersegmented blocks Cu (ΦA ), Cu for short. The number of P nodes having more than one edge. The same as above, but for big blocks, where blocks of the ground truth fit in. For instance, on Fig. 6, node D of the proposed segmentation is undersegmented with respect to the ground truth, and the value for the metric is Cu = 1. 4) Missed blocks Cm (ΦA ), Cm for short. The number of G nodes that have no match with any in P. This metric measures how many blocks of the ground truth are not To evaluate the quality of the segmentation we define a score Cq , as the total number of acceptable blocks discovered, i.e. Cq = Cc + Co + Cu and ICq = ICc + ICo + ICu . Note that Cm is the complement of Cq where Cq + Cm = |G|. 7. P ROPOSED S OLUTION FOR M IGRATION We propose to segment an HTML4 Web page, with the appropriate predefined stop condition parameter so that the resulting blocks will correspond to the semantic tags in the HTML5 format. Then we compare the labels found by the segmentation with a manually labeled segmentation as ground truth. If both versions are similar the migration is achieved. If they are different we measure how discrepant they are in order to determine the causes and the possible actions to improve the migration method. Finally, migration is evaluated in order to measure whether it has affected the rendering of the Web page. We use for this an adaptation of the framework of Section 6.4. In the following section we describe the experiments to evaluate our migration approach. 8. E XPERIMENTS In this section we present the setup of experiments, their design and the measures used. 81 IV Simposio Científico y Tecnológico en Computación / SCTC 2016 / ISBN: 978-980-12-8407-9 Universidad Central de Venezuela, Caracas, Venezuela - 09 al 11 de mayo de 2016 Figure 6: (a) Ground-truth Segmentation. (b) Computed Segmentation. (c) BCG. Figure 7: Labels for the Manual and Computed Segmentation by Brian Solis [17], “The Conversation Prism”. It depicts the social media landscape from ethnography point of view. In this work, we considered the five most common of these categories, namely Blog, Forum, Picture, Enterprise and Wiki. For each category, a set of 25 sites have been selected using Google search to find the pages with the highest PageRank. Within each of those sites, one page is crawled 6 . The GOSH collection contains 125 pages. 8.1. Experimentation Design a) GOSH Collection: The dataset holds the offline version of Web pages, together with their segmentations obtained by the different algorithms (including the ground truth), organized in categories. Within a collection, each page is rendered with different rendering engines with different predefined stop conditions values. To each quadruple (page, render engine, algorithm, predefined stop condition) corresponds a segmentation performed on that page, and rendered by that engine, using one algorithm with a predefined stop condition. Web pages are taken from the GOSH (GOogle SearcH) collection that we built. It is described as follows. Web pages in this collection are selected with respect to their category. This selection is based in the categorization made b) The MIG5 Collection: The MIG5 collection is a subset of the GOSH collection presented in previous section. It only contains Web pages in HTML4 format. We keep the same categories organization (blog, enterprise, forum, picture and wiki) in this collection. 
6 https://github.com/asanoja/web-segmentation-evaluation/tree/master/dataset 82 Sección 2: Web Archiving Table I: MIG5 Pages by Categories Category blog enterprise forum picture wiki total determine the label of each block. These rules assign labels depending on the position of a block and its relationship to the others blocks. A block is treated differently if it resides in the visible part of the page (i.e. the part of the page visible without using scrolling). For instance, a block is labeled as header if it is the first block found vertically (on top of the page), it resides in the visible part of the page, it is a simple block and it has siblings. A block with the same characteristics but outside of the visible area and at the bottom of the page is labeled as footer. For the labels section and nav, two additional conditions are considered. If the proportion of elements a block covers is greater than a constant, it can be considered a section. If the proportion of hyperlinks (i.e. elements) a block covers is greater than a constant, it can be considered a nav. Algorithm 5 describe the label assignment method for all possible cases. Pages 5 9 14 7 5 40 c) Experiments: The first experiment is devoted to measure to what extent the labels found with the Block-o-Matic segmentation algorithm match to those in a ground truth of manually labeled blocks. The second experiment aims of measuring if including the semantic elements affects the rendering of the page. The block correspondence method, as presented in Section 6.5, is used for evaluating the correctness of the migration. The segmentation of the original Web page is used as a ground truth, while the segmentation of the migrated Web page is the evaluated segmentation. 8.5. Measuring Labels The manual segmentation ΦG and the computed segmentation ΦP are formal defined in Section 3. The manual segmentation, produced by assessors, takes the rendered DOM of a Web page (W ) in HTML4 file format and produces the WG0 block graph. The computed segmentation takes the same rendered DOM (W ) and produces the WP0 block graph. We present the labels of a segmentation as a list of labels (labels(WA0 )). Using the intersection of both list we get the amount of correct labels found by the segmentation with respect to the ground truth. The correct_labels measure is defined as: 8.2. Manual-design-Of-Blocks Tool In order to build a Web page segmentation grountruth we develop the tool MoB (Manual-design-Of-Blocks). It is conceived as a browser extension and expose functionalities to expert users for creating a manual segmentations 7 . Users can create blocks based on Web page elements. They can merge blocks, navigate into the element hierarchy to produce a block graph 8 (c.f Section 2), or produce a flat segmentation (i.e. leaves in the block tree). These segmentations are stored in a repository 9 for the evaluation. correct_labels(WG0 , WP0 ) = labels(WG0 ) ∩ labels(WP0 ) Figure 7, shows the labels for the manual and computed segmentation. The list of labels from the manual segmentation is: { header, nav, aside, article, aside, article, footer}. The list of labels for the computed segmentation is: { header, aside, article, aside, article, footer}. For simplicity, we denote the labels with one letter. Thus, the list of labels for both example segmentations are: 8.3. Ground Truth Building Table I shows the organization of the MIG5 collection. It is composed of 40 pages organized by category. The MoB tool (cf. Section 8.2) is used to annotate the blocks. 
Besides specifying the blocks, assessors assign a label to each block. Labels correspond to a subset of the semantic elements defined in the HTML5 specification (header, footer, section, article, nav, aside). The stop condition for all the experiments is set to pA = 6: through experiments, we noticed that this value generates blocks likely to correspond to template elements. The separation parameter is set to pD = 30, because these regions can usually be very close to each other.

8.4. Assigning Labels

The BoM labeling method is modified to support the semantic elements as labels. Heuristic rules are defined in order to determine the label of each block. These rules assign labels depending on the position of a block and its relationship to the other blocks. A block is treated differently if it resides in the visible part of the page (i.e., the part of the page visible without scrolling). For instance, a block is labeled as header if it is the first block found vertically (on top of the page), it resides in the visible part of the page, it is a simple block and it has siblings. A block with the same characteristics, but outside of the visible area and at the bottom of the page, is labeled as footer. For the labels section and nav, two additional conditions are considered: if the proportion of elements a block covers is greater than a constant, it can be considered a section, and if the proportion of hyperlink elements a block covers is greater than a constant, it can be considered a nav. Algorithm 5 describes the label assignment method for all possible cases; a condensed sketch follows the listing.

    Data: block b
    Result: B.label
    if b.weight > pA then
        if b is in the visible part of the page then
            if b is the first block on top then
                if the proportion of elements covered by b is greater than a constant then
                    if b is composite then B.label = SECTION;
                    else if b has no siblings then B.label = SECTION;
                    else B.label = HEADER; end
                else
                    B.label = HEADER;
                end
            else if the proportion of elements covered by b is greater than a constant then
                if b is composite then B.label = SECTION;
                else B.label = ARTICLE; end
            else if the proportion of hyperlinks covered by b is greater than a constant then
                B.label = NAV;
            else if b is in the middle/center of the page then
                B.label = ARTICLE;
            else if b is the last block at the bottom then
                B.label = FOOTER;
            else if b is at the left/right of the page then
                B.label = ASIDE;
            else
                B.label = ARTICLE;
            end
        else if b is the last block at the bottom then
            B.label = FOOTER;
        else
            B.label = ARTICLE;
        end
    end
    Algorithm 5: Label Assignment
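Condensed into Python, Algorithm 5 reads as below. The block b is represented here as a dict of geometry flags and coverage ratios, and the thresholds elem_ratio and link_ratio stand in for the unspecified constants; none of these names are BoM's actual API.

    def assign_label(b, pA=6, elem_ratio=0.5, link_ratio=0.5):
        if b["weight"] <= pA:
            return None                       # too small: no semantic label
        if b["in_visible_part"]:
            if b["first_on_top"]:
                if b["elem_cover"] > elem_ratio:
                    # a big first block is a section, unless it is a simple
                    # block with siblings, which is a header
                    return "SECTION" if (b["composite"] or
                                         not b["has_siblings"]) else "HEADER"
                return "HEADER"
            if b["elem_cover"] > elem_ratio:
                return "SECTION" if b["composite"] else "ARTICLE"
            if b["link_cover"] > link_ratio:
                return "NAV"
            if b["centered"]:
                return "ARTICLE"
            if b["last_at_bottom"]:
                return "FOOTER"
            return "ASIDE" if b["at_side"] else "ARTICLE"
        return "FOOTER" if b["last_at_bottom"] else "ARTICLE"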
8.5. Measuring Labels

The manual segmentation Φ_G and the computed segmentation Φ_P are formally defined in Section 6.4. The manual segmentation, produced by assessors, takes the rendered DOM of a Web page (W) in HTML4 format and produces the W'_G block graph. The computed segmentation takes the same rendered DOM (W) and produces the W'_P block graph. We present the labels of a segmentation as a list of labels (labels(W'_A)). Using the intersection of both lists, we get the amount of correct labels found by the segmentation with respect to the ground truth. The correct_labels measure is defined as:

    correct_labels(W'_G, W'_P) = labels(W'_G) ∩ labels(W'_P)

Figure 7 shows the labels for the manual and the computed segmentation. The list of labels from the manual segmentation is {header, nav, aside, article, aside, article, footer}; the list of labels from the computed segmentation is {header, aside, article, aside, article, footer}. For simplicity, we denote the labels with one letter. Thus, the lists of labels for both example segmentations are:

• labels(W'_G) = {H, N, D, A, D, A, F}
• labels(W'_P) = {H, D, A, D, A, F}

Figure 7: Labels for the Manual and Computed Segmentation

The migration of Figure 7 is not perfect, since the segmentation did not find the block labeled nav; instead, it found the block labeled header covering the corresponding region of the page. We measure this error with the Levenshtein distance [9]:

    error(W'_G, W'_P) = LD(labels(W'_G), labels(W'_P))

where LD is the Levenshtein distance. For the example, the error is 1: it is sufficient to insert one label (N) into the computed segmentation label list to produce the list of the ground truth. We also represent the results in terms of precision and recall over the label lists:

    precision = |correct_labels(W'_G, W'_P)| / |labels(W'_P)|
    recall = |correct_labels(W'_G, W'_P)| / |labels(W'_G)|

8.6. Measuring Rendering Errors

In order to measure to what extent the migration affects the rendering of the migrated Web page, we use the correspondence measures defined in Section 6.5; we do not consider the metric versions with importance. We have two rendered DOMs, W and W5, where W is the rendered DOM of a Web page in HTML4 format and W5 is the rendered DOM of the migrated Web page. They respectively produce the block graphs W'_P and W5'_P. Setting the parameters t_r = 0, t_i = 0, ND = 100 and NP = 10, we get the correspondence measures. We choose these parameters because we want to evaluate all blocks, so we consider all of them significant and all equally important. If we find only correct blocks, then the migration may be perfect: if both versions produce the same segmentation, there is a high probability that their renderings are the same. If an oversegmentation or an undersegmentation occurs, the inclusion of the semantic elements in W5 has modified the size and position of the blocks, and therefore the segmentations differ. Missed blocks and false alarms are possible when the rendering changes, slightly displacing content in the migrated version.
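Before turning to the results, the label measures of Section 8.5 can be sketched in Python, reading the list intersection as a multiset intersection:

    from collections import Counter

    def correct_labels(truth, computed):
        inter = Counter(truth) & Counter(computed)  # multiset intersection
        return sum(inter.values())

    def levenshtein(a, b):
        # Standard dynamic-programming edit distance over label lists.
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            cur = [i]
            for j, y in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (x != y)))
            prev = cur
        return prev[-1]

    truth = list("HNDADAF")      # header, nav, aside, article, ...
    computed = list("HDADAF")
    print(correct_labels(truth, computed))  # 6
    print(levenshtein(truth, computed))     # 1 (insert the missing N)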
9. Results

In this section we present the results of applying our approach to Web pages migrated from HTML4 format to HTML5 format. We show how we measure the labels found by the algorithm compared to the ground truth, and the rendering errors, using the evaluation model presented in Section 6.4.

a) Measuring Labels: Table II shows the average values of the metrics defined in Section 8.5 for the MIG5 collection, separated by category. Column CL1 represents the correct label measure (correct_labels(W'G, W'P)). Column CL2 represents the number of labels in a segmentation (|labels(W'G)|). Column CL3 represents the labeling error (error(W'G, W'P)). The last two columns represent the precision and recall measures.

Table II: Average Values for Correct, Expected Labels and Error for the MIG5 Collection

    category    CL1   CL2   CL3   prec  rec
    blog        2.50  3.50  2.00  0.28  0.40
    enterprise  2.22  3.55  2.38  0.37  0.60
    forum       3.00  3.53  1.44  0.14  0.17
    picture     2.55  3.00  1.55  0.14  0.17
    wiki        2.20  3.00  1.90  0.26  0.36

In general, BoM produces a list of labels similar to the ground truth. On average it adds 1.85 unexpected labels. This is probably due to the introduction of semantic elements that affect the segmentation and the stop condition, producing smaller blocks than expected. For instance, for a blog post with two paragraphs, labeled as a whole in the ground truth, each paragraph becomes a block in the migrated page, generating one additional unexpected label. Interestingly, both renderings look the same but the segmentations differ. The forum category presents the lowest error rate because, in general, the question/response region of the page is detected in both segmentations as one block labeled as article. The worst performance is for the enterprise category, because this type of page is structured with complex navigation and main content, and the probability of mislabeling is high.

Table II also reports the precision and recall metrics; Figure 8 shows them graphically. The BoM algorithm has a high precision for the forum and picture categories. As mentioned earlier, both types of pages produce small and simple lists of labels, while the labeling of pages in the other categories is more complex, hence the lower precision. However, all results present high recall values, indicating that the algorithm finds enough good labels, but with a considerable error rate.

Figure 8: Precision and Recall for the MIG5 Collection

b) Measuring Rendering Errors: Table III shows the average correspondence metrics, by category, for the MIG5 collection. The values of the Cq metric show that the performance of the algorithm on both versions (original and migrated) is good.

Table III: Correspondence Metrics for the MIG5 Collection with tr = 0.1 and ti = 1

    category    Cc    Co    Cu    Cm    Cf    Cq    GTB
    blog        6.50  0.50  0.00  0.00  0.50  7.00  7.00
    enterprise  4.00  0.33  0.33  1.11  2.77  4.67  6.45
    forum       3.41  0.59  0.41  2.11  1.29  4.41  6.59
    picture     2.71  1.00  0.29  2.00  0.71  4.00  6.71
    wiki        6.00  0.00  0.00  0.60  0.40  6.00  6.60

However, there are some missed blocks, particularly in the enterprise, forum and picture categories, caused by blocks shifting due to rendering changes; in both cases, though, the formatted content displayed is the same. The blog and wiki categories present the best performance. The regions in these types of pages are simple, and the position and order of blocks are standard. The regions are well separated, making it easy for segmentation algorithms like BoM to detect correct labels. For instance, almost all pages in these categories start with a header followed by a navigation block, then an aside at the left, the main article, and the footer at the bottom of the page.
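Purely as an illustration of how such block shifts surface in the correspondence measures, the following sketch classifies the blocks of two segmentations by bounding-box overlap. The geometry, the overlap test and the threshold are assumptions made for this example; the actual measures (Cc, Co, Cu, Cm, Cf, Cq) are defined in Section 6.5.

from dataclasses import dataclass

@dataclass
class Box:
    x: int
    y: int
    w: int
    h: int

def overlap(a: Box, b: Box) -> float:
    # Intersection area divided by the area of block a.
    ix = max(0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))
    iy = max(0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))
    return (ix * iy) / float(a.w * a.h) if a.w and a.h else 0.0

def classify(ground_truth, computed, threshold=0.8):
    # Count matched ground-truth blocks, missed blocks and false alarms.
    matched = sum(1 for g in ground_truth
                  if any(overlap(g, c) >= threshold for c in computed))
    missed = len(ground_truth) - matched
    false_alarms = sum(1 for c in computed
                       if all(overlap(c, g) < threshold for g in ground_truth))
    return matched, missed, false_alarms

# A block displaced by whitespace added in the migrated page loses overlap
# with its ground-truth counterpart: one correct match becomes one missed
# block plus one false alarm, exactly the effect discussed above.
gt = [Box(0, 0, 800, 100), Box(0, 100, 800, 500)]
mig = [Box(0, 0, 800, 100), Box(0, 220, 800, 500)]   # second block shifted down
print(classify(gt, mig))   # (1, 1, 1) with the assumed threshold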
10. Perspectives and Outlook

In this paper we presented our approach to block-based migration of Web pages from HTML4 format to HTML5 format. Using segmentation, we produce a migrated version that conforms to the HTML5 specification. We analyzed how the algorithm assigns labels to blocks in comparison to a ground truth of manually labeled segmentations. The rendering errors were measured using the block correspondence metrics defined in Section 6.5. The results show that, in the context of digital preservation, migrating Web pages from one format to another is possible using the BoM Web page segmentation algorithm, minimizing the need for emulation in Web archives. We show that there is no data loss in the process and no important changes in the rendering (few false alarms). However, the segmentation is affected by the semantic tags: some browsers have no default style for these elements, so the algorithm treats them as invisible or invalid elements and ignores them. The evaluation model presented in Section 6.4 is very helpful for measuring performance and detecting rendering errors. The parameters and the stop conditions of the algorithm could be adjusted by category (using machine learning techniques) to obtain better performance depending on the page category. This is left as future work. This work focuses on the migration of the rendered version of a Web page; as future work it would be interesting to include in the analysis other components of Web pages, such as JavaScript, CSS1 and CSS2. We need to ensure that all dependencies of the migrated version, and its accessibility, conform to the new format. There are still challenges to overcome. Our approach gives insights into the upcoming issues raised by the migration of Web content in the context of Web preservation.
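As a pointer to what the migration step can look like in practice, here is a small sketch that wraps labeled blocks in HTML5 semantic elements using BeautifulSoup. The sample markup, CSS selectors and label mapping are hypothetical, chosen only for illustration; the actual migration is driven by the BoM segmentation, not by hand-written selectors.

from bs4 import BeautifulSoup

html4 = """
<html><body>
  <div id="top">My site</div>
  <div id="menu"><a href="/">home</a> <a href="/about">about</a></div>
  <div id="main"><p>First paragraph.</p><p>Second paragraph.</p></div>
  <div id="bottom">Copyright 2016</div>
</body></html>
"""

# Hypothetical mapping: block (identified here by a selector) -> semantic label.
blocks = {"#top": "header", "#menu": "nav", "#main": "article", "#bottom": "footer"}

soup = BeautifulSoup(html4, "html.parser")
for selector, label in blocks.items():
    element = soup.select_one(selector)
    if element is not None:
        element.wrap(soup.new_tag(label))  # enclose the block in its semantic tag

print(soup.prettify())

Note that wrapping blocks in new elements is precisely the kind of change that can introduce the small whitespace and position shifts measured above.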
References

[1] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma. Extracting Content Structure for Web Pages Based on Visual Representation. In Proceedings of the 5th Asia-Pacific Web Conference on Web Technologies and Applications, APWeb'03, pages 406-417, Xi'an, China, 2003. Springer-Verlag.
[2] D. Chakrabarti, R. Kumar, and K. Punera. A Graph-Theoretic Approach to Webpage Segmentation. In Proceedings of the 17th International Conference on World Wide Web, pages 377-386, Beijing, China, 2008. ACM.
[3] J. Garrett. Preserving Digital Information. Technical report, Commission on Preservation and Access and the Research Libraries Group, 1996.
[4] A. N. Jackson. Formats Over Time: Exploring UK Web History. CoRR, abs/1210.1714, 2012.
[5] C. Kohlschütter and W. Nejdl. A Densitometric Approach to Web Page Segmentation. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 1173-1182, New York, NY, USA, 2008. ACM.
[6] B. Laws. Seriously, Another Format? You Must Be Kidding. CSE News, vol. 36, no. 2, p. 41, 2013.
[7] X. Liu, H. Lin, and Y. Tian. Segmenting Webpage with Gomory-Hu Tree Based Clustering. Journal of Software, vol. 6, no. 12, pp. 2421-2425, Dec. 2011.
[8] J. U. Mahmud, Y. Borodin, and I. V. Ramakrishnan. CSurf: A Context-Driven Non-Visual Web-Browser. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 31-40, New York, NY, USA, 2007. ACM.
[9] G. Navarro. A Guided Tour to Approximate String Matching. ACM Computing Surveys, vol. 33, no. 1, pp. 31-88, Mar. 2001.
[10] Z. Nie, J.-R. Wen, and W.-Y. Ma. Webpage Understanding: Beyond Page-Level Search. SIGMOD Record, vol. 37, no. 4, pp. 48-54, Mar. 2009.
[11] S. H. Park, N. Lynberg, J. Racer, P. McElmurray, and E. A. Fox. HTML5 ETDs. In Proceedings of the International Symposium on Electronic Theses and Dissertations, Austin, TX, USA, 2010.
[12] Z. Pehlivan, M. Ben-Saad, and S. Gançarski. Vi-DIFF: Understanding Web Pages Changes. In Proceedings of the 21st International Conference on Database and Expert Systems Applications: Part I, DEXA'10, pages 1-15, Berlin, Heidelberg, 2010. Springer-Verlag.
[13] S. Pfeiffer. The Definitive Guide to HTML5 Video. Apress, Berkeley, CA, USA, 1st edition, 2010.
[14] D. S. H. Rosenthal, T. Lipkis, T. Robertson, and S. Morabito. Transparent Format Migration of Preserved Web Content. D-Lib Magazine, vol. 11, no. 1, 2005.
[15] M. B. Saad and S. Gançarski. Using Visual Pages Analysis for Optimizing Web Archiving. In Proceedings of the 2010 EDBT/ICDT Workshops, EDBT '10, New York, NY, USA, 2010. ACM.
[16] F. Shafait, D. Keysers, and T. Breuel. Performance Evaluation and Benchmarking of Six Page Segmentation Algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 6, pp. 941-954, 2008.
[17] B. Solis. The Conversation Prism, 2014. https://conversationprism.com.
[18] R. Song, H. Liu, J.-R. Wen, and W.-Y. Ma. Learning Block Importance Models for Web Pages. In Proceedings of the 13th International Conference on World Wide Web, WWW '04, pages 203-211, New York, NY, USA, 2004. ACM.
[19] Y. Y. Tang and C. Y. Suen. Document Structures: A Survey. International Journal of Pattern Recognition and Artificial Intelligence, vol. 8, no. 5, pp. 1081-1111, 1994.
[20] J. van der Hoeven. Emulation for Digital Preservation in Practice: The Results. The International Journal of Digital Curation, vol. 2, no. 2, pp. 123-132, December 2007.
[21] A. S. Vargas. Web Page Segmentation, Evaluation and Applications. PhD thesis, Université Pierre et Marie Curie - Paris VI, 2015.
[22] O. Wu, Y. Chen, B. Li, and W. Hu. Evaluating the Visual Quality of Web Pages Using a Computational Aesthetic Approach. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM '11, pages 337-346, Hong Kong, China, 2011. ACM.
[23] Y. Xiao, Y. Tao, and Q. Li. Web Page Adaptation for Mobile Device. In Proceedings of the 4th International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM '08, pages 1-5, Dalian, China, 2008.