Journal "Software Engineering"
a journal on theoretical and applied science and technology
ISSN 2220-3397

Issue No. 11, 2017

DOI: 10.17587/prin.8.490-503
Methods and Means for Monitoring Publications in Mass Media
V. A. Vasenin, vasenin@msu.ru, Moscow State University, Moscow, 119234, Russian Federation, M. D. Dzabraev, dzabraew@gmail.com, ISTINA Information System, Moscow, 119192, Russian Federation
Corresponding author: Vasenin Valery A., Professor, Moscow State University, Moscow, 119192, Russian Federation, E-mail: vasenin@msu.ru
Received on August 21, 2017
Accepted on September 07, 2017

Many applications today need to extract data from web sites. The first step in solving this task is to extract all URLs from the web site being analyzed. Modern web sites are usually interactive: the site listens to events generated by the user and responds to them. The most important such event is a click of the left mouse button. To obtain the most complete collection of URLs, one has to click a certain sequence of buttons on the web page, since a click may cause a new block containing new URLs to be inserted dynamically. In other words, one has to develop an algorithm that emulates the actions of a user. This article presents model representations, algorithms, and software that emulate a user's mouse clicks. It also presents a business process model for the algorithms and software that navigate within a page automatically. Web pages may contain buttons of two types: clicking a button either opens a new page or modifies the current page by evaluating JavaScript. The navigation algorithm ignores buttons of the first type and deals only with buttons of the second type. The algorithm presented in this article is intended to work under the following assumptions. The implementation should automatically detect buttons of the second type on the page and automatically choose the sequence of buttons to click. The algorithm takes into account that, as a consequence of a click, old buttons may disappear and new buttons may appear. If a button was present and then disappeared without being clicked, the implementation keeps in memory a path by which it can later return to the state in which that button was present and click it.
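
The traversal idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the browser is replaced by a hypothetical in-memory `ToyPage` state machine, the names `ToyPage`, `replay`, and `collect_urls` are assumptions made for illustration, and page states are identified by a plain label rather than by anything derived from a real DOM. The sketch only shows how remembering the click path to each state lets the explorer return to a button that disappeared before it was clicked.

```python
from collections import deque


class ToyPage:
    """Hypothetical stand-in for a browser page, modeled as a state machine."""

    def __init__(self, urls, transitions, start="s0"):
        self.urls = urls                # state -> set of URLs visible in it
        self.transitions = transitions  # state -> {button: next state}
        self.start = start
        self.state = start

    def reload(self):
        """Return to the initial state, as a fresh page load would."""
        self.state = self.start

    def buttons(self):
        """Buttons of the second type currently present on the page."""
        return sorted(self.transitions.get(self.state, {}))

    def click(self, button):
        self.state = self.transitions[self.state][button]

    def visible_urls(self):
        return set(self.urls.get(self.state, ()))


def replay(page, path):
    """Reload the page and repeat a remembered sequence of clicks."""
    page.reload()
    for button in path:
        page.click(button)


def collect_urls(page):
    """Breadth-first exploration of page states via remembered click paths."""
    page.reload()
    found = page.visible_urls()
    seen_states = {page.state}
    queue = deque([[]])  # each entry is a click path from the initial state
    while queue:
        path = queue.popleft()
        replay(page, path)
        for button in page.buttons():
            # Clicking may remove sibling buttons, so restore the state first.
            replay(page, path)
            page.click(button)
            found |= page.visible_urls()
            if page.state not in seen_states:
                seen_states.add(page.state)
                queue.append(path + [button])
    return found


# Toy example: clicking "more" in s0 makes the "tab" button disappear.
urls = {"s0": {"/a"}, "s1": {"/a", "/b"}, "s2": {"/c"}, "s3": {"/d"}}
transitions = {"s0": {"more": "s1", "tab": "s2"}, "s1": {"older": "s3"}}
result = collect_urls(ToyPage(urls, transitions))
```

In this toy page the `tab` button is gone once `more` has been clicked, yet its URL is still collected because the explorer replays the stored path to the state where `tab` was present. A real implementation would additionally need a way to detect clickable elements and to decide when two page states are the same, which this sketch deliberately leaves out.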

Keywords: data extraction, web, readability, web-site traversal, web-page traversal, JavaScript, Firefox
pp. 490–503
For citation:
Vasenin V. A., Dzabraev M. D. Methods and Means for Monitoring Publications in Mass Media, Programmnaya Ingeneria, 2017, vol. 8, no. 11, pp. 490–503.