Oct 12 2011

Improve FFDA of Web Servers via a Rule-Based Logging Approach

Log files are widely exploited by Field Failure Data Analysis (FFDA) for system evaluation and analysis in several domains. The level of trust in FFDA results is strongly tied to the quality of the logged information. However, log production is a developer-dependent, error-prone process. It is left to the late stages of the development cycle and it lacks a systematic approach.

Arbitrary semantics, duplicated or misleading data, and the lack of important error data compromise the ability to identify actual failures from collected logs. Issues raised by traditional logging are exacerbated in the case of complex systems. These usually integrate many software items (e.g., operating system, middleware layers, and application components) executed in a distributed fashion. The lack of a standardized logging solution across vendors makes these items log using specific formats and without any form of cooperation.

The resulting field databases are thus heterogeneous, inaccurate and highly redundant. Several works address log issues by proposing manipulation techniques (e.g., filtering, coalescence) to be applied to the field databases as they are. Pre-analysis processing is essential to reduce the amount of useless information. Even so, logs may miss important failure data.

Our experience with the popular Apache Web Server leads us to claim that missing log entries can affect FFDA results much more than useless or redundant ones. In fact, we observed that about 6 out of every 10 actual failures leave no trace in Apache logs. A complete failure database is the key to effective FFDA. We claim that this can be achieved only by improving log production rather than analysis techniques.

Our driving idea is to devote a design effort to logs. We propose a three-phase approach: (i) define the evidence the analyst expects from logs; (ii) design what and where to log in order to provide the required evidence during the system operational time; (iii) deliver the design results in terms of logging rules that must be followed at coding time and that guarantee exhaustive logs. These rules are intended to support FFDA analyses and to ease third-party processing (e.g., error-data extraction, coalescence). The paper presents the principles underlying our proposal and shows its benefits by means of a practical experience with the Apache Web Server.

The aim of this experimentation is twofold: (i) to evaluate the logging capabilities of a widely used software system; (ii) to compare the results with those achievable using the proposed approach. To this aim, we instrumented the Apache Web Server with a set of logging rules designed by focusing on timing failures (e.g., component crash or hang).

Results show that the proposed approach significantly increases log coverage with respect to this type of failure. The rest of the paper is organized as follows. After describing related work (Section II), we present our preliminary log design experience (Section III). Then we describe the support tool enabling on-line failure data generation (Section IV) and the experience with the Apache Web Server (Sections V and VI). Section VII concludes the work.

II. RELATED WORK

Recent years have been characterized by a proliferation of FFDA studies on several classes of systems, such as commodity operating systems and supercomputers. These studies typically adopt log files as the primary source of failure data. The increasing use of FFDA for dependability analysis has also encouraged the development of software packages for automating FFDA phases (e.g., data collection, data coalescence and analysis). An example is MEADEP, which provides a data pre-processor for converting data in various formats, and a data analyzer for graphical data presentation and parameter estimation.

A tool for on-line log analysis has also been presented. It defines a set of rules to model and correlate log events at runtime, leading to faster recognition of problems. The definition of rules, however, still depends on the log contents and on the analyst's skills.

Even though it is clear that log files can provide useful and detailed insight into the dependability behaviour of real-world systems, several works underline log deficiencies, such as inaccuracy (e.g., non-signalled reboots) or the provision of ambiguous information. Recent contributions have started to address these problems. A proposal for a new generation of log files has been put forward, in which several recommendations are introduced to improve log expressiveness by enriching the log format. A metric is also proposed to measure the information entropy of log files, in order to compare different logging solutions.

Another proposal is the IBM Common Event Infrastructure, introduced mainly to reduce the time needed for root cause analysis. It offers a consistent, unified set of APIs and infrastructure for the creation, transmission, persistence and distribution of log events, formatted according to a well-defined format. To conclude, recent research efforts on log-based dependability analysis mainly address format heterogeneity issues, while crucial decisions about log semantics and production are still left to developers, late in the development cycle.

III. RULE-BASED LOGGING

We discuss the key aspects of our logging approach, already described in previous work, and the way it has been applied to our case study.

After presenting the adopted system model, we describe the principles underlying our proposal.

A. System Model

A model is used to describe the main components of a system along with the interactions among them. This is a technology-independent way (i) to understand where to place logging mechanisms within the source code and (ii) to pursue a suitable means to prove the effectiveness of the logged information. Our proposal considers two types of components, i.e., entities and resources. The first is an active unit (i) encapsulating executable code (e.g., a code module, a shared library) and (ii) providing services.

The second is a passive unit, such as a file, a shared memory, or a socket. Entities provide complex services by means of interactions with other entities or resources of the system.

The proposed definitions provide general concepts, which can be tailored to designers' needs. For example, an entity may model an entire OS process or a package of code, independently of the process executing it. Furthermore, designers can deliberately decide to leave parts of the system un-modelled, thus focusing on specific entities (e.g., application ones).

B. Approach

The proposed approach consists of three phases (Fig.1). The requirements specification addresses the question "which evidence does the analyst expect from logs?". During the design phase we investigate what and where to log in order to provide the identified evidence during the system operational time. In the last phase, logging rules are provided.

This is how design results are made available to developers. To this aim, each rule precisely defines what and where to log, and it prevents ambiguous and unstructured events.

Fig.1. Approach overview.

In this paper the focus is on timing failures. We aim to design logs that make it possible to find evidence of this type of failure during operations. Timing failures are the result of an unexpected suspension or termination of the program control flow, including the triggering of infinite loops. It is possible to detect timing failures by placing logs at specific points within the execution flow of the system entities. In particular, we focus on services and interactions by logging (i) a service start (SST) and a service end (SEN) event at the beginning and at the end of a service, respectively; (ii) an interaction start (IST) and an interaction end (IEN) event immediately before and after an interaction, respectively.

Fig.2 sketches how these events may be used in the context of a C or C++ program.

Fig.2. Using logging rules.

If the control flow is altered by a fault (e.g., a bad pointer manipulation) triggered by service_1 (Fig.2 A), SEN will likely be missing. If interaction_2 (Fig.2 B) fails (e.g., by never returning control, as in the case of a hang within the callee), we can find it out from the logs, since IEN is missing. No other instructions must exist between IST and IEN. Table I summarizes the logging rules dealing with timing failures. We extract failure data from our rule-based logs during the system operational time. On-line processing is enabled by sending logs over a delivery broker named Log-Bus, which eases the collection tasks in the case of distributed systems.
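The placement of SST/SEN and IST/IEN events described above can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the C/C++ of the instrumented server, and every name in it (`log_event`, `service`, `interact`, `read_config`, `faulty_service`, the in-memory `EVENTS` list) is a hypothetical stand-in, not part of the authors' tooling; the Log-Bus transport is not modelled.

```python
import itertools

# Hypothetical in-memory sink standing in for the real log transport.
EVENTS = []
_keys = itertools.count(1)


def log_event(kind, entity, name, key):
    """Record one rule-based event: SST, SEN, IST or IEN."""
    EVENTS.append((kind, entity, name, key))


def interact(entity, name, call, *args):
    """Wrap an interaction: IST immediately before, IEN immediately after."""
    key = next(_keys)
    log_event("IST", entity, name, key)
    result = call(*args)  # nothing else sits between IST and IEN
    log_event("IEN", entity, name, key)
    return result


def service(entity, name):
    """Decorator: SST at the start of a service, SEN at its end."""
    def wrap(func):
        def inner(*args, **kwargs):
            key = next(_keys)
            log_event("SST", entity, name, key)
            result = func(*args, **kwargs)
            # Deliberately no try/finally: if the control flow is diverted,
            # SEN must stay missing so that the failure leaves evidence.
            log_event("SEN", entity, name, key)
            return result
        return inner
    return wrap


def read_config(path):
    """Stand-in for a call into another entity or resource."""
    return {"path": path}


@service("entity_A", "service_1")
def service_1(path):
    return interact("entity_A", "interaction_2", read_config, path)


@service("entity_A", "faulty_service")
def faulty_service():
    # Simulates a fault that diverts the control flow before SEN.
    raise RuntimeError("simulated control-flow disruption")
```

Calling `service_1("x.conf")` records SST, IST, IEN, SEN in that order, with the start and end events of each pair sharing a key. Calling `faulty_service()` records only SST, which is exactly the missing-SEN evidence the rules are meant to produce.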

Logging and Analysis Infrastructure

Timing failures can be detected by processing in pairs the start and end events related to the same service or interaction. To ease the processing task, they are logged together with a unique key. For each entity, on-agent keeps constantly updated the expected duration between the start and end events of each service or interaction. For example, let us consider service 08 and interaction 15, each with its own expected duration.

Fig.4. On-agent internals.

A proper timeout is tuned for each duration; Fig.4 shows the timeouts of service 08 and interaction 15. If a service or interaction exceeds its currently estimated timeout, an error is raised; otherwise the duration and timeout estimates are updated. Two kinds of error are generated by the on-agent tool:
- Interaction Error (InE): generated when IST is not followed by the related IEN within the currently estimated timeout;
- Computation Error (CoE): generated when SST is not followed by the related SEN within the expected timeout and no interaction errors have been raised.
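The pairing and timeout logic described above can be sketched as follows. This is an illustrative reconstruction, not the on-agent implementation: the text does not say how estimates are updated or how timeouts are derived, so the exponential moving average (`alpha`), the multiplicative `margin`, the `parent` link between an interaction and its enclosing service, and the class name `PairMonitor` are all assumptions.

```python
class PairMonitor:
    """Pairs start/end events by key and flags pairs that exceed their timeout."""

    def __init__(self, initial_duration=1.0, margin=3.0, alpha=0.5):
        self.initial_duration = initial_duration  # assumed first estimate (s)
        self.margin = margin      # assumed: timeout = margin * expected duration
        self.alpha = alpha        # assumed smoothing factor for the estimate
        self.expected = {}        # name -> expected duration
        self.pending = {}         # key -> start event still waiting for its end
        self.blamed = set()       # service keys that already saw an InE
        self.errors = []          # (error_type, name, key)

    def timeout(self, name):
        return self.margin * self.expected.get(name, self.initial_duration)

    def on_start(self, kind, name, key, now, parent=None):
        # kind is "SST" for a service or "IST" for an interaction.
        self.pending[key] = {"kind": kind, "name": name,
                             "start": now, "parent": parent}

    def on_end(self, key, now):
        self.check(now)  # an end that arrives too late counts as an error
        entry = self.pending.pop(key, None)
        if entry is None:
            return
        old = self.expected.get(entry["name"], self.initial_duration)
        elapsed = now - entry["start"]
        self.expected[entry["name"]] = (1 - self.alpha) * old + self.alpha * elapsed

    def check(self, now):
        """Raise InE/CoE for every pending pair whose timeout has expired."""
        expired = [k for k, e in self.pending.items()
                   if now - e["start"] > self.timeout(e["name"])]
        # Interactions first, so an InE can suppress the CoE of its service.
        expired.sort(key=lambda k: self.pending[k]["kind"] != "IST")
        for key in expired:
            entry = self.pending.pop(key)
            if entry["kind"] == "IST":
                self.errors.append(("InE", entry["name"], key))
                self.blamed.add(entry["parent"])
            elif key not in self.blamed:
                self.errors.append(("CoE", entry["name"], key))
        return self.errors
```

In this sketch a service that hangs inside an interaction yields a single InE, while a service that hangs in its own code yields a CoE, mirroring the two error definitions above.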

An error log is permanently stored for each raised error. Collected logs are then supplied to the analysis phase.

V. EXPERIMENTS

We present our preliminary experience with the Apache Web Server version 1.3.41. We aim to evaluate the logging capabilities of both traditional logging and the proposed approach. This is done by conducting a software fault injection campaign, in order to force failure occurrence, and by evaluating the coverage of the logs, i.e., the percentage of failures actually observed on the Web Server for which evidence can be found in the logs.

Software Fault Injection

We inject software faults by means of changes in the source code of the program. Changes are introduced according to fault operators based on actual faults uncovered in several open-source projects [16]. Examples are the "missing function call" operator (OMFC) and the "missing variable initialization using a value" operator (OMVIV). A complete list of fault operators can be found in [16]. A single fault is introduced in the source code for each experiment. The code is compiled and the faulty Web Server version is stored for the experimental campaign. A support tool has been developed to automate the fault injection process. We inject 8,433 software faults into the main Web Server source code (i.e., the /src/main folder). Table II reports the breakdown of the experiments both by fault operator and by source file.

Web Server Modelling

We identify 6 entities encapsulating the main processing items composing the Web Server. They are mapped to the following source files, respectively: (i) http_protocol.c (ii) http_main.c (iii) http_request.c (iv) http_config.c (v) http_core.c (vi) http_vhost.c. We identify services and interactions among them and we instrument the code according to the proposed rules. We do not model the whole Web Server, even though we perform a full injection campaign on the code (Table II). This is done to show that (i) the analyst can freely choose only the entities he/she is interested in and (ii) the proposed rules enable failures to be logged even when a fault is activated in an un-modelled item.

Experimental campaign

We deploy a 2-host testbed to perform the campaign. Fig.5 depicts the involved components. The Client Machine hosts (i) on-agent, producing our Rule-Based (RB) error logs, and (ii) httperf, i.e., the Web Server workload generator. We configure httperf in order to exercise most of the features provided by the Web Server (i.e., virtual hosts, multiple methods and file extensions, cookies).

Fig.5. Testbed.

The Server Machine hosts (i) the current faulty Web Server and (ii) the test manager program. The test manager coordinates the experimental campaign (Fig.5). For each experiment it (1) starts on-agent, (2) starts a faulty Web Server version, (3) starts httperf, (4) stops the components after a proper timeout (i.e., long enough to allow the workload to complete) and collects the experiment data. Experiment outputs include the produced logs (both RB and Apache error logs) and a label summarizing the experiment outcome with respect to the Web Server behaviour observed in response to the injected fault.

The outcomes are classified as follows:
- Crash: unexpected termination of the Web Server;
- Hang: some of the HTTP requests, or the Web Server start/stop phases, are not executed within the timeout;
- Other: all error conditions that are not classifiable as crash or hang (e.g., non-timing failures, such as wrong values delivered to the client);
- No failure: all the requests supplied by the workload generator are correctly executed.

VI. RESULTS

During the campaign 1,386 (i.e., 101 hangs, 744 crashes, 541 other) out of 8,433 experiments result in a failure outcome. The Apache logging mechanism logs 615 out of the 1,386 failures.

The coverage is about 44.4%. Fig.6 depicts how coverage varies with respect to the outcomes. Apache error logs often lack entries in the case of hang and crash failures. In fact, only 11.9% of hangs and 37.5% of crashes are logged. Other failures are instead mostly logged (i.e., 59.9%). This is due to the inherent inability of traditional logs to provide evidence of timing failures. Clearly, after a crash or hang it is not possible to log any entry.

At the same time, there is little chance that evidence of an imminent crash or hang is available in the log. In contrast, non-timing failures, such as value failures, are more often due to errors that are detected within the code and then logged.

Fig.6. Coverage breakdown by failure type (Apache).

Overall, the proposed RB logs achieve a higher coverage: 849 out of 1,386 failures are logged, hence the coverage is about 61.3%. Fig.7 depicts how coverage varies with respect to the outcomes. In contrast with traditional logging, most hang and crash failures are logged (i.e., 81.2% and 87.2% respectively), whereas only 21.8% of other failures are logged.
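As a quick check of the two overall coverage figures, the arithmetic can be reproduced from the counts reported in the text. The helper name `coverage` is ours; the numbers are those of the campaign.

```python
def coverage(logged, failures):
    """Percentage of observed failures that left evidence in the logs."""
    return 100.0 * logged / failures


FAILURES = 101 + 744 + 541  # hangs + crashes + other = 1,386

apache = coverage(615, FAILURES)       # failures logged by Apache
rule_based = coverage(849, FAILURES)   # failures logged by RB logs

print(f"Apache: {apache:.1f}%  RB: {rule_based:.1f}%")
# prints "Apache: 44.4%  RB: 61.3%"
```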

This is a result of our design-based approach. RB logs have been designed specifically by focusing on timing failures. Consequently, most timing failures are logged, even though only a fraction of the source code has been instrumented according to the proposed rules. For the same reason, the coverage of other types of failures is low compared to timing failures.

Fig.7. Coverage breakdown by failure type (Proposed Approach).

To summarize, we perform an in-depth comparison between Apache and RB logs. This is done by splitting the number of failures of each outcome into 4 classes: logged by both RB and Apache, not logged by RB but logged by Apache, logged by RB but not logged by Apache, and logged by neither RB nor Apache (i.e., RB^Ap, !RB^Ap, RB^!Ap, !RB^!Ap, respectively).
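The four-class split can be expressed as a small sketch, assuming each experiment is summarized as a tuple of outcome label and two booleans saying whether the RB and Apache logs contain evidence. The function names and the record layout are hypothetical, chosen only to illustrate the partition.

```python
from collections import Counter


def classify(rb_logged, apache_logged):
    """Map the two log verdicts to one of the four classes used in the text."""
    rb = "RB" if rb_logged else "!RB"
    ap = "Ap" if apache_logged else "!Ap"
    return f"{rb}^{ap}"


def breakdown(experiments):
    """Per-outcome percentage of failures falling in each class.

    experiments: iterable of (outcome, rb_logged, apache_logged) records.
    """
    counts = {}
    for outcome, rb_logged, apache_logged in experiments:
        counts.setdefault(outcome, Counter())[classify(rb_logged, apache_logged)] += 1
    return {
        outcome: {cls: 100.0 * n / sum(per_class.values())
                  for cls, n in per_class.items()}
        for outcome, per_class in counts.items()
    }
```

Because the four classes are mutually exclusive and exhaustive, the percentages for each outcome sum to 100, and the coverage of either logging mechanism is the sum of the two classes in which it appears un-negated.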

As expected, most timing failures are logged by the RB approach only. Apache logs (Fig.8) provide further evidence only in 4.9% and 3.1% of cases (hangs and crashes respectively). Our design-based approach significantly increases the number of logged timing failures (69.3% and 49.7% for hangs and crashes respectively), thus potentially leading to more effective FFDA results. However, Apache logs are a better source for other types of failures, such as value failures. It is also interesting to note that Apache and RB logs are almost complementary in covering non-timing failures (only 2.4% of "other" failures are covered by both Apache and RB logs, i.e., there is a small intersection).

This suggests that a combined approach could increase the overall coverage capability of the logging mechanism.

VII. CONCLUSION

This paper presented the key elements of a rule-based logging approach aimed at overcoming the well-known limitations of traditional logging. The proposed rules focus on timing failures, such as crashes or hangs.

Experimental results on the widely used Apache Web Server show that (i) RB logging achieves a higher overall coverage than traditional logging; (ii) RB logs exhibit high coverage particularly with respect to timing failures, whereas traditional logs are better on other types of failures; (iii) RB and traditional logs are complementary with respect to other types of failures. A combined approach is promising for increasing the coverage of the logging mechanism. This can also be achieved by extending our set of rules.

Following this objective, future work will be devoted to the definition of logging rules designed to target non-timing failures, such as value failures. Also, methods and techniques will be investigated to avoid manual log production by enabling tools (e.g., using model-driven techniques) to automate the writing of logging code.