Lightweight Databases
Introduction
The World Wide Web (WWW) allows hypertext documents to be created and browsed by users across the Internet. A great deal of information is now available at the various WWW sites, and appears to be growing exponentially. Moreover, the web is being cross-linked to make it possible for users to find their way around the available information. Although WWW documents may be useful to human readers, there is often a need to perform automated queries on information.
Typical examples include searching for contact details, determining costs of services, re-formatting existing documents, searching indexes etc. The ability to perform predicate queries on the body of knowledge is important for resource discovery and for handling the information overload which WWW engenders. Current WWW technologies do not provide adequate support for such automated use of information. Many authors have proposed the integration of relational database management systems (RDBMSs) with WWW, although they have concentrated on interactive rather than automated query submission.
We have investigated an alternative approach in which web pages are marked-up with semantic information based on an underlying database schema. We include the database structure explicitly in the web, allowing queries to be performed client- or server-side. While not suitable for applications with highly intensive database requirements, such “lightweight” database webs allow modest database functionality to be obtained easily and without the overhead of using a full RDBMS. In this paper we present our approach and discuss it in relation to other work on database integration and semantic mark-up.
We describe an example application of the technique: that of “flattening” a document with hyperlinks to create a paper copy.
Semantic Mark-up
A document is a source of information for a reader. A typical document will be structured into headings, sections, tables etc, each of which may contain useful information. SGML[SoftQuad] allows a document’s structure to be completely divorced from its presentation. HTML[Berners-Lee94] is more rigid in this respect, in that presentational information is implicit in the tags used to mark-up a document. Although a document makes information available to the human reader, making the same information available to automatic processing tools requires additional effort.
For instance, although my phone number is readily available on the web, an automatic tool would be hard-put to distinguish it from my fax number. Human readers are much more tolerant than machines.
A tool must be able to determine the semantic content of a part of a document and extract the relevant information (without extraneous formatting). It should be able to navigate around a web to locate the necessary information – preferably without lengthy searching – and cope with relationships between pieces of information.
Techniques already exist in the database world for formally describing interconnected information. The most widely-accepted technique is to use the relational model[Date86] to structure information. The data may then be accessed by posing queries in relational algebra. We have investigated importing the relational formalism into WWW using a minimal extension to HTML. This allows elements of a document to be identified with elements of an underlying relational database.
The data is still stored directly in the web, with the database structures offering an alternative view of the data.
Notation
In overview, the approach is as follows: we first choose the information which we want to publish in machine-readable form, and perform a standard data analysis on it. The result of this is a conceptual model of the data as a collection of entities having attributes and connected by relations.
We then mark-up information in the web pages according to this schema – note that it is not necessary to create a logical model of the information as tables, as would be the case for a standard relational database. The notation for mark-up extends HTML with three extra elements:
-<ENT>…</ENT> identifies an entity of a particular class
-<ATTR>…</ATTR> identifies a named attribute of the entity
-<REL>…</REL> denotes a relationship between entities
These elements directly capture the conceptual-level structure of information, allowing individual entities to encapsulate several sub-values as attributes and allowing entities to be linked using meaningful connectives.
None of the elements has any presentational content – they provide meta-information about a part of a document but do not alter its presentation. They may thus be ignored by browsers presenting a document to a user. The overhead of adding such mark-up to a page is usually very small, and so does not impact on performance.
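Because the three elements carry no presentational content, removing them should leave the ordinary HTML of a page untouched. The following is a minimal Python sketch of such a stripping step, given for illustration only and not taken from our own tools. The function name strip_markup and the sample page are assumptions, and the regular expression assumes that no attribute value contains a ">" character.

```python
import re

# Matches opening and closing ENT, ATTR and REL tags, in any letter case.
# Assumption: attribute values never contain a '>' character.
LIGHTWEIGHT_TAG = re.compile(r"</?(?:ENT|ATTR|REL)\b[^>]*>", re.IGNORECASE)


def strip_markup(html):
    """Return the page with the lightweight database tags removed.

    Only the tags are deleted; the text and the ordinary HTML between
    them are kept, so the presentation of the page is unchanged.
    """
    return LIGHTWEIGHT_TAG.sub("", html)


# Illustrative page fragment (hypothetical, modelled on a project heading).
page = "<ENT KEY=mips><H1>The <ATTR NAME=shortname>MIPS</ATTR> Project</H1></ENT>"
plain = strip_markup(page)
print(plain)
```

A browser which does not know the extra elements would normally skip them anyway; an explicit stripping step of this kind is only needed where a strictly standard document is required.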
The elements turn a simple web into a lightweight database. They allow a database formalism to be imported directly into web pages, which allows a database-aware client application to extract information from the pages according to the underlying schema. We shall explain the use of the mark-up by way of an example below. Further details of the notation may be found in [Dobson94a].
Mark-up Example: Project Descriptions
To illustrate the use of the notation, we shall mark-up a web page which describes a research project. In our own Department all on-going projects have descriptions in the web. A typical project page is shown in figure 1. Each project has a number of useful pieces of information, including:
-a long title, short title and logo;
-an introduction and the project’s aims;
-any collaborating organisations;
-the funding source, total funding and RAL’s share of it; and
-a contact name for further information.
We begin by defining a suitable conceptual model for a project. Since our projects are very similar, it is possible to derive a canonical model which encapsulates the core information: this still allows individual projects to include extra information on their pages if desired. The model is shown in figure 2 – note that each entity class has a primary key (shown by underlining) which identifies each instance uniquely within a class.
Each project is represented as a single entity of the project class. This class has attributes of long and short names, logo etc. We mark-up the top-level project entity as follows:
<ENT KEY=mips> <H1> <ATTR NAME=logo><IMG SRC="MIPS.gif"></ATTR> The <ATTR NAME=shortname>MIPS</ATTR> Project </H1> … </ENT>
The ENT element provides the primary key value and entity class of the information. Significant pieces of information are captured within ATTR elements, which identify attributes by name. The page itself contains extra formatting information, such as the use of H1 for the heading, but this is not significant (from the database perspective) and lies outside the scope of the identified attributes. Entities are linked using relations. In our example, the MIPS project is related to a particular person under the has_contact relation. We might encode this as:
<ENT KEY=mips> … <REL NAME="has_contact" KEY=mdw HREF="/people/mdw.html"> <A HREF="/people/mdw.html">Michael Wilson</A> </REL> … </ENT>
The REL element relates the containing entity to another entity, identified by key value and entity class, within a named relationship. The HREF attribute provides location information, indicating the URL at which the target entity may be found. This simplifies searching, as a search engine may move directly to the target of the relationship.
We regard this as a useful navigational hint, not an essential part of the mark-up. Notice that the REL element contains a denotation of the relationship, which in this case is a hyperlink. It is important to realise that the relationship structure of the lightweight database is distinct from the hyperlink structure of the document.
It is possible to form relationships between entities without using hyperlinks, or hyperlinks without implying a relationship (in the database sense). This is particularly useful when we want to provide information in the database which is not directly accessible to a human browser: a relationship without a corresponding hyperlink is effectively invisible, although a database-aware client could still traverse the relationship to acquire the data.
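The distinction between relationships and hyperlinks can be made concrete with a short sketch. The Python below is an illustration under stated assumptions, not our implementation: the class name RelationScanner is invented, and the has_internal_note relation, its key note1 and the "Home" link are hypothetical additions to the example page. It uses the standard html.parser module, which reports tag and attribute names in lower case.

```python
from html.parser import HTMLParser


class RelationScanner(HTMLParser):
    """Record database relationships and visible hyperlinks separately."""

    def __init__(self):
        super().__init__()
        self.ents = []         # keys of the ENT elements currently open
        self.open_rel = None   # the REL element currently open, if any
        self.relations = []    # one dict per REL element
        self.hyperlinks = []   # HREF of every A element

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "ent":
            self.ents.append(a.get("key"))
        elif tag == "rel" and self.ents:
            self.open_rel = {"source": self.ents[-1], "name": a.get("name"),
                             "target": a.get("key"), "visible": False}
            self.relations.append(self.open_rel)
        elif tag == "a" and "href" in a:
            self.hyperlinks.append(a["href"])
            if self.open_rel is not None:
                # This relationship is denoted by a hyperlink.
                self.open_rel["visible"] = True

    def handle_endtag(self, tag):
        if tag == "ent" and self.ents:
            self.ents.pop()
        elif tag == "rel":
            self.open_rel = None


# Hypothetical page: one relationship shown as a hyperlink, one relationship
# with no hyperlink, and one hyperlink which is not a relationship.
PAGE = (
    '<ENT KEY=mips>'
    '<REL NAME="has_contact" KEY=mdw HREF="/people/mdw.html">'
    '<A HREF="/people/mdw.html">Michael Wilson</A></REL>'
    '<REL NAME="has_internal_note" KEY=note1></REL>'
    '<A HREF="/index.html">Home</A>'
    '</ENT>'
)

scanner = RelationScanner()
scanner.feed(PAGE)
scanner.close()
hidden = [r["name"] for r in scanner.relations if not r["visible"]]
print(hidden, scanner.hyperlinks)
```

In this sketch the relationship without an anchor never appears to a human reader, yet it is recorded alongside the visible one, while the "Home" hyperlink contributes nothing to the database.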
We may also wish to relate entities which are stored in the same document but which are logically distinct. For instance, the MIPS project’s introduction is a section entity presented along with the project’s top page. This relationship may be encoded by nesting an entity element directly inside a relationship element:
<ENT KEY=mips> … <REL NAME="has_introduction"> <ENT KEY="mips_introduction" CLASS=introduction> Most individuals are… </ENT> </REL> … </ENT>
Again, the database structure of the information is independent of the way it is presented in terms of in-line inclusion and hyperlinks. The information can be stored in the way which is most convenient for the most common mode of use (browsing) whilst remaining available to other modes (database queries).
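The mark-up used in this section could be read by a small interpreter along the following lines. This is a hedged Python sketch rather than our Caml Light code: the class name Interpreter, the shape of the three tables and the decision to record an image-valued attribute by its SRC are all assumptions, and the elided parts of the example page are simply left out.

```python
from html.parser import HTMLParser


class Interpreter(HTMLParser):
    """Build tables of entities, attributes and relations from a page."""

    def __init__(self):
        super().__init__()
        self.classes = {}      # entity key -> class (None if CLASS is absent)
        self.attributes = {}   # (entity key, attribute name) -> value
        self.relations = []    # (source key, relation name, target key)
        self.stack = []        # open ENT and REL elements, innermost last
        self.attr = None       # (key, name) of the ATTR currently open

    def current_entity(self):
        """Key of the nearest enclosing ENT element, or None."""
        return next((k for kind, k, _ in reversed(self.stack)
                     if kind == "ent"), None)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "ent":
            key = a.get("key")
            self.classes[key] = a.get("class")
            if self.stack and self.stack[-1][0] == "rel":
                # An ENT nested in a REL is the target of that relation.
                _, name, source = self.stack[-1]
                self.relations.append((source, name, key))
            self.stack.append(("ent", key, None))
        elif tag == "rel":
            source = self.current_entity()
            if "key" in a:
                self.relations.append((source, a.get("name"), a["key"]))
            self.stack.append(("rel", a.get("name"), source))
        elif tag == "attr":
            self.attr = (self.current_entity(), a.get("name"))
            self.attributes[self.attr] = ""
        elif tag == "img" and self.attr:
            # Assumption: an image-valued attribute is recorded by its SRC.
            self.attributes[self.attr] += a.get("src", "")

    def handle_endtag(self, tag):
        if tag in ("ent", "rel") and self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
        elif tag == "attr":
            self.attr = None

    def handle_data(self, data):
        if self.attr:   # text outside ATTR elements is formatting only
            self.attributes[self.attr] += data


PAGE = """<ENT KEY=mips>
<H1><ATTR NAME=logo><IMG SRC="MIPS.gif"></ATTR> The
<ATTR NAME=shortname>MIPS</ATTR> Project</H1>
<REL NAME="has_contact" KEY=mdw HREF="/people/mdw.html">
<A HREF="/people/mdw.html">Michael Wilson</A></REL>
<REL NAME="has_introduction">
<ENT KEY="mips_introduction" CLASS=introduction>Introductory text.</ENT>
</REL>
</ENT>"""

db = Interpreter()
db.feed(PAGE)
db.close()
print(db.attributes, db.relations)
```

The heading text and the anchor text are discarded because they lie outside any ATTR element, which matches the observation that formatting lies outside the scope of the identified attributes.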
Application Example: Printing Hypertext Documents
Our motivation for the lightweight database extensions was a need to generate printed “hand-outs” from hypertext project descriptions for distribution to customers without web access. We have developed a small server-side application which uses the lightweight mark-up to “flatten” a document for printing.
The printing application consists of three parts: an interpreter for the semantic mark-up, a query engine, and a template post-processor. The first stage interprets a lightweight database web to create a table of entities, attributes and relationships. This table may then be interrogated by the query engine to extract the required entities or attributes, using general database-style queries. Templates are used to re-format extracted elements into HTML for presentation to the user.
A template is an HTML document augmented with queries acting as place-holders for the results of database accesses. A typical query might be to locate the project entity with a given primary key, or to extract the phone number of the contact person for a project.
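The two typical queries just mentioned might be expressed over the interpreter’s table roughly as follows. This Python sketch is illustrative only: the table layout, the function names, the person class and the phone attribute are assumptions, and the phone number is an invented placeholder rather than real data.

```python
# Hypothetical tables, of the kind an interpreter stage might produce.
ENTITIES = {"mips": "project", "mdw": "person"}     # key -> entity class
ATTRIBUTES = {                                      # (key, name) -> value
    ("mips", "shortname"): "MIPS",
    ("mdw", "phone"): "+00 0000 000000",            # placeholder value
}
RELATIONS = [("mips", "has_contact", "mdw")]        # (source, name, target)


def find_entity(cls, key):
    """Return the key if an entity of that class and key exists, else None."""
    return key if ENTITIES.get(key) == cls else None


def related(source, relation):
    """Return the keys of all entities related to `source` under `relation`."""
    return [t for s, r, t in RELATIONS if s == source and r == relation]


def attribute(key, name):
    """Return the value of a named attribute of an entity, or None."""
    return ATTRIBUTES.get((key, name))


def contact_phones(project_key):
    """Phone numbers of the contact people for a project."""
    return [attribute(person, "phone")
            for person in related(project_key, "has_contact")]


print(find_entity("project", "mips"), contact_phones("mips"))
```

The second query is a join in relational terms: it follows the has_contact relation and then projects the phone attribute of each target.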
The printing application submits each query to the query engine and uses the results to expand the template. There is a single template for each class of document. Thus all project hand-outs contain the same information in the same format, even though hypertext project descriptions show some individual variation. We chose to prototype our system in Caml Light[Cousineau90], a dialect of the ML language.
It has proved to be an excellent platform for developing “proof of concept” tools, and we have extended the basic system with a small library[Dobson94b] of types and functions encapsulating the common WWW operations. The lightweight database interpreter and query engine form part of this library, which makes it easy to develop additional database-aware tools. The printing application may be accessed by adding a “print” icon to a page which links to a CGI script invoking the flattening system – the rightmost icon at the bottom of the figure generates a flattened copy of the project description with a single click.
The result of the flattening process is an HTML document – usually, though not necessarily, without hyperlinks – which may then be printed from a browser or passed to an intermediate rendering system for further processing (especially useful for high-quality copy).
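The template expansion step described in this section might look like the sketch below. It is written in Python for illustration and is not the Caml Light implementation. The place-holder syntax {{key.attribute}} is an assumption, since the syntax used for queries inside templates is not given here, and the aims text is invented sample data.

```python
import re

# Hypothetical attribute table: (entity key, attribute name) -> value.
ATTRIBUTES = {
    ("mips", "shortname"): "MIPS",
    ("mips", "aims"): "Example aims text.",
}

# Assumed place-holder syntax: {{key.attribute}} inside ordinary HTML.
PLACEHOLDER = re.compile(r"\{\{(\w+)\.(\w+)\}\}")

TEMPLATE = "<H1>{{mips.shortname}}</H1>\n<P>{{mips.aims}}</P>"


def expand(template, attributes):
    """Replace each place-holder by the result of its attribute query.

    A query with no result expands to the empty string, so that a missing
    optional section simply disappears from the hand-out.
    """
    def lookup(match):
        key, name = match.group(1), match.group(2)
        return attributes.get((key, name), "")

    return PLACEHOLDER.sub(lookup, template)


handout = expand(TEMPLATE, ATTRIBUTES)
print(handout)
```

Because one template serves a whole class of documents, the same expand call with a different attribute table would give a second project’s hand-out in an identical layout.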
Spanning Servers
Entities are related using REL elements, which may contain an HREF attribute pointing to the document containing the target entity. This attribute allows an application to navigate information using the database schema rather than the hyperlinks. Often the hypertext structure of a document will be richer than the relationship structure of the lightweight database, so this may reduce the amount of traversal necessary.
If two servers share a common schema for (parts of) their information, they may include relationships to each other’s entities. This allows the servers to co-ordinate their information into a single virtual database. An example application might be where two sites co-operate in providing a bibliography of published work.
They might agree on a common schema for references, and mark-up their reference lists accordingly. Such schemata have been developed for a number of purposes by other projects[Genesereth90].
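As a rough illustration of a relationship spanning two servers, the Python sketch below follows a relation from one site’s reference list to an entity held at another site, using the HREF hint to go directly to the target document. Everything in it is hypothetical: the site names, the reference schema, the cites relation and the dictionary which stands in for fetching documents (no network access is made).

```python
# A dictionary stands in for the web: URL -> interpreted page contents.
# Both hypothetical sites use the same agreed schema for references.
WEB = {
    "http://site-a.example/refs.html": {
        "entities": {"paper1": {"class": "reference", "title": "First paper"}},
        # (source key, relation name, target key, HREF hint)
        "relations": [
            ("paper1", "cites", "paper2", "http://site-b.example/refs.html"),
        ],
    },
    "http://site-b.example/refs.html": {
        "entities": {"paper2": {"class": "reference", "title": "Second paper"}},
        "relations": [],
    },
}


def follow(url, key, relation):
    """Return the entities related to `key` in the document at `url`.

    The HREF hint of each relation names the document holding the target,
    so no hyperlinks need to be searched; a relation without a hint is
    assumed to point into the same document.
    """
    results = []
    for source, name, target, href in WEB[url]["relations"]:
        if source == key and name == relation:
            target_doc = WEB[href or url]
            results.append(target_doc["entities"][target])
    return results


cited = follow("http://site-a.example/refs.html", "paper1", "cites")
print(cited)
```

The query is answered from two documents on two servers as though they formed one table, which is the sense in which the sites form a single virtual database.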
However, there is a real danger that different sites might adopt subtly different schemata for the same information, resulting in a proliferation of non-interworking lightweight databases. A similar approach may be used to generate indices to information using “robots” – autonomous daemons which scavenge for information in WWW.
By publishing the semantic content of pages, sites offer better search and retrieval possibilities than simple keyword searches, and may avoid the overhead involved in speculative traversal of links by “data miner” applications. In general a lightweight database is equally accessible to client- or server-side processing, meaning that processing may occur wherever is most efficient or convenient.
Related Work
Integrated Relational Databases
As mentioned earlier, several authors (e.g. [Varela94]) have reported integrating database engines with WWW. A typical example is a site’s telephone directory which is accessed using a CGI script interfacing to a relational database engine such as Ingres. We see the current work as being complementary to these efforts. A lightweight database will never be a substitute for a full RDBMS in applications with substantial searching requirements.
However, many applications have substantially less intensive requirements. Our example of structuring a document is typical: few sites would store their documents in an RDBMS as a matter of course, but turning a document into a lightweight database facilitates searching and re-formatting with little effort. A further feature of our approach is that it makes the database public.
The mark-up essentially publishes the structure of the information, rather than having it tied-up inside a package. This makes client-side querying – rather than keyword-only searching – possible, off-loading processing from the central server[DeBra94].
SGML and Hytime
Of major importance is the migration of WWW towards the SGML and Hytime standards. These will allow WWW to provide richer and more varied mark-up styles, possibly tailoring the tags used towards precisely the sort of machine-readability we have been investigating in the current work. We believe that our approach allows simple, lightweight, generic databases to be constructed within web pages. These databases may use any appropriate data model agreed between communicating parties.
The additional elements are sufficiently simple that they do not enlarge documents so as to impact on performance, and may easily be recognised and stripped from documents if necessary. SGML offers the possibility of alternative, document-specific mark-up to define the structure of information. This allows more “targeted” mark-up. It also allows documents which do not have a clear relational structure to have semantics added to them – clearly a limitation of our approach. There is a significant cost in moving to full SGML, however, and it is by no means clear how far WWW will evolve in this direction.
We believe a more important point is that our approach couples the semantics of a page directly with its text, which makes it difficult to support multiple views onto a set of pages. Essentially this is a similar problem to that of explicit link and anchor tags in HTML – the author pre-defines the hyperlink (or database) structure, limiting future re-use and expansion.
However, the same techniques which decouple links from documents in Hytime or Microcosm[David93] may also be employed in lightweight databases. The decision to move towards machine- as well as human-readable documents is the important point we wish to stress. The exact approach taken is in many respects much less important than the result of making the web more amenable to automated processing.
Conclusion
We have presented a small extension to HTML to incorporate semantic information into a document’s mark-up. The semantics follow an underlying schema based on a well-known database formalism, and allow the construction of generic “lightweight” databases which may span servers. The mark-up may be used by client- or server-side applications to extract information from web pages using relational queries.
We have presented an example of generating printed copies of hypertext documents, using mark-up to extract sections for printing. This avoids the need to follow hyperlinks, and allows alternative sections to be inserted or omitted as required. Further work will focus on making queries more efficient, checking the integrity of a lightweight database against a conceptual model, and exchanging data between lightweight databases and full RDBMSs.
We believe that information retrieval using automated tools is a valuable way of reducing the mass confusion in WWW. Making the semantic content of pages directly available increases the accessibility of the information and – by allowing improved searching and indexing – may reduce the need for keyword-only searches.



