A new compact XML parsing algorithm without any dependencies. It's implemented as a RubyGem to provide a non-native XML parser for particular use cases. RubyGem at http://rubygems.org/gems/xml-motor and https://github.com/abhishekkr/rubygem_xml_motor
aXML-Motor
XML Document Parsing
Algorithm
version 2011.11.04
Abhishek Kumar ~=ABK=~
http://github.com/abhishekkr
http://www.twitter.com/abionic
Algorithm, Ruby Source and Gem:
[axml-motor] @GitHub: http://github.com/abhishekkr/axml-motor.git
rubygem's src @GitHub: http://github.com/abhishekkr/rubygem_xml_motor.git
gem install @RubyGems: http://rubygems.org/gems/xml-motor
Algorithm-Walk-through
Example XML Content:
<BODY>
<DIV id='banner'>
<H1>aXML-Motor</H1>
<H5>A new algorithm based compact XML Parser with <I>no
dependencies</I>.
</H5>
</DIV>
<DIV id='details'>
<SPAN class='github'>@github:
<A href='http://github.com/abhishekkr/axml-motor.git'>
axml-motor</A>
</SPAN>
<DIV class='gem'>
<SPAN id='source' class='github'>@github:
<A href='http://github.com/abhishekkr/rubygem-xml-motor.git'>rubygem-xml-motor</A>
</SPAN>
<SPAN class='rubygems'>@rubygems:
<A href='http://rubygems.org/gems/xml-motor.git'>xml-motor</A>
</SPAN>
</DIV>
<I> It's a new algorithm implemented to build a real compact parser
(v0.0.2 has less than 200 ruby source code lines) without any
dependencies.</I>
</DIV>
</BODY>
[Step.1] Split the XML Content
(1.1) Split by '<'
store as XMLNodes
[0] BODY>
[1] DIV id='banner'>
[2] H1>aXML-Motor
[3] /H1>
[4] H5>A new algorithm based compact XML Parser with
[5] I>no dependencies
[6] /I>.
[7] /H5>
[8] /DIV>
[9] DIV id='details'>
[10] SPAN class='github'>@github:
[11] A href='http://github.com/abhishekkr/axml-motor.git'>axml-motor
[12] /A>
[13] /SPAN>
[14] DIV class='gem'>
[15] SPAN id='source' class='github'>@github:
[16] A href='http://github.com/abhishekkr/rubygem-xml-motor.git'>rubygem-xml-motor
[17] /A>
[18] /SPAN>
[19] SPAN class='rubygems'>@rubygems:
[20] A href='http://rubygems.org/gems/xml-motor.git'>xml-motor
[21] /A>
[22] /SPAN>
[23] /DIV>
[24] I> It's a new algorithm implemented to build a real compact parser
(v0.0.2 has less than 200 ruby source code lines) without any
dependencies.
[25] /I>
[26] /DIV>
[27] /BODY>
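Step 1.1 can be sketched in a few lines of Ruby. This is only an illustration on a shortened input; the helper name `split_by_open_bracket` is hypothetical and is not taken from the gem's source.

```ruby
# Hypothetical helper illustrating Step 1.1; not the gem's actual API.
def split_by_open_bracket(xml)
  # Splitting on '<' leaves an empty first piece (the text before the
  # first tag), so it is dropped. Every remaining piece starts with a
  # tag and runs up to the next '<'.
  xml.split('<').drop(1)
end

xml = "<H5>A new parser with <I>no dependencies</I>.</H5>"
pieces = split_by_open_bracket(xml)
pieces.each_with_index { |piece, i| puts "[#{i}] #{piece}" }
```

Any text placed before the first tag would be discarded by this sketch, which is an assumption that holds for the example content above.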
(1.2) Split previous step1.1 result by '>'
update XMLNodes
[0] [ 'BODY', '' ]
[1] ['DIV id='banner'', '' ]
[2] ['H1', 'aXML-Motor' ]
[3] ['/H1', '']
[4] ['H5', 'A new algorithm based compact XML Parser with ']
[5] ['I', 'no dependencies']
[6] ['/I', '.']
[7] ['/H5', '']
[8] ['/DIV', '']
[9] ['DIV id='details'', '']
[10] ['SPAN class='github'', '@github: ']
[11] ['A href='http://github.com/abhishekkr/axml-motor.git'',
'axml-motor']
[12] ['/A', '']
[13] ['/SPAN', '']
[14] ['DIV class='gem'', '']
[15] ['SPAN id='source' class='github'', '@github: ']
[16] ['A href='http://github.com/abhishekkr/rubygem-xml-motor.git'',
'rubygem-xml-motor']
[17] ['/A', '']
[18] ['/SPAN', '']
[19] ['SPAN class='rubygems'', '@rubygems: ']
[20] ['A href='http://rubygems.org/gems/xml-motor.git'', 'xml-motor']
[21] ['/A', '']
[22] ['/SPAN', '']
[23] ['/DIV', '']
[24] ['I', 'It's a new algorithm implemented to build a real compact
parser (v0.0.2 has less than 200 ruby source code lines) without
any dependencies.']
[25] ['/I', '']
[26] ['/DIV', '']
[27] ['/BODY', '']
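A minimal Ruby sketch of Step 1.2 follows. The helper name `split_tag_and_content` is hypothetical, and limiting the split to two parts is an assumption made here so that a stray '>' inside the text stays with the content.

```ruby
# Hypothetical helper illustrating Step 1.2; not the gem's actual API.
def split_tag_and_content(piece)
  # Split only on the first '>' (assumption): the left part is the tag
  # with its attributes, the right part is the text that follows it.
  tag, content = piece.split('>', 2)
  [tag.to_s, content.to_s]
end

pieces = ["DIV id='banner'>", "H1>aXML-Motor", "/H1>"]
xml_nodes = pieces.map { |piece| split_tag_and_content(piece) }
xml_nodes.each_with_index { |node, i| puts "[#{i}] #{node.inspect}" }
```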
(1.3) Split the first element of each entry by space/tab. Mark the 1st part as
tag_name, then split the remaining part by '=' iteratively to build a
key=>value pair per attribute.
update XMLNodes
[0] [ ['BODY', {}], '' ]
[1] [ ['DIV', {'id'=>'banner'}], '' ]
[2] [ ['H1', {}], 'aXML-Motor' ]
[3] [ ['/H1', {}], '']
[4] [ ['H5', {}], 'A new algorithm based compact XML Parser with ']
[5] [ ['I', {}], 'no dependencies']
[6] [ ['/I', {}], '.']
[7] [ ['/H5', {}], '']
[8] [ ['/DIV', {}], '']
[9] [ ['DIV', {'id'=>'details'}], '']
[10] [ ['SPAN', {'class'=>'github'}], '@github: ']
[11] [ ['A', {'href'=>'http://github.com/abhishekkr/axml-motor.git'}],
'axml-motor']
[12] [ ['/A', {}], '']
[13] [ ['/SPAN', {}], '']
[14] [ ['DIV', {'class'=>'gem'}], '']
[15] [ ['SPAN', {'id'=>'source', 'class'=>'github'}], '@github: ']
[16] [ ['A',
{'href'=>'http://github.com/abhishekkr/rubygem-xml-motor.git'}],
'rubygem-xml-motor']
[17] [ ['/A', {}], '']
[18] [ ['/SPAN', {}], '']
[19] [ ['SPAN', {'class'=>'rubygems'}], '@rubygems: ']
[20] [ ['A', {'href'=>'http://rubygems.org/gems/xml-motor.git'}],
'xml-motor']
[21] [ ['/A', {}], '']
[22] [ ['/SPAN', {}], '']
[23] [ ['/DIV', {}], '']
[24] [ ['I', {}], 'It's a new algorithm implemented to build a real
compact parser (v0.0.2 has less than 200 ruby source
code lines) without any dependencies.']
[25] [ ['/I', {}], '']
[26] [ ['/DIV', {}], '']
[27] [ ['/BODY', {}], '']
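Step 1.3 could look roughly like the sketch below. The helper name `parse_tag` is hypothetical. Note that instead of splitting the attribute text on '=' iteratively as described above, this sketch swaps in a regular-expression scan for quoted key=value pairs, which is an assumption chosen to keep the example short.

```ruby
# Hypothetical helper illustrating Step 1.3; not the gem's actual API.
def parse_tag(tag_text)
  # The first whitespace-separated token is the tag name; the rest, if
  # any, is the attribute text.
  name, attr_text = tag_text.strip.split(/\s+/, 2)
  attrs = {}
  # Assumption: attribute values are always wrapped in matching single
  # or double quotes.
  attr_text.to_s.scan(/([^\s=]+)\s*=\s*(['"])(.*?)\2/) do |key, _quote, value|
    attrs[key] = value
  end
  [name, attrs]
end

node = ["SPAN id='source' class='github'", '@github: ']
parsed = [parse_tag(node[0]), node[1]]
p parsed
```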
Here, we have the XMLNodes as we wanted them.
Now it's time to index them.
[Step.2] Index the processed XMLNodes
There are three things involved in indexing the XMLNodes:
Tag_Name:
Every element of XMLNodes has three components (tag name, attribute hash and
content). The Tag Name is the first of these, available at XMLNodes.all[ [TAG_NAME, *], *]
Depth:
The level of the Node in the XML Node Tree, starting from '0'.
Index:
The position of the Node in the XMLNodes Array.
How to Index-ify?
There will be one element per Tag_Name: a Hash keyed by the 'Depth' at which the tag is
found. Each value is an array of 2*number_of_nodes entries (the starting and ending
'Index' of every Node with that Tag_Name at that Depth).
Example:
From the above XMLNodes, ['DIV'] would hold {1=>[1,8, 9,26], 2=>[14,23]}
because 'Tag_Name' DIV has the 'Index' sets 1,8 and 9,26 at 'Depth' 1,
and similarly the 'Index' set 14,23 at 'Depth' 2.
Indexed XMLTags for above processed XMLNodes will be as follows:
calculated XMLTags
['BODY'] = {0=>[0,27]}
['DIV'] = {1=>[1,8, 9,26], 2=>[14,23]}
['H1'] = {2=>[2,3]}
['H5'] = {2=>[4,7]}
['I'] = {3=>[5,6], 2=>[24,25]}
['SPAN'] = {2=>[10,13], 3=>[15,18, 19,22]}
['A'] = {3=>[11,12], 4=>[16,17, 20,21]}
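The indexing step can be sketched with a single pass and a depth counter. The helper name `index_tags` is hypothetical, and the sketch assumes well-formed input with no self-closing tags. It runs on the 'banner' DIV alone, so depths here start at 0 for that DIV rather than 1 as in the full example.

```ruby
# Hypothetical helper sketching Step 2; not the gem's actual API.
# Assumes every opening tag has a matching closing tag.
def index_tags(xml_nodes)
  tags = {}
  depth = 0
  xml_nodes.each_with_index do |((name, _attrs), _content), index|
    if name.start_with?('/')
      # A closing tag belongs to the depth of its matching opening tag.
      depth -= 1
      ((tags[name[1..-1]] ||= {})[depth] ||= []) << index
    else
      ((tags[name] ||= {})[depth] ||= []) << index
      depth += 1
    end
  end
  tags
end

xml_nodes = [
  [['DIV', { 'id' => 'banner' }], ''],
  [['H1', {}], 'aXML-Motor'],
  [['/H1', {}], ''],
  [['H5', {}], 'A new algorithm based compact XML Parser with '],
  [['I', {}], 'no dependencies'],
  [['/I', {}], '.'],
  [['/H5', {}], ''],
  [['/DIV', {}], '']
]

xml_tags = index_tags(xml_nodes)
xml_tags.each { |name, depths| puts "['#{name}'] = #{depths}" }
```

Each start index is immediately followed by its end index in the per-depth array, because a node at a given depth must close before the next node at that same depth can open.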
[Step.3] Grab My Node from processed XMLNodes using XMLTags
Now suppose I aim for a Tag_Name 'XYZ'. Look up XMLTags['XYZ'], iterate
through all of its depths and extract 2 indexes at a time. Each pair of indexes
marks the start and end node; fetch all values between those nodes from XMLNodes.
This will return the set of values held by Tag_Name 'XYZ'.
Suppose a tree form such as 'ABC.XYZ' is provided; then start from the top node, 'ABC' in
this context.
Grab all its nodes. Now move on to the lower nodes and keep only the Indexes found within
the Node Index ranges provided by the earlier node. This ends with the filtered set
of Indexes for 'XYZ' falling only under the Index-Range of 'ABC'.
To check for a Tag_Name with an attribute, for every filtered Index-Range, just check whether it
has the required attribute as its key-value pair.
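The lookup and filtering described above can be sketched as follows. The helper names (`ranges_for`, `ranges_for_path`, `with_attribute`) are hypothetical and not taken from the gem. The XMLTags data is copied from Step 2, and only the three opening SPAN entries of XMLNodes are included, keyed by their index, to keep the example short.

```ruby
# XMLTags as calculated in Step 2 of the walk-through.
XML_TAGS = {
  'BODY' => { 0 => [0, 27] },
  'DIV'  => { 1 => [1, 8, 9, 26], 2 => [14, 23] },
  'H1'   => { 2 => [2, 3] },
  'H5'   => { 2 => [4, 7] },
  'I'    => { 3 => [5, 6], 2 => [24, 25] },
  'SPAN' => { 2 => [10, 13], 3 => [15, 18, 19, 22] },
  'A'    => { 3 => [11, 12], 4 => [16, 17, 20, 21] }
}

# All [start, end] pairs of one tag name, across every depth.
def ranges_for(xml_tags, name)
  xml_tags.fetch(name, {}).values.flat_map { |idx| idx.each_slice(2).to_a }.sort
end

# Resolve a dotted path such as 'H5.I': each later name keeps only the
# ranges lying strictly inside some range of the name before it.
def ranges_for_path(xml_tags, path)
  names = path.split('.')
  names.drop(1).inject(ranges_for(xml_tags, names.first)) do |outer, name|
    ranges_for(xml_tags, name).select do |first, last|
      outer.any? { |o_first, o_last| o_first < first && last < o_last }
    end
  end
end

# Keep only ranges whose opening node carries the attribute key=value.
def with_attribute(xml_nodes, ranges, key, value)
  ranges.select { |first, _last| xml_nodes[first][0][1][key] == value }
end

# Only the opening SPAN entries of XMLNodes, keyed by index (shortened).
span_nodes = {
  10 => [['SPAN', { 'class' => 'github' }], '@github: '],
  15 => [['SPAN', { 'id' => 'source', 'class' => 'github' }], '@github: '],
  19 => [['SPAN', { 'class' => 'rubygems' }], '@rubygems: ']
}

p ranges_for_path(XML_TAGS, 'H5.I')
p ranges_for_path(XML_TAGS, 'DIV.A')
p with_attribute(span_nodes, ranges_for(XML_TAGS, 'SPAN'), 'class', 'github')
```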
Example:
Case: Grabbing 'SPAN', with attribute class='github'
It's a single node; grab all its Index-Ranges (10,13), (15,18) and (19,22).
Here, only XMLNodes[10] and XMLNodes[15] have the required attribute.
Now, grab all data from XMLNodes[10][1] to XMLNodes[13-1][1] and from
XMLNodes[15][1] to XMLNodes[18-1][1].
Result:
['@github: <A href='http://github.com/abhishekkr/axml-motor.git'>axml-motor</A>' ,
'@github: <A href='http://github.com/abhishekkr/rubygem-xml-motor.git'>rubygem-xml-motor</A>']
Case: Grabbing 'H5.I'
Top node is 'H5', grab its Index-Range (4,7).
Second node 'I', grab all falling within the ranges from the previous node: (5,6).
Now, grab all data from XMLNodes[5][1] to XMLNodes[6-1][1].
Result:
['no dependencies']
Below, you'll also see that you need not give the entire hierarchy to fetch a
descendant from the child tree of any node. Giving just the major scope nodes works
as well as providing the exact hierarchy.
Case: Grabbing 'DIV.A'
Top node is 'DIV', grab all its Index-Ranges (1,8), (9,26) and (14,23).
Second node 'A', grab all falling within the ranges from the previous node: (11,12), (16,17)
and (20,21).
Now, grab all data from XMLNodes[11][1] to XMLNodes[12-1][1], from XMLNodes[16][1] to
XMLNodes[17-1][1] and from XMLNodes[20][1] to XMLNodes[21-1][1].
Result:
['axml-motor', 'rubygem-xml-motor', 'xml-motor']
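The "grab all data" step in the cases above can be sketched as below. The helper names `render_tag` and `inner_content` are hypothetical. Rebuilding the markup of nested tags is an assumption inferred from the 'SPAN' result shown earlier, where the inner A tag appears in the output.

```ruby
# Hypothetical helpers sketching the data grab of Step 3; not the gem's API.

# Rebuild the markup of one tag from its name and attribute hash.
def render_tag(name, attrs)
  attr_text = attrs.map { |key, value| " #{key}='#{value}'" }.join
  "<#{name}#{attr_text}>"
end

# Join the content of XMLNodes[first] with the re-rendered tag and
# content of every node after it, up to XMLNodes[last - 1].
def inner_content(xml_nodes, first, last)
  (first + 1...last).inject(xml_nodes[first][1].dup) do |text, i|
    (name, attrs), content = xml_nodes[i]
    text << render_tag(name, attrs) << content
  end
end

# The H5 node of the example, re-indexed from 0 for brevity.
xml_nodes = [
  [['H5', {}], 'A new algorithm based compact XML Parser with '],
  [['I', {}], 'no dependencies'],
  [['/I', {}], '.'],
  [['/H5', {}], '']
]

puts inner_content(xml_nodes, 1, 2) # content of I
puts inner_content(xml_nodes, 0, 3) # content of H5, nested markup included
```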