Saturday, September 5, 2009

XML parsing in Erlang

XML handling in Erlang is hard for me, so I surveyed the available approaches first.

Characteristics


XML handling has two major phases.

  • Parse the XML document and build data (a tree of elements)
    • xmerl_scan (the whole element tree is built at once)
    • SAX type parser
      • xmerl_eventp
      • erlsom_sax (third-party; not bundled with Erlang/OTP)

  • Access (traverse) elements within the data
    • XPath (xmerl_xpath)
    • XSLT (xmerl_xs)
    • callback (hook) functions from a SAX type parser
    • hand-made logic
      • traverse the tree
      • extract tuples from the element tree (list) with a list comprehension

So an XML parsing approach is characterized by its parse method and its access method.
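To make the two phases concrete, here is a minimal sketch of the list comprehension technique from the list above. The module name lc_sketch and the inline document are assumptions made for illustration; xmerl_scan:string/1 is used instead of a file so the sketch is self-contained.

```erlang
-module(lc_sketch).
-export([demo/0]).
-include_lib("xmerl/include/xmerl.hrl").

demo() ->
    Doc = "<Envelope><Title>envelope title</Title>"
          "<InnerEnv><IDNUM>403276</IDNUM><Pages>0</Pages></InnerEnv>"
          "</Envelope>",
    %% Phase 1: parse the whole document into an element tree.
    {Root, _Rest} = xmerl_scan:string(Doc),
    %% Phase 2: access the data. The generator pattern keeps only the
    %% direct children of the root whose first child is a text node;
    %% children that do not match the pattern are silently skipped.
    [{Name, Value}
     || #xmlElement{name = Name,
                    content = [#xmlText{value = Value} | _]}
            <- Root#xmlElement.content].
```

With this input, lc_sketch:demo() should return [{'Title',"envelope title"}]; InnerEnv is skipped because its first child is an element, not text.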

Matrix of methods

Parse method | Access method                            | Sample on the Web           | By
-------------+------------------------------------------+-----------------------------+-------------------
xmerl_scan   | xmerl_xpath                              | Parsing Atom with Erlang    | Sam Ruby
xmerl_scan   | xmerl_xpath (with useful macro)          | XML processing in Erlang    | Torbjörn Törnkvist
xmerl_scan   | hand-made (traverse tree by lists:foldl) | Return Erlang Data from XML | Muharem Hrnjadovic
xmerl_scan   | hand-made (use list comprehension)       | XML processing in Erlang    | Hakan Mattson
xmerl_eventp | callback (hook) function                 | XML processing in Erlang    | Torbjörn Törnkvist
erlsom_sax   | callback function                        | XML processing in Erlang    | Willem de Jong


Operation example


example-1 : xmerl_scan + xmerl_xpath


If you know exactly which elements you need and the source XML file is not too large, parse with xmerl_scan and access with xmerl_xpath.
Note: as of Erlang/OTP R13B01, xmerl_xpath supports XPath 1.0.

Inspired by Torbjörn Törnkvist's code.

sample xml data ("e.xml")

<Envelope>
<Title>envelope title</Title>
<InnerEnv>
<IDNUM>403276</IDNUM>
<ItemName>Name String</ItemName>
<Pages>0</Pages>
</InnerEnv>
</Envelope>

code


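-module(example1).
-export([doit/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% ?Val(X): X must be a one-element list holding an #xmlElement{} whose
%% first child is a text node; yields {ElementName, Text}.
%% The wrapping fun keeps N and V local to each macro expansion.
-define(Val(X),
        (fun() ->
                 [#xmlElement{name = N,
                              content = [#xmlText{value = V}|_]}] = X,
                 {N, V}
         end)()).

doit(File) ->
    %% Parse the whole file into an element tree.
    {Xml, _} = xmerl_scan:file(File),
    [?Val(xmerl_xpath:string("/Envelope/Title", Xml)),
     ?Val(xmerl_xpath:string("//IDNUM", Xml)),
     ?Val(xmerl_xpath:string("//ItemName", Xml)),
     ?Val(xmerl_xpath:string("//Pages", Xml))].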

results

1> example1:doit("e.xml").
[{'Title',"envelope title"},
 {'IDNUM',"403276"},
 {'ItemName',"Name String"},
 {'Pages',"0"}]
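The ?Val macro fails with a badmatch when an XPath expression selects nothing. As a hedged alternative, the sketch below selects the text nodes directly with text() and returns an empty string when nothing matches. The module name xpath_text_sketch and the helper text_of/2 are hypothetical names introduced here, not part of xmerl.

```erlang
-module(xpath_text_sketch).
-export([text_of/2, demo/0]).
-include_lib("xmerl/include/xmerl.hrl").

%% Concatenate the values of all text nodes selected by Path.
%% Returns "" when the expression selects no text node.
text_of(Path, Doc) ->
    lists:append([V || #xmlText{value = V} <- xmerl_xpath:string(Path, Doc)]).

demo() ->
    {Doc, _Rest} = xmerl_scan:string(
                     "<Envelope><InnerEnv><IDNUM>403276</IDNUM></InnerEnv></Envelope>"),
    {text_of("//IDNUM/text()", Doc), text_of("//Missing/text()", Doc)}.
```

Whether a missing element should crash or yield an empty value depends on how strict the input format is; the macro above is the stricter of the two.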

example-2 : xmerl_scan + traverse element tree by lists:foldl


If you want to translate the whole XML document into another scheme, you need to traverse the whole tree, which the lists:foldl function does well.

Inspired by Muharem Hrnjadovic's code.

sample xml data ("e.xml")

<Envelope>
<Title>envelope title</Title>
<InnerEnv>
<IDNUM>403276</IDNUM>
<ItemName>Name String</ItemName>
<Pages>0</Pages>
</InnerEnv>
</Envelope>

code

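-module(example2).
-export([go/1]).
-include_lib("xmerl/include/xmerl.hrl").

go(File) ->
    {R, _} = xmerl_scan:file(File),
    io:format("~p~n", [lists:reverse(traverse(R, []))]).

%% Elements: fold over the children, threading the accumulator through.
traverse(R, L) when is_record(R, xmlElement) ->
    lists:foldl(fun traverse/2, L, R#xmlElement.content);
%% Text nodes: the parents field lists the ancestors, nearest first, so
%% its head names the enclosing element and its length gives the depth.
traverse(#xmlText{parents = [{'Title', _}, _], value = V}, L) ->
    [{title, V}|L];
traverse(#xmlText{parents = [{'IDNUM', _}, _, _], value = V}, L) ->
    [{idnum, V}|L];
traverse(#xmlText{parents = [{'ItemName', _}, _, _], value = V}, L) ->
    [{itemname, V}|L];
traverse(#xmlText{parents = [{'Pages', _}, _, _], value = V}, L) ->
    [{pages, V}|L];
%% Anything else (such as whitespace between elements) is skipped.
traverse(_R, L) ->
    L.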


results

2> example2:go("e.xml").
[{title,"envelope title"},
 {idnum,"403276"},
 {itemname,"Name String"},
 {pages,"0"}]
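The same fold can also translate a tree without hard-coding tag names. The following is only a sketch under assumptions: the module name tree_sketch, the {Name, Children} output shape, and the rule of dropping whitespace-only text are illustrative choices, not something taken from the code above.

```erlang
-module(tree_sketch).
-export([simplify/1, demo/0]).
-include_lib("xmerl/include/xmerl.hrl").

%% Turn an #xmlElement{} into {Name, Children}, where each child is
%% either a nested {Name, Children} tuple or a text string.
simplify(#xmlElement{name = Name, content = Content}) ->
    {Name, lists:reverse(lists:foldl(fun collect/2, [], Content))}.

collect(#xmlElement{} = E, Acc) ->
    [simplify(E) | Acc];
collect(#xmlText{value = V}, Acc) ->
    %% Drop text made only of whitespace (indentation between elements).
    case lists:all(fun(C) -> lists:member(C, " \t\r\n") end, V) of
        true  -> Acc;
        false -> [V | Acc]
    end;
collect(_Other, Acc) ->
    %% Comments, processing instructions, and so on are ignored.
    Acc.

demo() ->
    {Root, _Rest} = xmerl_scan:string(
                      "<Envelope>\n<Title>envelope title</Title>\n"
                      "<InnerEnv><Pages>0</Pages></InnerEnv>\n</Envelope>"),
    simplify(Root).
```

Unlike example2, this keeps the nesting of the source document, which may or may not be what the target scheme needs.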


example-3 : SAX type parsing operation


If computing resources are limited, the whole XML document cannot be processed at once. A SAX type parser avoids this by invoking callback functions on each item as soon as it has been parsed.

erlsom_sax is a well-known SAX type parser, and Willem de Jong posted code based on it.
See http://blog.tornkvist.org/blog.yaws?id=1193209275268448

Here is code based on the xmerl_sax_parser library, inspired by the code above.

sample xml data ("e.xml")



<Envelope>
<Title>envelope title</Title>
<InnerEnv>
<IDNUM>403276</IDNUM>
<ItemName>Name String</ItemName>
<Pages>0</Pages>
</InnerEnv>
</Envelope>


code

-module(example3).
-export([go/1]).

go(File) ->
    Options = [{event_fun, fun eventfun/3},
               {event_state, {[], []}}],
    case xmerl_sax_parser:file(File, Options) of
        %% Success is reported as {ok, EventState, Rest}.
        {ok, {_Stack, Acc}, _Rest} -> lists:reverse(Acc);
        Other -> Other
    end.

%% State is {Stack, Acc}: Stack holds the currently open tags (innermost
%% first), Acc collects {Tag, Text} pairs in reverse document order.
eventfun({ignorableWhitespace, _}, _Location, State) ->
    State;
eventfun({startElement, _, Tag, _, _}, _Location, {Stack, Acc}) ->
    {[Tag|Stack], Acc};
eventfun({characters, Value}, _Location, {[Tag|_] = Stack, Acc}) ->
    {Stack, [{Tag, Value}|Acc]};
eventfun({endElement, _, _, _}, _Location, {[_|L], Acc}) ->
    {L, Acc};
eventfun(_, _, State) ->
    State.



results


6> example3:go("e.xml").
[{"Title","envelope title"},
 {"IDNUM","403276"},
 {"ItemName","Name String"},
 {"Pages","0"}]
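Because the event state can be any term, a SAX callback does not have to collect data at all. As a minimal sketch of that idea, the module below (sax_count_sketch, a hypothetical name) only counts start tags, so memory use stays constant regardless of document size. It uses xmerl_sax_parser:stream/2 on a binary so that it needs no file.

```erlang
-module(sax_count_sketch).
-export([count_elements/1, demo/0]).

%% Count the elements in an XML binary without building a tree.
count_elements(Xml) when is_binary(Xml) ->
    EventFun = fun({startElement, _Uri, _Local, _QName, _Attrs}, _Loc, N) ->
                       N + 1;
                  (_Event, _Loc, N) ->
                       N
               end,
    Options = [{event_fun, EventFun}, {event_state, 0}],
    case xmerl_sax_parser:stream(Xml, Options) of
        {ok, Count, _Rest} -> {ok, Count};
        Error -> {error, Error}
    end.

demo() ->
    count_elements(<<"<Envelope><Title>t</Title>"
                     "<InnerEnv><IDNUM>1</IDNUM></InnerEnv></Envelope>">>).
```

Here the state is a plain integer; example3 uses a {Stack, Acc} tuple in the same slot.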