Scala XML Data Processing: Efficient Spark Pipelines
As data scientists and machine learning engineers, we don’t always appreciate that most of the data we get are usually in CSV or at times in JSON file format. In reality, this is great, as we need to deal with large volumes of data and any format that makes it easy to read and understand data should be highly appreciated. And people who work with CSV know how great this is as a data format. In this blog we will be focusing on Scala XML data processingHaving said that, it might not always be the case. If you are a Scala developer (a JVM language), you are likely to work in a Java environment. And since XML has been the preferred format for data interchange, you are most likely to receive data in an XML format. Which means, you will need to parse the data from XML files and build data pipelines out of it.XML, which stands for Extensible Markup Language, was thought of as a way in which both computers and humans should be able to understand the text. Of course, the designers got their inspiration from the hugely successful HTML. You might argue that no one actually reads HTML and we only see the final output thrown by the browsers. Well, may be, it was assumed that XML would be read only by developers and hence it should work. But then we moved to Service Oriented Architecture (SOA) where XML has become the standard data-format for communication between services. In this post we will see how we can parse XML in spark-Scala.<?xml version="1.0"?><!DOCTYPE PARTS SYSTEM "parts.dtd"><?xml-stylesheet type="text/css" href="xmlpartsstyle.css"?><PARTS> <TITLE>Computer Parts</TITLE> <PART> <ITEM>Motherboard</ITEM> <MANUFACTURER>ASUS</MANUFACTURER> <MODEL>P3B-F</MODEL> <COST> 123.00</COST> </PART> <PART> <ITEM>Video Card</ITEM> <MANUFACTURER>ATI</MANUFACTURER> <MODEL>All-in-Wonder Pro</MODEL> <COST> 160.00</COST> </PART> <PART> <ITEM>Sound Card</ITEM> <MANUFACTURER>Creative Labs</MANUFACTURER> <MODEL>Sound Blaster Live</MODEL> <COST> 80.00</COST> </PART> <PART> <ITEMᡋ inch Monitor</ITEM> <MANUFACTURER>LG Electronics</MANUFACTURER> <MODEL> 995E</MODEL> <COST> 290.00</COST> </PART></PARTS>Table 1 A simple XML file (1)Interestingly it’s quite easy to parse and create XML pipelines in Scala. To load an XML file, you will need to pass the filename to the loadFile utility from XML. Please note that parsing the whole file requires a lot of processing power and therefore, chances are that you may run into ‘OutOfMemoryError’ as shown in table 2.scala> import scala.xml.XMLimport scala.xml.XMLscala> val xml = XML.loadFile("data/Posts.xml")java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:3332)...scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:345) at .$print$lzycompute(<console>:10) at .$print(<console>:6) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)Table 2 Scala code first runIn case that happens, you will need to boost the memory for the spark driver. I am allocating a random high number here (as shown in table 3).spark-shell --driver-memory 6GTable 3 Increased driver memoryNow I can parse the XML with ease (refer to table 4).scala> import scala.xml.XMLimport scala.xml.XMLscala> val xml = XML.loadFile("data/Posts.xml")xml: scala.xml.Elem =<posts> <row FavoriteCount="1" CommentCount="1" AnswerCount="1" Tags="<job-search><visa><japan>" Title="What kind of Visa is required to work in Academia in Japan?" LastActivityDate="2013-10-30T09:14:11.633" LastEditDate="2013-10-30T09:14:11.633" LastEditorUserId="2700" OwnerUserId="5" Body="<p>As from title. What kind of visa class do I have to apply for, in order to work as an academic in Japan ? </p>" ViewCount="415" Score="16" CreationDate="2012-02-14T20:23:40.127" AcceptedAnswerId="180" PostTypeId="1" Id="1"/> <row ClosedDate="2015-03-29T20:06:49.947" CommentCount="2" AnswerCount="2" Tags="<phd><job-search><online-resource><chemistry>" Title="As a computational chemist, which online resources are avail...Table 4 Read XMLI have not talked about the file data that I am using till now. The xml dataset is a stackoverflow dataset downloaded from archive.org. It contains data in this format.<?xml version="1.0" encoding="utf-8"?><posts> <row Id="1" PostTypeId="1" AcceptedAnswerId="180" CreationDate="2012-02-14T20:23:40.127" Score="16" ViewCount="415" Body="<p>As from title. What kind of visa class do I have to apply for, in order to work as an academic in Japan ? </p>
" OwnerUserId="5" LastEditorUserId="2700" LastEditDate="2013-10-30T09:14:11.633" LastActivityDate="2013-10-30T09:14:11.633" Title="What kind of Visa is required to work in Academia in Japan?" Tags="<job-search><visa><japan>" AnswerCount="1" CommentCount="1" FavoriteCount="1" /> <row Id="2" PostTypeId="1" AcceptedAnswerId="246" CreationDate="2012-02-14T20:26:22.683" Score="11" ViewCount="725" Body="<p>Which online resources are available for job search at the Ph.D. level in the computational chemistry field?</p>
" OwnerUserId="5" LastEditorUserId="15723" LastEditDate="2014-09-18T13:02:01.180" LastActivityDate="2014-09-18T13:02:01.180" Title="As a computational chemist, which online resources are available for Ph.D. level jobs?" Tags="<phd><job-search><online-resource><chemistry>" AnswerCount="2" CommentCount="2" ClosedDate="2015-03-29T20:06:49.947" />Table 5 Our target datasetEach record is allocated a row tag, that combines multiple attributes. We can now parse these individual tags and get the value of the attributes. To parse the records, you will need to search an XML tree for the required data, using XPath expressions. The way it works is that you need to pass \ and \\ methods for the equivalent XPath / and // expressions.For example, you can get the ‘row’ tags and then on each record get the ‘body’ attribute. This gives us a sequence of scala.Option.scala> val texts = (xml \ "row").map{_.attribute("Body")}texts: scala.collection.immutable.Seq[Option[Seq[scala.xml.Node]]] =List(Some(<p>As from title. What kind of visa class do I have to apply for, in order to work as an academic in Japan ? </p>), Some(<p>Which online resources are available for job search at the Ph.D. level in the computational chemistry field?</p>), Some(<p>As from title. Not all journals provide the impact factor on their homepage. For those who don't where can I find their impact factor ?</p>), Some(<p>I have seen many engineering departments want professional engineer registration. Why do they care? </p>), Some(<p>What is the h-index, and how does it work ?</p>), Some(<p>If your institution has a subscription to Journal Citation Reports (JCR), you...Table 6 Getting the appropriate textNow that you have an iterator you can run complex transformations on top of it.Below (in table 7) we are converting texts to string, trimming them for extra whitespace, then filtering out the text with some string, and converting them to lowercase.scala> val lower_texts = texts map {_.toString} map { _.trim } filter { _.length != 0 } map { _.toLowerCase }lower_texts: scala.collection.immutable.Seq[String] =List(some(<p>as from title. what kind of visa class do i have to apply for, in order to work as an academic in japan ? </p>), some(<p>which online resources are available for job search at the ph.d. level in the computational chemistry field?</p>), some(<p>as from title. not all journals provide the impact factor on their homepage. for those who don't where can i find their impact factor ?</p>), some(<p>i have seen many engineering departments want professional engineer registration. why do they care? </p>), some(<p>what is the h-index, and how does it work ?</p>), some(<p>if your institution has a subscription to journal citation reports (jcr), you can check it t...Table 7 Spark transformationScala offers a convenient and easy way for basic XML processing. This post is aimed at helping beginners use XML and Scala with ease. If you found this useful, do leave a comment, we would love to hear from you and share the post with your friends and colleagues.
Learn More >