Scala XML Data Processing: Efficient Spark Pipelines


As data scientists and machine learning engineers, we don’t always appreciate that most of the data we receive arrives as CSV or, at times, JSON. That is a real convenience: we deal with large volumes of data, and any format that is easy to read and understand deserves credit. Anyone who works with CSV knows how practical it is as a data format. In this blog we will focus on XML data processing in Scala.

That convenience is not guaranteed, though. If you develop in Scala, a JVM language, you are likely to work in a Java environment. And since XML has long been a preferred format for data interchange, you will often receive data as XML. That means you will need to parse XML files and build data pipelines on top of them.

XML, which stands for Extensible Markup Language, was designed so that both computers and humans could read the same text. The designers took their inspiration from the hugely successful HTML. You might argue that no one actually reads HTML and that we only see the final output rendered by the browser. Perhaps it was assumed that only developers would read XML, and that this would be good enough. Then came Service-Oriented Architecture (SOA), where XML became the standard data format for communication between services. In this post we will see how to parse XML with Scala from the Spark shell.

<?xml version="1.0"?>
<!DOCTYPE PARTS SYSTEM "parts.dtd">
<?xml-stylesheet type="text/css" href="xmlpartsstyle.css"?>
<PARTS>
   <TITLE>Computer Parts</TITLE>
   <PART>
      <ITEM>Motherboard</ITEM>
      <MANUFACTURER>ASUS</MANUFACTURER>
      <MODEL>P3B-F</MODEL>
      <COST> 123.00</COST>
   </PART>
   <PART>
      <ITEM>Video Card</ITEM>
      <MANUFACTURER>ATI</MANUFACTURER>
      <MODEL>All-in-Wonder Pro</MODEL>
      <COST> 160.00</COST>
   </PART>
   <PART>
      <ITEM>Sound Card</ITEM>
      <MANUFACTURER>Creative Labs</MANUFACTURER>
      <MODEL>Sound Blaster Live</MODEL>
      <COST> 80.00</COST>
   </PART>
   <PART>
      <ITEM>17 inch Monitor</ITEM>
      <MANUFACTURER>LG Electronics</MANUFACTURER>
      <MODEL> 995E</MODEL>
      <COST> 290.00</COST>
   </PART>
</PARTS>

Table 1 A simple XML file (1)

Interestingly, it is quite easy to parse XML and build pipelines on it in Scala. To load an XML file, pass the filename to the loadFile method of scala.xml.XML. Note that loadFile parses the whole file and builds the complete document tree in memory, so on a large file you may run into an ‘OutOfMemoryError’, as shown in table 2.

scala> import scala.xml.XML
import scala.xml.XML
scala> val xml = XML.loadFile("data/Posts.xml")
java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOf(Arrays.java:3332)
...
scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:345)
  at .$print$lzycompute(<console>:10)
  at .$print(<console>:6)
  at $print(<console>)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Table 2 Scala code first run

If that happens, you will need to increase the memory available to the Spark driver. I am allocating an arbitrary, generous amount here (as shown in table 3).

spark-shell --driver-memory 6G

Table 3 Increased driver memory
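
Raising the driver memory is the approach this post takes. As an alternative that the post itself does not use, a streaming parser reads the file one event at a time and never builds the whole tree. The sketch below relies on the JDK's StAX API (javax.xml.stream) rather than scala.xml; the object name StreamingRows and the helper attributeValues are hypothetical names chosen for illustration.

```scala
import java.io.{Reader, StringReader}
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.collection.mutable.ListBuffer

object StreamingRows {
  /** Collects the value of `attr` from every <row> element,
    * pulling parser events one at a time instead of loading a tree. */
  def attributeValues(reader: Reader, attr: String): List[String] = {
    val xmlReader = XMLInputFactory.newInstance().createXMLStreamReader(reader)
    val values = ListBuffer.empty[String]
    try {
      while (xmlReader.hasNext) {
        val event = xmlReader.next()
        if (event == XMLStreamConstants.START_ELEMENT && xmlReader.getLocalName == "row") {
          // getAttributeValue returns null when the attribute is absent,
          // so wrap it in Option before collecting.
          Option(xmlReader.getAttributeValue(null, attr)).foreach(v => values += v)
        }
      }
    } finally xmlReader.close()
    values.toList
  }

  // A tiny in-memory document standing in for data/Posts.xml.
  val sample: String =
    """<posts><row Id="1" Body="a"/><row Id="2"/><row Id="3" Body="b"/></posts>"""

  val sampleBodies: List[String] = attributeValues(new StringReader(sample), "Body")
}
```

For the real file, a reader over data/Posts.xml could be passed in place of the StringReader. Memory use then depends on the values you choose to keep rather than on the size of the document tree.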

Now I can parse the XML with ease (refer to table 4).

scala> import scala.xml.XML
import scala.xml.XML
scala> val xml = XML.loadFile("data/Posts.xml")
xml: scala.xml.Elem =
<posts>
  <row FavoriteCount="1" CommentCount="1" AnswerCount="1" Tags="&lt;job-search&gt;&lt;visa&gt;&lt;japan&gt;" Title="What kind of Visa is required to work in Academia in Japan?" LastActivityDate="2013-10-30T09:14:11.633" LastEditDate="2013-10-30T09:14:11.633" LastEditorUserId="2700" OwnerUserId="5" Body="&lt;p&gt;As from title. What kind of visa class do I have to apply for, in order to work as an academic in Japan ? &lt;/p&gt;
" ViewCount="415" Score="16" CreationDate="2012-02-14T20:23:40.127" AcceptedAnswerId="180" PostTypeId="1" Id="1"/>
  <row ClosedDate="2015-03-29T20:06:49.947" CommentCount="2" AnswerCount="2" Tags="&lt;phd&gt;&lt;job-search&gt;&lt;online-resource&gt;&lt;chemistry&gt;" Title="As a computational chemist, which online resources are avail...

Table 4 Read XML

I have not yet described the data I am using. The XML file, Posts.xml, comes from the Stack Exchange data dump (the family of sites that includes Stack Overflow), downloaded from archive.org. It contains data in this format.

<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="180" CreationDate="2012-02-14T20:23:40.127" Score="16" ViewCount="415" Body="&lt;p&gt;As from title. What kind of visa class do I have to apply for, in order to work as an academic in Japan ? &lt;/p&gt;&#xA;" OwnerUserId="5" LastEditorUserId="2700" LastEditDate="2013-10-30T09:14:11.633" LastActivityDate="2013-10-30T09:14:11.633" Title="What kind of Visa is required to work in Academia in Japan?" Tags="&lt;job-search&gt;&lt;visa&gt;&lt;japan&gt;" AnswerCount="1" CommentCount="1" FavoriteCount="1" />
  <row Id="2" PostTypeId="1" AcceptedAnswerId="246" CreationDate="2012-02-14T20:26:22.683" Score="11" ViewCount="725" Body="&lt;p&gt;Which online resources are available for job search at the Ph.D. level in the computational chemistry field?&lt;/p&gt;&#xA;" OwnerUserId="5" LastEditorUserId="15723" LastEditDate="2014-09-18T13:02:01.180" LastActivityDate="2014-09-18T13:02:01.180" Title="As a computational chemist, which online resources are available for Ph.D. level jobs?" Tags="&lt;phd&gt;&lt;job-search&gt;&lt;online-resource&gt;&lt;chemistry&gt;" AnswerCount="2" CommentCount="2" ClosedDate="2015-03-29T20:06:49.947" />

Table 5 Our target dataset

Each record is a row element that carries its data as multiple attributes. We can now select these elements and read the values of their attributes. To do so, you search the XML tree with XPath-like expressions: scala.xml provides the \ and \\ methods as counterparts of the XPath / and // operators. \ selects matching direct children, while \\ also searches deeper levels of the tree.
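
To make the difference between \ and \\ concrete, here is a minimal sketch on a shortened version of the parts file from table 1. It assumes the scala-xml library is on the classpath, as it is in the Spark shell; the object name XPathLikeSelectors is hypothetical.

```scala
import scala.xml.XML

object XPathLikeSelectors {
  val parts = XML.loadString(
    """<PARTS>
      |  <TITLE>Computer Parts</TITLE>
      |  <PART><ITEM>Motherboard</ITEM><COST> 123.00</COST></PART>
      |  <PART><ITEM>Video Card</ITEM><COST> 160.00</COST></PART>
      |</PARTS>""".stripMargin)

  // `\` looks only at direct children of the node it is called on.
  val directParts = parts \ "PART" // the two PART elements
  val directItems = parts \ "ITEM" // empty: ITEM is a grandchild of PARTS

  // `\\` searches the node and all of its descendants, at any depth.
  val allItems: List[String] = (parts \\ "ITEM").map(_.text).toList

  // Element text comes back as a String, so numeric fields need
  // trimming and conversion before they can be used in arithmetic.
  val totalCost: Double = (parts \\ "COST").map(_.text.trim.toDouble).sum
}
```

Selections can also be chained, so parts \ "PART" \ "ITEM" reaches the same ITEM elements one level at a time.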

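For example, you can select the ‘row’ elements and then read the ‘Body’ attribute of each record. Attribute names are case-sensitive, and because an attribute may be missing, the attribute method returns a scala.Option. The result is therefore a sequence of Option values.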

scala> val texts = (xml \ "row").map{_.attribute("Body")}
texts: scala.collection.immutable.Seq[Option[Seq[scala.xml.Node]]] =
List(Some(&lt;p&gt;As from title. What kind of visa class do I have to apply for, in order to work as an academic in Japan ? &lt;/p&gt;
), Some(&lt;p&gt;Which online resources are available for job search at the Ph.D. level in the computational chemistry field?&lt;/p&gt;
), Some(&lt;p&gt;As from title. Not all journals provide the impact factor on their homepage. For those who don't where can I find their impact factor ?&lt;/p&gt;
), Some(&lt;p&gt;I have seen many engineering departments want professional engineer registration. Why do they care? &lt;/p&gt;
), Some(&lt;p&gt;What is the h-index, and how does it work ?&lt;/p&gt;
), Some(&lt;p&gt;If your institution has a subscription to Journal Citation Reports (JCR), you...

Table 6 Getting the appropriate text
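
The Option wrappers in table 6 can be unwrapped before any further processing. The following is a minimal sketch on a small inline document rather than the real Posts.xml; the object name BodyTexts and the sample rows are made up for illustration.

```scala
import scala.xml.XML

object BodyTexts {
  val posts = XML.loadString(
    """<posts>
      |  <row Id="1" Body="&lt;p&gt;First question&lt;/p&gt;&#xA;"/>
      |  <row Id="2"/>
      |  <row Id="3" Body="&lt;p&gt;Second question&lt;/p&gt;&#xA;"/>
      |</posts>""".stripMargin)

  // attribute(...) returns Option[Seq[Node]]. flatMap drops the rows that
  // have no Body, and Node.text yields the unescaped attribute value.
  val bodies: List[String] =
    (posts \ "row")
      .flatMap(_.attribute("Body"))
      .map(nodes => nodes.map(_.text).mkString.trim)
      .toList
}
```

Here bodies holds plain strings such as <p>First question</p>, with the XML escapes decoded and the row without a Body attribute skipped.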

Now that you have a sequence, you can run further transformations on top of it.

Below (in table 7) we convert each element to a string, trim extra whitespace, filter out empty strings, and convert the result to lowercase. Note that calling toString on an Option keeps the Some(...) wrapper, which is why every entry in the output starts with some(.

scala> val lower_texts = texts map {_.toString} map { _.trim } filter { _.length != 0 } map { _.toLowerCase }
lower_texts: scala.collection.immutable.Seq[String] =
List(some(&lt;p&gt;as from title. what kind of visa class do i have to apply for, in order to work as an academic in japan ? &lt;/p&gt;
), some(&lt;p&gt;which online resources are available for job search at the ph.d. level in the computational chemistry field?&lt;/p&gt;
), some(&lt;p&gt;as from title. not all journals provide the impact factor on their homepage. for those who don't where can i find their impact factor ?&lt;/p&gt;
), some(&lt;p&gt;i have seen many engineering departments want professional engineer registration. why do they care? &lt;/p&gt;
), some(&lt;p&gt;what is the h-index, and how does it work ?&lt;/p&gt;
), some(&lt;p&gt;if your institution has a subscription to journal citation reports (jcr), you can check it t...

Table 7 Spark transformation
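
The output in table 7 still carries the some( prefix because toString is applied to the Option itself. A variant that unwraps the values first might look like the sketch below. It uses only the Scala standard library and takes already-extracted strings as input; the object name NormalizeTexts and the helper normalize are hypothetical.

```scala
object NormalizeTexts {
  /** Keeps the values that are present, trims them, drops blank
    * entries, and lowercases what remains. */
  def normalize(texts: Seq[Option[String]]): Seq[String] =
    texts.flatten
      .map(_.trim)
      .filter(_.nonEmpty)
      .map(_.toLowerCase)

  // Made-up input standing in for the extracted Body values.
  val example: Seq[String] =
    normalize(Seq(Some("  What is the h-index? \n"), None, Some("   "), Some("<p>ABC</p>")))
}
```

Unwrapping with flatten before trimming also means that missing attributes are dropped outright, instead of surviving as the literal string "none".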

Scala offers a convenient and easy way to do basic XML processing. This post is aimed at helping beginners use XML and Scala with ease. If you found it useful, do leave a comment and share the post with your friends and colleagues; we would love to hear from you.
