Friday, April 4, 2014

Extracting Data from XML files using XSL and XPath...5

Accessing another XML file outside of current file

If we need to access a second XML file and get values from it, that is possible in XSLT.
A variable declaration that uses the document() function can access the ‘server’ node of the 'lt00tm400000001_life_Life_Repository_TestSuites.xml' file and store it in the ‘TCVersion’ variable. From then on, we can access any node in that file through this variable.
Note that ‘//’ can be used to get directly to testcase_version, as in $TCVersion//testcase_version.
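
A minimal sketch of this technique, assuming the second file sits next to the stylesheet and has ‘server’ as its root element (both are assumptions; the output layout is illustrative only):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Load the second file. A relative path is resolved against the
       location of the stylesheet. Assumption: 'server' is the root element. -->
  <xsl:variable name="TCVersion"
      select="document('lt00tm400000001_life_Life_Repository_TestSuites.xml')/server"/>

  <xsl:template match="/">
    <!-- '//' searches all descendants of the stored 'server' node -->
    <xsl:for-each select="$TCVersion//testcase_version">
      <xsl:value-of select="."/>
      <xsl:text>&#xa;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Declaring the variable at the top level makes it visible in every template of the stylesheet.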

Tackling the comma ‘,’

Sometimes values can contain commas. In that case, for our convenience, we can use a construct that detects a comma (‘,’) in a value and replaces it with a space (‘ ’) or any other separator (such as ‘;’ or ‘|’).

One option is the xsl:choose-when-otherwise construct. If the testcase_name contains a comma, it will be replaced by a space; ‘otherwise’ we take the testcase_name as is.
This relies on the ‘contains’ and ‘translate’ functions. ‘contains’ detects the presence of a comma in ‘testcase_name’ and returns true or false. If true, the translate function replaces the comma with a space; else ‘testcase_name’ is selected as is. The whole construct can be shortened to a single call: translate(testcase_name, ',', ' ').
Just by using this, we can accomplish what we want. In case a testcase_name contains a ‘,’, it gets replaced by a ‘ ’. This statement takes care of every ‘,’ appearing in testcase_name, and it leaves a value without commas unchanged.
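
Both forms can be sketched side by side as follows. This is an illustration only: it assumes testcase_name is a child of a testcase element, and it writes the two results as two columns so they can be compared.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Assumption: testcase_name is a child of testcase -->
    <xsl:for-each select="//testcase">
      <!-- Long form: test for a comma first, then replace it -->
      <xsl:choose>
        <xsl:when test="contains(testcase_name, ',')">
          <xsl:value-of select="translate(testcase_name, ',', ' ')"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="testcase_name"/>
        </xsl:otherwise>
      </xsl:choose>
      <xsl:text>,</xsl:text>
      <!-- Short form: translate() alone gives the same result -->
      <xsl:value-of select="translate(testcase_name, ',', ' ')"/>
      <xsl:text>&#xa;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

To use another separator, change the third argument of translate(), for example ';' or '|'.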

More on Substring

There is another function that helps us get a part of a string. This was particularly helpful when the defect_id in the XML file was something like ‘XXXXX_YYYYYY_12345’. Out of such a string the only useful part is ‘12345’. This can be accomplished with two nested calls, substring-after(substring-after(defect_id, '_'), '_'), stored in a variable named ‘Defect’.
In the ‘Defect’ variable, ‘12345’ will be stored. Note that the substring-after function is used twice to remove the two underscores. The inner call is executed first, so we get ‘YYYYYY_12345’. The outer call then removes ‘YYYYYY_’, leaving only ‘12345’ in ‘Defect’.
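
A minimal sketch of the nested call, assuming defect_id is a child of a defect element (the surrounding for-each and the output layout are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Assumption: defect_id is a child of defect -->
    <xsl:for-each select="//defect">
      <!-- Inner call drops everything up to the first '_',
           outer call drops everything up to the second '_' -->
      <xsl:variable name="Defect"
          select="substring-after(substring-after(defect_id, '_'), '_')"/>
      <xsl:value-of select="$Defect"/>
      <xsl:text>&#xa;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

If a defect_id contains fewer than two underscores, substring-after returns an empty string, so ‘Defect’ will be empty for that record.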

Accessing another level (getting into more depth)

Just take note of the structure of the XML again.
<datasource>
  <!-- datasource details -->
  <repository>
    <!-- repository details -->
    <testplan>
      <!-- ‘testplan’ details (custom attributes at test plan level) -->
      <build>
        <!-- build details -->
        <platform>
          <!-- platform details -->
          <testcase>
            <!-- test case level custom attributes -->
          </testcase>
        </platform>
      </build>
    </testplan>
  </repository>
</datasource>

To access the parent tags, and the tags within them, we can use ‘../’. How is this useful? Consider a scenario where we need all the test cases from the file, with details such as datasource_description, repository, testplan, build, platform, testcase execution status and so on.
Just dive down to the level of a testcase and get its details there. Then get the other values using ‘../’ as many times as the parenthood of the tags requires.
In such code ‘../’ appears a number of times, to access the various fields that are not at the level of ‘testcase’.
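
A sketch of this traversal for the structure shown above. Apart from datasource_description and testcase_name, the field names (repository_name, testplan_name, build_name, platform_name, execution_status) are hypothetical placeholders for whatever the real file contains.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:for-each select="//testcase">
      <!-- Five levels up: platform, build, testplan, repository, datasource -->
      <xsl:value-of select="../../../../../datasource_description"/>
      <xsl:text>,</xsl:text>
      <!-- Four levels up: repository (field name is hypothetical) -->
      <xsl:value-of select="../../../../repository_name"/>
      <xsl:text>,</xsl:text>
      <!-- Three levels up: testplan (field name is hypothetical) -->
      <xsl:value-of select="../../../testplan_name"/>
      <xsl:text>,</xsl:text>
      <!-- Two levels up: build (field name is hypothetical) -->
      <xsl:value-of select="../../build_name"/>
      <xsl:text>,</xsl:text>
      <!-- One level up: platform (field name is hypothetical) -->
      <xsl:value-of select="../platform_name"/>
      <xsl:text>,</xsl:text>
      <!-- Fields of the testcase itself (execution_status is hypothetical) -->
      <xsl:value-of select="testcase_name"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="execution_status"/>
      <xsl:text>&#xa;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```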

Moving to next line (line feed)

While creating a CSV file, the control needs to move to the next line for the next record. This can be accomplished by emitting a line feed with xsl:text right after the last value of the record.

For example, if $Defect holds the last value of the record, the line break goes immediately after it.
<xsl:text>&#xa;</xsl:text> inserts a line break so that for the next ‘for-each’ iteration the entries start populating on the next row.
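
A minimal sketch of one record per line, assuming defect_id is a child of defect and is the last column of the record:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Text output, so no XML declaration or markup ends up in the CSV -->
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:for-each select="//defect">
      <!-- Assumption: defect_id is the last field of the record -->
      <xsl:variable name="Defect" select="defect_id"/>
      <xsl:value-of select="$Defect"/>
      <!-- Line feed: the next iteration starts on a new row -->
      <xsl:text>&#xa;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Keeping the line feed inside xsl:text matters, because whitespace-only text directly in a template is stripped from the stylesheet.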

Handling XML with non-regular structure

Sometimes the XML we need to work on is not structured uniformly. By non-uniform we mean that the structure does not follow a single top-to-bottom hierarchy. A hierarchy is present, but there may be more than one path that the tags follow to reach the innermost tag. The XML above had a uniform structure, since every tag had a parent and there was a proper hierarchy from datasource to testcase. A ‘proper’ hierarchy means that there are no parallel paths, unlike the XML discussed next.

Consider an XML file viewed as a collapsible tree. The (-) sign indicates that the repository tag is expanded, and it contains three tags in parallel: platforms, testplans, and executions. They all have a (+) sign before them, indicating that they can be expanded further. On expansion the testplans tag had the following structure:
<testplans>
<testplan>
                       <testplan_id>
                       <testplan_name>
        <builds>
                       <build>
                                      <build_id>
                       <testplan_platforms>
                                      <testplan_testcases>
                                                     <testplan_testcase>
                                                                    <testcase_version_id>

And executions had the structure:
<executions>
        <execution_testplan>
                       <testplan_id>
                       <execution_builds>
                                      <execution_build>
                                                     <build_id>
                                                     <execution_platforms>
                                                                    <execution_platform>
                                                                                   <execution_testcase>
                                                                                                  <testcase_version_id>
                                                                                                  <execution_testcase>
                                                                                                  <defects>
                                                                                                                 <defect>
                                                                                                                                <defect_id>       

As you can see, executions does not have the testplan name, just the testplan id. So, if you are preparing a CSV file that must contain a defect or an execution_testcase and you also need the testplan_name, it can be done in XSLT as follows:
a)    First access the testplan_id in the executions tag and store it in a variable, TP1.

The number of ‘../’ needed depends on the starting point. The code was written for defects, so the ‘for-each’ hit each defect tag.

Hence, to get the testplan_id we needed 7 ‘../’s. The parenthood moves like this:
<defect> → <defects> → <execution_testcase> → <execution_platform> → <execution_platforms> → <execution_build> → <execution_builds> → <execution_testplan>
Just count the number of → here. That many ‘../’s are required.
b)    Next, move out of executions completely to reach the repository or its parent tag.
Then reach either ‘testplan_testcase’ or ‘testplan’ directly and retrieve the ‘testplan_id’ and ‘testplan_name’ if there is a match with TP1. At the ‘testplan_testcase’ level, we have to move 2 levels up to get the corresponding ‘testplan_id’. All this happens within the ‘testplans’ tag.

Here ‘//’ is used to jump to ‘testplan_testcase’, and its ‘testplan_id’ is then compared with TP1. We needed only the first occurrence here, so [1] was used to get the first match and continue.

Within the inner ‘for-each’ we take the ‘testplan_id’ and ‘testplan_name’ from the ‘testplans’ tag. Then we come out of it to get ‘build_id’ and ‘platform_id’ for the ‘defect’. Note the use of ‘../’ several times to get the respective values. Again, the number of ‘../’ depends on the arrows → (see the explanation above).
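
Steps a) and b) can be sketched together as below. This is a reconstruction under assumptions, not the original code: the ‘../’ counts follow the executions listing and the description above, platform_id is assumed to be a child of execution_platform, and the CSV column order is illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:for-each select="//defect">
      <!-- a) Seven '../' steps lead from defect up to execution_testplan -->
      <xsl:variable name="TP1" select="../../../../../../../testplan_id"/>

      <!-- b) '//' jumps out of executions. The parentheses make [1] pick
           the first match in the whole document, not one per parent. -->
      <xsl:for-each select="(//testplan_testcase[../../testplan_id = $TP1])[1]">
        <!-- Two levels up from testplan_testcase, as described in the text -->
        <xsl:value-of select="../../testplan_id"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="../../testplan_name"/>
        <xsl:text>,</xsl:text>
      </xsl:for-each>

      <!-- Back in the context of the defect: five steps up to execution_build -->
      <xsl:value-of select="../../../../../build_id"/>
      <xsl:text>,</xsl:text>
      <!-- Three steps up to execution_platform (platform_id location is assumed) -->
      <xsl:value-of select="../../../platform_id"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="defect_id"/>
      <xsl:text>&#xa;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The variable TP1 is needed because inside the inner ‘for-each’ the context node changes to ‘testplan_testcase’, so the defect's own ancestors are no longer reachable with ‘../’.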

Handling a big XML file

Sometimes an XML file becomes too big to handle. In that case XMLStarlet will not throw an error; it will just hang without doing anything. Sometimes, at the DOS prompt, it displays ‘Out of Memory’. That is one of the problems that XSLT 2 does not tackle. The XMLStarlet interpreter loads the whole XML file into memory before processing it, and when the file is too big, perhaps because it is accessed several times in the code, it hangs. (This might also happen when the number of ‘../’ that we give is wrong, or when we don’t comply with the hierarchy or parenthood of tags, which is why understanding the structure of the XML is extremely important.) One way to handle such a case is to make the XML file smaller, in other words, to create an intermediate file containing only what we really need. A small stylesheet that copies everything except the unneeded tags does the job.



All the tags matched by the xsl:template tag (summary, preconditions, steps, custom_fields) were removed in the new XML file. The remaining XML contained all the test cases that we needed. Using this code the file size was reduced from 95MB to 14MB, and we kept what we most wanted: the testcase names, testcase ids and testcase version ids.
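
A sketch of such a filtering stylesheet, using the standard identity-transform pattern plus one empty template for the four tags named above. It is one common way to get this result and may differ from the original code.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no"/>

  <!-- Identity template: copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Empty template: these elements and their content are dropped -->
  <xsl:template match="summary|preconditions|steps|custom_fields"/>
</xsl:stylesheet>
```

With XMLStarlet this could be run as `xml tr strip.xsl big.xml > small.xml`, where strip.xsl, big.xml and small.xml are placeholder file names.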
The other way is to use an XSLT 3 enabled processor such as Saxon PE 9.5. XSLT 3 enables streaming of XML documents, so that the processor does not load the complete document into memory but processes it as it is read.

Besides this, usage of XML Splitter is recommended to break an XML file into multiple parts so that smaller parts can be handled separately.
