Easily extract data from XML using grep and awk

Let’s say you have a scheduled job that runs to import data from an XML file full of records that are passed in from a vendor. Now let’s say that one of these XML files might have 500 records, 1,000 records, even 10,000 records. Now your boss, for some reason, asks you to pull the IDs for each record from that file so that it can be compared against what was actually logged for the day’s import.

Sound familiar?

This is a common type of request and thankfully it is no problem with use of grep, awk, and piping the commands together in Unix.

For the sake of example let’s say that each record is in a format like the following.

<FakeRecord>
<RecordId>12345</RecordId>
<ReqNumber>A1057</ReqNumber>
<SubCode>8474</SubCode>
<MasterCell>A14</MasterCell>
<Serial>019284747</Serial>
</FakeRecord>

What you are looking for is a list of the Record IDs. Running grep can give you a start, to pull just those lines with RecordId present:

> grep 'RecordId' myXmlFile.xml
<RecordId>12345</RecordId>
<RecordId>12346</RecordId>
<RecordId>12347</RecordId>
...
...

So how do you get rid of the “<RecordId>” and “</RecordId>” pieces? You could direct it into a file, then take that file and open it in some text editor, and do a find/replace where you replace them with nothing. But why bother with all that when awk can do it right on the command line:

> grep 'RecordId' myXmlFile.xml | awk -F">" '{print $2}'
12345</RecordId>
12346</RecordId>
12347</RecordId>
...
...

What this has done is pipe the grep results into an awk command. This awk command sets “>” as a marker around which it breaks the line apart. The ‘{print $2}’ piece prints the second block. In this case, it is printing everything after the “>” so we get almost what we want, but not quite.

Now we can repeat the same idea again, and pipe all of this into another awk command which will get rid of that final “</RecordId>”.

> grep 'RecordId' myXmlFile.xml | awk -F">" '{print $2}' | awk -F"<" '{print $1}'
12345
12346
12347
...
...

Now we’ve set the final awk that we pipe the results into to print what comes before the “<” which in this case is the ID numbers.

Bingo. Now you just direct that all into a file for your boss.

> grep 'RecordId' myXmlFile.xml | awk -F">" '{print $2}' | awk -F"<" '{print $1}' > HereYouGoBoss.txt

There are some ways to achieve this same result with a single awk command, but I find this method of piping to be more intuitive, taking the entire line and whittling it down in a couple passes. And since it is a fairly simple task in this case, there’s no need to script it into a single awk. Finally, this is a method even a complete awk beginner can use and understand.

Hope that it helps you the next time you have a request like this from your boss, or whoever it might be.

2 thoughts on “Easily extract data from XML using grep and awk

    • Scott Shipp

      I’m not sure what exactly you’re asking but I’m guessing you’re trying to land on a particular tag? Does your file have multiple tags on one line? If that’s the case, when you run awk -F “>” ‘{print $2}’ try changing the “2” value to get at different segments of the line.

      Let’s say you have lines in your file like this:

      <tag1>blabla</tag1><tag2>Hrm hrm</tag2>

      awk -F “<” is kind of like Java’s String.split(“<“) command. The $1, $2 etc are kind of like calling for array values.

      So if you run this you’ll get part of the string containing the “hrm hrm” value:

      cat my.xml | awk -F “>” ‘{print $4}’

      Then you’d have to pipe additional commands on to whittle it down.

      Hope it helps

Leave a Reply

Your email address will not be published. Required fields are marked *