|
| 1 | +# XPath Based Record Filtering and Data Extraction |
| 2 | + |
| 3 | +The foundation of omniparser transform operations is anchored on [IDR](./idr.md) and XPath based record |
| 4 | +filtering and data extraction. It's vital to understand each supported file format's IDR structure to |
| 5 | +effectively and efficiently craft XPath queries in `transform_declarations` to achieve the desired transform |
| 6 | +objectives. |
| 7 | + |
| 8 | +## Record Filtering |
| 9 | + |
| 10 | +Often some ingested records are not suitable to be transformed into output. Omniparser, more |
| 11 | +specifically the latest version (`"omni.2.1"`) handler, allows record-level filtering using an XPath |
| 12 | +query. Let's see one example in CSV: |
| 13 | + |
| 14 | +``` |
| 15 | +ORDER_ID,CUSTOMER_ID,COUNTRY |
| 16 | +1234,CUST_1,US |
| 17 | +N/A |
| 18 | +1235,CUST_2,AU |
| 19 | +``` |
| 20 | + |
| 21 | +We want omniparser to ingest and transform records with `order_id=1234,1235` and skip the line with |
| 22 | +`'N/A'`. To achieve this, we can insert `xpath` into the root `object` of `FINAL_OUTPUT` in |
| 23 | +`transform_declarations`: |
| 24 | + |
| 25 | +``` |
| 26 | + "transform_declarations": { |
| 27 | + "FINAL_OUTPUT": { "xpath": ".[matches(ORDER_ID, '^[0-9]+$')]", "object": { |
| 28 | + ... |
| 29 | + }} |
| 30 | + } |
| 31 | +``` |
| 32 | + |
| 33 | +Let's take a look at how the transform works for the first data line `1234,CUST_1,US`: |
| 34 | +1. Omniparser reads the first line in and converts it into a [CSV specific IDR tree](./idr.md#csv-aka-delimited): |
| 35 | + ``` |
| 36 | + Node(Type: DocumentNode) |
| 37 | + Node(Type: ElementNode, Data: "ORDER_ID") |
| 38 | + Node(Type: TextNode, Data: "1234") |
| 39 | + Node(Type: ElementNode, Data: "CUSTOMER_ID") |
| 40 | + Node(Type: TextNode, Data: "CUST_1") |
| 41 | + Node(Type: ElementNode, Data: "COUNTRY") |
| 42 | + Node(Type: TextNode, Data: "US") |
| 43 | + ``` |
| 44 | +2. `FINAL_OUTPUT.xpath` is then executed at the root of the IDR tree, and the result is a match! So this |
| 45 | +line/record will be processed. |
| 46 | +
|
| 47 | +Now take a look at the second line `N/A`: |
| 48 | +1. The IDR tree looks like: |
| 49 | + ``` |
| 50 | + Node(Type: DocumentNode) |
| 51 | + Node(Type: ElementNode, Data: "ORDER_ID") |
| 52 | + Node(Type: TextNode, Data: "N/A") |
| 53 | + ``` |
| 54 | +2. `FINAL_OUTPUT.xpath` is executed at the root of the IDR tree, and the result is not a match. This line/record |
| 55 | +will be skipped. |
| 56 | +
|
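The two walk-throughs above can be mimicked with a small Python sketch. It is only an illustration of the filtering decision, not omniparser's implementation: there is no IDR tree here, the helper name `filter_records` is hypothetical, and Python's `re` module stands in for the XPath `matches` function.

```python
import csv
import io
import re

CSV_INPUT = """ORDER_ID,CUSTOMER_ID,COUNTRY
1234,CUST_1,US
N/A
1235,CUST_2,AU
"""

# Same pattern as the schema's matches(ORDER_ID, '^[0-9]+$').
ORDER_ID_PATTERN = re.compile(r"^[0-9]+$")


def filter_records(text):
    """Yield only the rows whose ORDER_ID column is all digits (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    for row in reader:
        # Pair each value with its column name; a short row such as "N/A"
        # only gets the ORDER_ID column, like the second IDR tree above.
        record = dict(zip(header, row))
        if ORDER_ID_PATTERN.search(record.get("ORDER_ID", "")):
            yield record


kept = list(filter_records(CSV_INPUT))
print([r["ORDER_ID"] for r in kept])  # ['1234', '1235']
```

The `N/A` line is dropped because its only column fails the pattern, which mirrors the "not a match" outcome described above.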
| 57 | +Each input format has its own unique IDR structure, so a record filtering XPath needs to take it into consideration |
| 58 | +to be effective. |
| 59 | +
|
| 60 | +Clever use of positive/negative regexp [`matches`](https://github.com/antchfx/xpath#expressions) (slightly |
| 61 | +slower but very powerful), or [`starts-with`, `ends-with`, `contains`](https://github.com/antchfx/xpath#expressions), |
| 62 | +or even direct string comparisons [`=`, `!=`](https://github.com/antchfx/xpath#expressions) in |
| 63 | +`FINAL_OUTPUT.xpath` gives schema writers the freedom of either processing certain lines/records, or skipping |
| 64 | +certain lines/records. |
| 65 | +
|
| 66 | +If `FINAL_OUTPUT` doesn't have `xpath`, which is fairly common, then there is no record filtering, meaning |
| 67 | +all records ingested by omniparser file format specific readers will be processed and transformed. |
| 68 | +
|
| 69 | +## Data Extraction |
| 70 | +
|
| 71 | +The most common use of `xpath` is for data extraction. Consider again the sample CSV and schema in |
| 72 | +[Record Filtering](#record-filtering), let's amend the schema to: |
| 73 | +``` |
| 74 | + "transform_declarations": { |
| 75 | + "FINAL_OUTPUT": { "xpath": ".[matches(ORDER_ID, '^[0-9]+$')]", "object": { |
| 76 | + "order_id": { "xpath": "ORDER_ID", "type": "int" }, |
| 77 | + "customer_id": { "xpath": "CUSTOMER_ID" }, |
| 78 | + "country": { "xpath": "COUNTRY" } |
| 79 | + }} |
| 80 | + } |
| 81 | +``` |
| 82 | +
|
| 83 | +The `xpath` attributes on `"order_id"`, `"customer_id"`, and `"country"` tell omniparser where to get |
| 84 | +the field string data from. When `xpath` does **not** appear with `object`, `template`, `custom_func`, or |
| 85 | +`custom_parse`, it is a data extraction directive telling omniparser to extract the text data at the |
| 86 | +location specified by the `xpath` query. Note that in this situation, omniparser requires the result set of |
| 87 | +such an `xpath` query to be at most a single node: if the query results in more than one node, omniparser |
| 88 | +will fail the current record transform (but will continue onto the next one as this isn't considered fatal). |
| 89 | +
|
| 90 | +## Data Context and Anchoring |
| 91 | +
|
| 92 | +Whether `xpath` is used for record filtering or data extraction/anchoring, it's always good to know the |
| 93 | +current IDR tree "cursor" position against which an `xpath` query, if present, will be executed. |
| 94 | +
|
| 95 | +The current "cursor" position when a transform of `FINAL_OUTPUT` starts is always at the top of an IDR tree. |
| 96 | +So record filtering `FINAL_OUTPUT.xpath` is always executed against the root of the IDR tree. The "cursor" |
| 97 | +position remains unchanged until a new anchoring `xpath` is encountered. Typically, schema writers will need |
| 98 | +to change cursor anchoring positions more often in hierarchical file formats, such as EDI/JSON/XML, than |
| 99 | +"flat" file formats, like fixed-length or CSV. |
| 100 | +
|
| 101 | +Let's take a look at a [sample schema](../extensions/omniv21/samples/json/2_multiple_objects.schema.json) |
| 102 | +for JSON input: |
| 103 | +
|
| 104 | +``` |
| 105 | +1 "transform_declarations": { |
| 106 | +2 "FINAL_OUTPUT": { "xpath": "/publishers/*", "object": { |
| 107 | +3 "authors": { "array": [ { "xpath": "books/*/author" } ] }, |
| 108 | +4 "book_titles": { "array": [ { "xpath": "books/*/title" } ] }, |
| 109 | +5 "books": { "array": [ { "xpath": "books/*", "object": { |
| 110 | +6 "author": { "xpath": "author" }, |
| 111 | +7 "year": { "xpath": "year", "type": "int" }, |
| 112 | +8 "price": { "xpath": "price", "type": "float" }, |
| 113 | +9 "title": { "xpath": "title" } |
| 114 | +10 }} ] }, |
| 115 | +11 "publisher": { "xpath": "name" }, |
| 116 | +12 "first_book": { "xpath": "books/*[position() = 1]", "custom_func": { "name": "copy" }}, |
| 117 | +13 "original_book_array": { "xpath": "books", "custom_func": { "name": "copy" }} |
| 118 | +41 }} |
| 119 | +42 } |
| 120 | +``` |
| 121 | +Notes: |
| 122 | +- Line numbers are added for easier reference. |
| 123 | +- Only `transform_declarations` section is included here for brevity. |
| 124 | +
|
| 125 | +Consider this [input](../extensions/omniv21/samples/json/2_multiple_objects.input.json): |
| 126 | +``` |
| 127 | +1 { |
| 128 | +2 "publishers": [ |
| 129 | +3 { |
| 130 | +4 "name": "Scholastic Press", |
| 131 | +5 "books": [ |
| 132 | +6 { |
| 133 | +7 "title": "Harry Potter and the Philosopher's Stone", |
| 134 | +8 "price": 9.99, |
| 135 | +9 "author": "J. K. Rowling", |
| 136 | +10 "year": 1997 |
| 137 | +11 }, |
| 138 | +12 { |
| 139 | +13 "title": "Harry Potter and the Chamber of Secrets", |
| 140 | +14 "price": 10.99, |
| 141 | +15 "author": "J. K. Rowling", |
| 142 | +16 "year": 1998 |
| 143 | +17 } |
| 144 | +18 ] |
| 145 | +19 } |
| 146 | +20 ] } |
| 147 | +``` |
| 148 | +
|
| 149 | +Now let's go through the schema and input together to see how `xpath` anchoring is used. |
| 150 | +
|
| 151 | +1. schema `2 "FINAL_OUTPUT": { "xpath": "/publishers/*", "object": {` |
| 152 | +
|
| 153 | + This is record filtering, saying, we'd like to process and transform every record matching |
| 154 | + `/publishers/*`. In this simplified input example, only one JSON object matches it: the |
| 155 | + object starting at line 3 and finishing at line 19. With this line, the transform starts, and now the |
| 156 | + cursor is anchored at the top of this object. |
| 157 | +
|
| 158 | +2. schema `3 "authors": { "array": [ { "xpath": "books/*/author" } ] },` |
| 159 | +
|
| 160 | + Unlike an `object` transform, an `array` transform itself doesn't have an `xpath` attribute: an `array` |
| 161 | + transform is a collection of child transforms, each of which can optionally have its own `xpath`. |
| 162 | + This schema line says, `authors` in the output is an array, of which, each element is a string whose |
| 163 | + value comes from the `xpath` data extraction `books/*/author`. So with the input above, we will have |
| 164 | + `"authors": [ "J. K. Rowling", "J. K. Rowling" ]` in the final output. |
| 165 | +
|
| 166 | +3. schema `4 "book_titles": { "array": [ { "xpath": "books/*/title" } ] },` |
| 167 | +
|
| 168 | + Very similar to `authors` output above, `book_titles` output will be like: |
| 169 | + `"book_titles": [ "Harry Potter and the Philosopher's Stone", "Harry Potter and the Chamber of Secrets" ]` |
| 170 | + in the final output. |
| 171 | +
|
| 172 | +4. schema `5 "books": { "array": [ { "xpath": "books/*", "object": {` |
| 173 | +
|
| 174 | + Similar to `authors` and `book_titles` above, what this line says is, `books` in the output should be an |
| 175 | + array of objects, each of which, the IDR cursor should be anchored on `books/*` for its processing and |
| 176 | + transform. In other words, omniparser will anchor the IDR cursor on the JSON object from line 6 through |
| 177 | + line 11 for the first array element object transform, and then anchor on the JSON object from line 12 |
| 178 | + through line 17 for the second array element object transform. |
| 179 | +
|
| 180 | +5. schema `6 "author": { "xpath": "author" },` through line 9 |
| 181 | + Recall in 4., omniparser has put the cursor on the actual book object. Now line 6 through line 9 simply |
| 182 | + extract values from the object and put them into the corresponding output fields. |
| 183 | +
|
| 184 | +6. schema `12 "first_book": { "xpath": "books/*[position() = 1]", "custom_func": { "name": "copy" }},` |
| 185 | +
|
| 186 | + This is an interesting schema construct: we want `first_book` in the output to be a direct copy of the |
| 187 | + first book object inside input's `books` JSON array. `"xpath": "books/*[position() = 1]"` achieves the |
| 188 | + "only first book object" filtering. `"custom_func": { "name": "copy" }` achieves the direct copying. |
| 189 | +
|
| 190 | + As you may notice, a `custom_func` transform can have an (optional) `xpath` attribute as well. If `xpath` is present |
| 191 | + for a `custom_func`, then everything inside the `custom_func`, namely its argument transforms, is |
| 192 | + anchored on the cursor position prescribed by the `xpath`. |
| 193 | +
|
| 194 | +When `xpath` is used for anchoring and cursoring, it can appear with `object`, `template`, `custom_func`, and |
| 195 | +`custom_parse`. |
| 196 | +
|
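To make the cursor movement concrete, here is a rough Python sketch of the same walk using plain dictionaries. It is an analogy only: omniparser evaluates XPath against an IDR tree, while this sketch indexes into parsed JSON, and the function name `transform_publisher` is made up for illustration. The `original_book_array` field is left out for brevity.

```python
import json

INPUT = json.loads("""
{
  "publishers": [
    {
      "name": "Scholastic Press",
      "books": [
        {"title": "Harry Potter and the Philosopher's Stone", "price": 9.99,
         "author": "J. K. Rowling", "year": 1997},
        {"title": "Harry Potter and the Chamber of Secrets", "price": 10.99,
         "author": "J. K. Rowling", "year": 1998}
      ]
    }
  ]
}
""")


def transform_publisher(publisher):
    """Transform one record; `publisher` plays the role of the anchored cursor."""
    books = publisher["books"]
    return {
        # "books/*/author" is evaluated relative to the publisher cursor.
        "authors": [b["author"] for b in books],
        "book_titles": [b["title"] for b in books],
        # "books/*" with an object: the cursor moves onto each book in turn,
        # so the inner paths ("author", "year", ...) are relative to that book.
        "books": [
            {"author": b["author"], "year": int(b["year"]),
             "price": float(b["price"]), "title": b["title"]}
            for b in books
        ],
        "publisher": publisher["name"],
        # "books/*[position() = 1]" selects only the first book.
        "first_book": dict(books[0]),
    }


# "/publishers/*" yields one record per element of the publishers array.
records = [transform_publisher(p) for p in INPUT["publishers"]]
print(json.dumps(records[0]["authors"]))
```

Each nested comprehension corresponds to one anchoring `xpath`: the outer loop is `FINAL_OUTPUT.xpath`, and the loop over `books` is the `books/*` anchor on the array element.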
| 197 | +## Static and Dynamic XPath Queries |
| 198 | +
|
| 199 | +While `xpath` is the most commonly used filtering, anchoring and data extraction directive in schemas, the |
| 200 | +query itself is completely static: it is fixed at schema writing time, and thus can't |
| 201 | +be used where a data-dependent, runtime-computed query is needed. |
| 202 | +
|
| 203 | +Consider the following [sample input](../extensions/omniv21/samples/json/3_xpathdynamic.input.json): |
| 204 | +``` |
| 205 | +[ |
| 206 | + { |
| 207 | + "line_items": [ |
| 208 | + { |
| 209 | + "product": { |
| 210 | + "variant": { |
| 211 | + "option2": "Blue", |
| 212 | + "option1": "M" |
| 213 | + }, |
| 214 | + "options": [ |
| 215 | + { |
| 216 | + "index": 2, |
| 217 | + "name": "color/pattern", |
| 218 | + "values": [ |
| 219 | + "Blue", |
| 220 | + "Green" |
| 221 | + ] |
| 222 | + }, |
| 223 | + { |
| 224 | + "index": 1, |
| 225 | + "name": "Size", |
| 226 | + "values": [ |
| 227 | + "M", |
| 228 | + "L" |
| 229 | + ] |
| 230 | + } |
| 231 | + ] |
| 232 | + } |
| 233 | + } |
| 234 | + ] |
| 235 | + } |
| 236 | +] |
| 237 | +``` |
| 238 | +Notice the `options` array specifies which options are allowed/possible for a product, while `variant` |
| 239 | +of `product` specifies which actual options are included. |
| 240 | +
|
| 241 | +The [sample schema](../extensions/omniv21/samples/json/3_xpathdynamic.schema.json): |
| 242 | +``` |
| 243 | + "transform_declarations": { |
| 244 | + "FINAL_OUTPUT": { "xpath": "/*", "object": { |
| 245 | + "order_info": { "object": { |
| 246 | + "order_items": { "array": [ |
| 247 | + { "xpath": "line_items/*", "object": { |
| 248 | + .... |
| 249 | + "color": { "xpath_dynamic": { |
| 250 | + "custom_func": { |
| 251 | + "name": "concat", |
| 252 | + "args": [ |
| 253 | + { "const": "product/variant/option" }, |
| 254 | + { "xpath": "product/options/*[name='color/pattern']/index" } |
| 255 | + ] |
| 256 | + } |
| 257 | + }}, |
| 258 | + "size": { "xpath_dynamic": { |
| 259 | + "custom_func": { |
| 260 | + "name": "concat", |
| 261 | + "args": [ |
| 262 | + { "const": "product/variant/option" }, |
| 263 | + { "xpath": "product/options/*[name='Size']/index" } |
| 264 | + ] |
| 265 | + } |
| 266 | + }}, |
| 267 | + ... |
| 268 | + }} |
| 269 | + ]} |
| 270 | + }} |
| 271 | + }} |
| 272 | + } |
| 273 | +``` |
| 274 | +
|
| 275 | +The schema wants to transform `option1` and `option2` in the input into `size` and `color` in the output. The |
| 276 | +difficulty is how to figure out that `option1` is mapped to `size` and `option2` to `color`. If we look at the |
| 277 | +input's `options` array, it says `"index": 1` is for size and `"index": 2` is for color. To extract data |
| 278 | +for the `color` field in the output, we need to dynamically construct an XPath query by concatenating |
| 279 | +`product/variant/option` with the result of `product/options/*[name='color/pattern']/index`. A similar XPath construction is |
| 280 | +needed for `size` field data extraction. |
| 281 | +
|
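The two-step lookup can be sketched in Python as follows. This is a hedged illustration of the idea only: `resolve_option` is a hypothetical helper, dictionaries stand in for the IDR tree, and raising an error when the option name is not found exactly once is a simplification of this sketch, not a statement about omniparser's behavior.

```python
def resolve_option(product, option_name):
    """Return the variant value for the option whose name is `option_name`."""
    # Step 1: the inner query, product/options/*[name='...']/index.
    indexes = [o["index"] for o in product["options"] if o["name"] == option_name]
    if len(indexes) != 1:
        raise ValueError(f"expected exactly one option named {option_name!r}")
    # Step 2: concat builds the final path segment, e.g. "option2".
    key = "option" + str(indexes[0])
    # Step 3: run the computed query against the variant object.
    return product["variant"].get(key, "")


product = {
    "variant": {"option2": "Blue", "option1": "M"},
    "options": [
        {"index": 2, "name": "color/pattern", "values": ["Blue", "Green"]},
        {"index": 1, "name": "Size", "values": ["M", "L"]},
    ],
}

print(resolve_option(product, "color/pattern"))  # Blue
print(resolve_option(product, "Size"))           # M
```

The key point is that the final path depends on data read from the same record, which is why a fixed `xpath` string cannot express it.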
| 282 | +`xpath_dynamic` is used in such a situation. Unlike `xpath`, which is always a constant and static |
| 283 | +string value, `xpath_dynamic` is computed, by either `custom_func`, or `custom_parse`, or `template`, or |
| 284 | +`external`, or `const`, or another `xpath` direct data extraction. |
| 285 | +
|
| 286 | +`xpath_dynamic` can be used everywhere `xpath` is used, except on `FINAL_OUTPUT`. `FINAL_OUTPUT` can only |
| 287 | +use `xpath`. |
| 288 | +
|
| 289 | +## XPath Query Result-set Cardinality |
| 290 | +
|
| 291 | +Every time an `xpath` or `xpath_dynamic` query is executed against an IDR node (and its subtree), the |
| 292 | +result is always a set of nodes: could be an empty set, or a set of one node, or a set of multiple nodes. |
| 293 | +Depending on which transform is in play, different outcomes, including error, can follow. |
| 294 | +
|
| 295 | +- `xpath`/`xpath_dynamic` used alone, aka data extraction transform: |
| 296 | +
|
| 297 | + - Example: `"field1": { "xpath": "PATH/TO/DATA" }` |
| 298 | + - The result set must be either empty or of a single node. When empty, `""` is used; when a single |
| 299 | + node is returned for the query, the node's text data will be used; when more than one node is returned, |
| 300 | + omniparser will return a transform error (non-fatal). |
| 301 | +
|
| 302 | +- `xpath` used in `FINAL_OUTPUT`: |
| 303 | +
|
| 304 | + - Example: `"FINAL_OUTPUT": { "xpath": "/publishers/*", "object": {` |
| 305 | + - The result set can be either empty, or of one node, or of multiple nodes. |
| 306 | +
|
| 307 | +- `xpath`/`xpath_dynamic` used in `object`, `custom_func`, `custom_parse`, `template` transform |
| 308 | +(other than `FINAL_OUTPUT` or directly under an `array` transform): |
| 309 | +
|
| 310 | + - Example: `"contact": { "xpath": "PATH/TO/CONTACT", "object": {` |
| 311 | + - Example: `"temperature": { "xpath": "PATH/TO/TEMPERATURE", "custom_func": {` |
| 312 | + - Example: `"wind_forecast": { "xpath": "PATH/TO/WIND", "template": {` |
| 313 | + - The result set can only be either empty or of one node. A multi-node result set will cause a parser error. |
| 314 | +
|
| 315 | +- `xpath`/`xpath_dynamic` used in transform that is directly under `array` transform: |
| 316 | +
|
| 317 | + - Example: `"titles": { "array": [ { "xpath": "books/*/title" } ] }` |
| 318 | + - Example: `"titles": { "array": [ { "xpath": "books/*/title" }, { "xpath": "movies/*/title" } ] }` |
| 319 | + - The first example is the most commonly used scenario, that is, the `array` contains homogeneous element |
| 320 | + transforms. In this case, the `xpath` can return empty, or one node, or multiple nodes and results will |
| 321 | + be used as the array's elements. |
| 322 | + - The second example shows the flexibility of `array` transform, that it can contain different transforms: |
| 323 | + one set of titles is of book titles and another set of movie titles. All titles, books' or movies', are |
| 324 | + contained by the array. Similar to the first case, both `xpath` result sets can return empty, one node or |
| 325 | + multiple nodes. All are fine and accepted by the parser. |
| 326 | +
|
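The cardinality rules for data extraction and for `array` children can be summarized with a short Python sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in tree, which is an assumption made purely for illustration: omniparser's IDR is a different structure, the helper names `extract_single` and `extract_array` are hypothetical, and the book titles are placeholder data.

```python
import xml.etree.ElementTree as ET

# A stand-in tree with placeholder data; omniparser's IDR is not an ElementTree.
RECORD = ET.fromstring(
    "<record>"
    "<name>Scholastic Press</name>"
    "<books>"
    "<book><title>Book A</title></book>"
    "<book><title>Book B</title></book>"
    "</books>"
    "</record>"
)


def extract_single(node, path):
    """Data extraction: empty -> "", one node -> its text, many -> error."""
    found = node.findall(path)
    if not found:
        return ""
    if len(found) > 1:
        raise ValueError(f"{path!r} matched {len(found)} nodes, expected at most 1")
    return found[0].text or ""


def extract_array(node, paths):
    """Array transform: each child path may match zero, one, or many nodes."""
    return [n.text or "" for path in paths for n in node.findall(path)]


print(extract_single(RECORD, "name"))
print(repr(extract_single(RECORD, "missing")))
print(extract_array(RECORD, ["books/*/title", "movies/*/title"]))
```

Calling `extract_single(RECORD, "books/*/title")` raises an error in this sketch, which corresponds to the non-fatal transform error described in the first bullet above.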
| 327 | +## Supported XPath Features |
| 328 | +
|
| 329 | +Omniparser relies on https://github.com/antchfx/xpath (thank you!) for XPath query parsing and execution. |
| 330 | +Check its GitHub page for the full list of supported syntax and functions. |