Commit b13fcee

Add xpath.md; upgrade xpath lib to v1.1.11 so we can start to use regexp func matches in xpath query (#124)
1 parent 8a02d06 commit b13fcee

5 files changed

+336
-4
lines changed

README.md

Lines changed: 2 additions & 2 deletions
@@ -14,8 +14,8 @@ Golang Version: 1.14
1414
Docs:
1515
- [Getting Started](./doc/gettingstarted.md): a tutorial for writing your first omniparser schema.
1616
- [IDR](./doc/idr.md): in-memory data representation of ingested data for omniparser.
17-
- [XPath Based Data Extraction and Filtering](./doc/xpath.md): xpath queries are essential to omniparser schema writing.
18-
Learn the concept and tricks in depth.
17+
- [XPath Based Record Filtering and Data Extraction](./doc/xpath.md): xpath queries are essential to omniparser schema
18+
writing. Learn the concept and tricks in depth.
1919
- [Use of `custom_func`, Specially `javascript`](./doc/use_of_custom_funcs.md): An in depth look of how `custom_func`
2020
is used, specially the all mighty `javascript` (and `javascript_with_context`).
2121
- [CSV Schema in Depth](./doc/csv_in_depth.md): everything about schemas for CSV input.

doc/xpath.md

Lines changed: 330 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,330 @@
1+
# XPath Based Record Filtering and Data Extraction
2+
3+
The foundation of omniparser transform operations is anchored on [IDR](./idr.md) and XPath based record
4+
filtering and data extraction. It's vital to understand each supported file format's IDR structure to
5+
effectively and efficiently craft XPath queries in `transform_declarations` to achieve desired transform
6+
objectives.
7+
8+
## Record Filtering
9+
10+
Often some ingested records are not suitable/desirable to be transformed into output. Omniparser, more
11+
specifically the current latest version (`"omni.2.1"`) handler, allows record level filtering using XPath
12+
query. Let's see one example in CSV:
13+
14+
```
15+
ORDER_ID,CUSTOMER_ID,COUNTRY
16+
1234,CUST_1,US
17+
N/A
18+
1235,CUST_2,AU
19+
```
20+
21+
We want omniparser to ingest and transform records with `order_id=1234,1235` and skip the line with
22+
`'N/A'`. To achieve this, we can insert `xpath` into the root `object` of `FINAL_OUTPUT` in
23+
`transform_declarations`:
24+
25+
```
26+
"transform_declarations": {
27+
"FINAL_OUTPUT": { "xpath": ".[matches(ORDER_ID, '^[0-9]+$')]", "object": {
28+
...
29+
}}
30+
}
31+
```
32+
33+
Let's take a look at how the transform works for the first data line `1234,CUST_1,US`:
34+
1. Omniparser reads the first line in and converts it into a [CSV specific IDR tree](./idr.md#csv-aka-delimited):
35+
```
36+
Node(Type: DocumentNode)
37+
Node(Type: ElementNode, Data: "ORDER_ID")
38+
Node(Type: TextNode, Data: "1234")
39+
Node(Type: ElementNode, Data: "CUSTOMER_ID")
40+
Node(Type: TextNode, Data: "CUST_1")
41+
Node(Type: ElementNode, Data: "COUNTRY")
42+
Node(Type: TextNode, Data: "US")
43+
```
44+
2. `FINAL_OUTPUT.xpath` is then executed at the root of the IDR tree, and the result is a match! So this
45+
line/record will be processed.
46+
47+
Now take a look at the second line `N/A`:
48+
1. The IDR tree looks like:
49+
```
50+
Node(Type: DocumentNode)
51+
Node(Type: ElementNode, Data: "ORDER_ID")
52+
Node(Type: TextNode, Data: "N/A")
53+
```
54+
2. `FINAL_OUTPUT.xpath` is executed at the root of the IDR tree, and the result is not a match. This line/record
55+
will be skipped.
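
To make the filtering decision concrete, here is a minimal Go sketch that applies the same regular expression as the schema's `matches(ORDER_ID, '^[0-9]+$')` predicate to the first column of each sample line. It is only an illustration using Go's standard `regexp` package: `keepRecord` is a hypothetical helper, not part of omniparser, and the sketch does not build an IDR tree or run a real XPath query.

```go
package main

import (
	"fmt"
	"regexp"
)

// orderIDPattern is the pattern from the schema's
// matches(ORDER_ID, '^[0-9]+$') predicate.
var orderIDPattern = regexp.MustCompile(`^[0-9]+$`)

// keepRecord is a hypothetical helper (an assumption of this sketch, not an
// omniparser API): it reports whether a record whose ORDER_ID column holds
// orderID would pass the record filter.
func keepRecord(orderID string) bool {
	return orderIDPattern.MatchString(orderID)
}

func main() {
	// ORDER_ID values of the three data lines in the sample CSV.
	for _, id := range []string{"1234", "N/A", "1235"} {
		fmt.Printf("ORDER_ID=%q keep=%v\n", id, keepRecord(id))
	}
}
```

The anchors `^` and `$` matter here: without them, any value that merely contains a digit would pass the filter.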
56+
57+
Each input format has its own unique IDR structure, so a record filtering XPath needs to take it into consideration
58+
to be effective.
59+
60+
Clever use of positive/negative regexp [`matches`](https://github.com/antchfx/xpath#expressions) (slightly
61+
slower but very powerful), or [`starts-with`, `ends-with`, `contains`](https://github.com/antchfx/xpath#expressions),
62+
or even direct string comparisons [`=`, `!=`](https://github.com/antchfx/xpath#expressions) in
63+
`FINAL_OUTPUT.xpath` gives schema writers the freedom of either processing certain lines/records, or skipping
64+
certain lines/records.
65+
66+
If `FINAL_OUTPUT` doesn't have `xpath`, which is fairly common, then there is no record filtering, meaning
67+
all records ingested by omniparser file format specific readers will be processed and transformed.
68+
69+
## Data Extraction
70+
71+
The most common use of `xpath` is for data extraction. Consider again the sample CSV and schema in
72+
[Record Filtering](#record-filtering), let's amend the schema to:
73+
```
74+
"transform_declarations": {
75+
"FINAL_OUTPUT": { "xpath": ".[matches(ORDER_ID, '^[0-9]+$')]", "object": {
76+
"order_id": { "xpath": "ORDER_ID", "type": "int" },
77+
"customer_id": { "xpath": "CUSTOMER_ID" },
78+
"country": { "xpath": "COUNTRY" }
79+
}}
80+
}
81+
```
82+
83+
The `xpath` attributes on `"order_id"`, `"customer_id"`, and `"country"` tell omniparser where to get
84+
the field string data from. When `xpath` does **not** appear with `object`, `template`, `custom_func`, or
85+
`custom_parse`, then it is a data extraction directive telling omniparser to extract the text data at the
86+
location specified by the `xpath` query. Note in this situation, omniparser will require the result set of
87+
such `xpath` queries to be of a single node: if such `xpath` query results in more than one node, omniparser
88+
will fail the current record transform (but will continue onto the next one as this isn't considered fatal).
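
The single-node requirement can be sketched in Go as follows. This is a simplified model, not omniparser's code: `extractText` and `errMultipleNodes` are hypothetical names, and the matched nodes are represented only by their text. The empty-result behavior (`""`) follows the rule stated in the result-set cardinality section of this document.

```go
package main

import (
	"errors"
	"fmt"
)

// errMultipleNodes is a hypothetical error value used by this sketch.
var errMultipleNodes = errors.New("xpath query returned more than one node")

// extractText models a data extraction transform. matched holds the text of
// every node selected by the query (an assumption of this sketch).
func extractText(matched []string) (string, error) {
	switch len(matched) {
	case 0:
		// Empty result set: the field value is the empty string.
		return "", nil
	case 1:
		// Exactly one node: use its text data.
		return matched[0], nil
	default:
		// More than one node: fail this record's transform only.
		return "", errMultipleNodes
	}
}

func main() {
	for _, matched := range [][]string{{}, {"US"}, {"US", "AU"}} {
		text, err := extractText(matched)
		fmt.Printf("nodes=%d text=%q err=%v\n", len(matched), text, err)
	}
}
```

A caller that processes many records would log or collect the error and continue with the next record, since the failure is not fatal.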
89+
90+
## Data Context and Anchoring
91+
92+
Whether `xpath` is used for record filtering or data extraction/anchoring, it's always good to know the
93+
current IDR tree "cursor" position against which an `xpath` query, if present, will be executed.
94+
95+
The current "cursor" position when a transform of `FINAL_OUTPUT` starts is always at the top of an IDR tree.
96+
So record filtering `FINAL_OUTPUT.xpath` is always executed against the root of the IDR tree. The "cursor"
97+
position remains unchanged until a new anchoring `xpath` is encountered. Typically, schema writers will need
98+
to change cursor anchoring positions more often in hierarchical file formats, such as EDI/JSON/XML, than
99+
"flat" file formats, like fixed-length or CSV.
100+
101+
Let's take a look at a [sample schema](../extensions/omniv21/samples/json/2_multiple_objects.schema.json)
102+
for JSON input:
103+
104+
```
105+
1 "transform_declarations": {
106+
2 "FINAL_OUTPUT": { "xpath": "/publishers/*", "object": {
107+
3 "authors": { "array": [ { "xpath": "books/*/author" } ] },
108+
4 "book_titles": { "array": [ { "xpath": "books/*/title" } ] },
109+
5 "books": { "array": [ { "xpath": "books/*", "object": {
110+
6 "author": { "xpath": "author" },
111+
7 "year": { "xpath": "year", "type": "int" },
112+
8 "price": { "xpath": "price", "type": "float" },
113+
9 "title": { "xpath": "title" }
114+
10 }} ] },
115+
11 "publisher": { "xpath": "name" },
116+
12 "first_book": { "xpath": "books/*[position() = 1]", "custom_func": { "name": "copy" }},
117+
13 "original_book_array": { "xpath": "books", "custom_func": { "name": "copy" }}
118+
14 }}
119+
15 }
120+
```
121+
Notes:
122+
- Line numbers are added for easier reference.
123+
- Only `transform_declarations` section is included here for brevity.
124+
125+
Consider this [input](../extensions/omniv21/samples/json/2_multiple_objects.input.json):
126+
```
127+
1 {
128+
2 "publishers": [
129+
3 {
130+
4 "name": "Scholastic Press",
131+
5 "books": [
132+
6 {
133+
7 "title": "Harry Potter and the Philosopher's Stone",
134+
8 "price": 9.99,
135+
9 "author": "J. K. Rowling",
136+
10 "year": 1997
137+
11 },
138+
12 {
139+
13 "title": "Harry Potter and the Chamber of Secrets",
140+
14 "price": 10.99,
141+
15 "author": "J. K. Rowling",
142+
16 "year": 1998
143+
17 }
144+
18 ]
145+
19 }
146+
20 ]
21 }
147+
```
148+
149+
Now let's go through the schema and input together to see how `xpath` anchoring is used.
150+
151+
1. schema `2 "FINAL_OUTPUT": { "xpath": "/publishers/*", "object": {`
152+
153+
This is record filtering, saying, we'd like to process and transform every record matching
154+
`/publishers/*`. In this simplified input example, only one JSON object matches it: the
155+
object starting at line 3 and finishing at line 19. With this line, the transform starts, and now the
156+
cursor is anchored at the top of this object.
157+
158+
2. schema `3 "authors": { "array": [ { "xpath": "books/*/author" } ] },`
159+
160+
Unlike an `object` transform, an `array` transform itself does not have an `xpath` attribute: an `array`
161+
transform is a collection of child transforms, each of which can optionally have its own `xpath`.
162+
This schema line says, `authors` in the output is an array, of which, each element is a string whose
163+
value comes from the `xpath` data extraction `books/*/author`. So with the input above, we will have
164+
`"authors": [ "J. K. Rowling", "J. K. Rowling" ]` in the final output.
165+
166+
3. schema `4 "book_titles": { "array": [ { "xpath": "books/*/title" } ] },`
167+
168+
Very similar to `authors` output above, `book_titles` output will be like:
169+
`"book_titles": [ "Harry Potter and the Philosopher's Stone", "Harry Potter and the Chamber of Secrets" ]`
170+
in the final output.
171+
172+
4. schema `5 "books": { "array": [ { "xpath": "books/*", "object": {`
173+
174+
Similar to `authors` and `book_titles` above, what this line says is, `books` in the output should be an
175+
array of objects, each of which, the IDR cursor should be anchored on `books/*` for its processing and
176+
transform. In other words, omniparser will anchor the IDR cursor on the JSON object from line 6 through
177+
line 11 for the first array element object transform, and then anchor on the JSON object from line 12
178+
through line 17 for the second array element object transform.
179+
180+
5. schema `6 "author": { "xpath": "author" },` and through line 9
181+
Recall that in step 4, omniparser put the cursor on the actual book object. Lines 6 through 9 simply
182+
extract values from the object and put them into the corresponding output fields.
183+
184+
6. schema `12 "first_book": { "xpath": "books/*[position() = 1]", "custom_func": { "name": "copy" }},`
185+
186+
This is an interesting schema construct: we want `first_book` in the output to be a direct copy of the
187+
first book object inside input's `books` JSON array. `"xpath": "books/*[position() = 1]"` achieves the
188+
"only first book object" filtering. `"custom_func": { "name": "copy" }` achieves the direct copying.
189+
190+
As you may notice, a `custom_func` transform can have an (optional) `xpath` attribute as well. If `xpath` is present
191+
for a `custom_func`, then everything inside the `custom_func`, namely those argument transforms, are all
192+
anchored on the cursor position prescribed by the `xpath`.
193+
194+
When `xpath` is used for anchoring and cursoring, it can appear with `object`, `template`, `custom_func`, and
195+
`custom_parse`.
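
The idea of a cursor and relative queries can be illustrated with a small Go sketch. Everything in it is an assumption made for illustration: `node` is a simplified stand-in for an IDR node, JSON array elements are modeled as unnamed nodes, and `selectPath` only understands child steps that are a name or `*`. It is not the XPath engine omniparser uses.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a simplified, hypothetical stand-in for an IDR node.
type node struct {
	name     string
	text     string
	children []*node
}

// selectPath evaluates a relative path made of child steps (a name or "*")
// starting at the cursor node and returns every matching node.
func selectPath(cursor *node, path string) []*node {
	current := []*node{cursor}
	for _, step := range strings.Split(path, "/") {
		var next []*node
		for _, n := range current {
			for _, c := range n.children {
				if step == "*" || c.name == step {
					next = append(next, c)
				}
			}
		}
		current = next
	}
	return current
}

func leaf(name, text string) *node { return &node{name: name, text: text} }

// samplePublisher models the publisher object from the sample input
// (title and author only, for brevity).
func samplePublisher() *node {
	book := func(title, author string) *node {
		return &node{children: []*node{leaf("title", title), leaf("author", author)}}
	}
	return &node{children: []*node{
		leaf("name", "Scholastic Press"),
		&node{name: "books", children: []*node{
			book("Harry Potter and the Philosopher's Stone", "J. K. Rowling"),
			book("Harry Potter and the Chamber of Secrets", "J. K. Rowling"),
		}},
	}}
}

func main() {
	// The cursor after FINAL_OUTPUT's "/publishers/*" has matched.
	publisher := samplePublisher()

	// "books/*/author" is evaluated relative to the publisher.
	for _, a := range selectPath(publisher, "books/*/author") {
		fmt.Println("author:", a.text)
	}

	// "books/*" re-anchors the cursor on each book; "title" is then
	// evaluated relative to that book.
	for _, b := range selectPath(publisher, "books/*") {
		for _, t := range selectPath(b, "title") {
			fmt.Println("title:", t.text)
		}
	}
}
```

Note that `title` evaluated against the publisher cursor selects nothing in this model, because `title` is a child of a book, not of the publisher. The same relative query only works after the cursor has been re-anchored on `books/*`.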
196+
197+
## Static and Dynamic XPath Queries
198+
199+
While `xpath` is the most commonly used filtering, anchoring and data extraction directive in schemas, the
200+
query itself is completely static: it is fixed at schema writing time, and thus can't
201+
be used where a data-dependent query computed at runtime is needed.
202+
203+
Consider the following [sample input](../extensions/omniv21/samples/json/3_xpathdynamic.input.json):
204+
```
205+
[
206+
{
207+
"line_items": [
208+
{
209+
"product": {
210+
"variant": {
211+
"option2": "Blue",
212+
"option1": "M"
213+
},
214+
"options": [
215+
{
216+
"index": 2,
217+
"name": "color/pattern",
218+
"values": [
219+
"Blue",
220+
"Green"
221+
]
222+
},
223+
{
224+
"index": 1,
225+
"name": "Size",
226+
"values": [
227+
"M",
228+
"L"
229+
]
230+
}
231+
]
232+
}
233+
}
234+
]
235+
}
236+
]
237+
```
238+
Notice that the `options` array specifies the allowed/possible options for a product, while the `variant`
239+
of `product` specifies which options are actually included.
240+
241+
The [sample schema](../extensions/omniv21/samples/json/3_xpathdynamic.schema.json):
242+
```
243+
"transform_declarations": {
244+
"FINAL_OUTPUT": { "xpath": "/*", "object": {
245+
"order_info": { "object": {
246+
"order_items": { "array": [
247+
{ "xpath": "line_items/*", "object": {
248+
....
249+
"color": { "xpath_dynamic": {
250+
"custom_func": {
251+
"name": "concat",
252+
"args": [
253+
{ "const": "product/variant/option" },
254+
{ "xpath": "product/options/*[name='color/pattern']/index" }
255+
]
256+
}
257+
}},
258+
"size": { "xpath_dynamic": {
259+
"custom_func": {
260+
"name": "concat",
261+
"args": [
262+
{ "const": "product/variant/option" },
263+
{ "xpath": "product/options/*[name='Size']/index" }
264+
]
265+
}
266+
}},
267+
...
268+
}}
269+
]}
270+
}}
271+
}}
272+
}
273+
```
274+
275+
The schema wants to transform `option1` and `option2` in the input into `color` and `size` in the output. The
276+
difficulty is figuring out which of `option1` and `option2` maps to `color` and which maps to `size`. If we look at the
277+
input's `options` array, it says `"index": 1` is for size and `"index": 2` is for color. To extract data
278+
for `color` field in the output, we need to dynamically construct an XPath query by
279+
`product/variant/option` + `product/options/*[name='color/pattern']/index`. Similar XPath construction is
280+
needed for `size` field data extraction.
281+
282+
`xpath_dynamic` is used in such a situation. Unlike `xpath`, which is always a constant and static
283+
string value, `xpath_dynamic` is computed, by either `custom_func`, or `custom_parse`, or `template`, or
284+
`external`, or `const`, or another `xpath` direct data extraction.
285+
286+
`xpath_dynamic` can be used everywhere `xpath` is used, except on `FINAL_OUTPUT`. `FINAL_OUTPUT` can only
287+
use `xpath`.
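
The following Go sketch shows how the query string is assembled for the sample input above. It mirrors what the schema's `concat` call does (a constant prefix plus the `index` of the named option), but it is only a model: `option` and `dynamicQuery` are hypothetical names, and the final lookup uses a plain map keyed by path instead of a real XPath evaluation.

```go
package main

import "fmt"

// option mirrors one entry of the input's `options` array (values omitted).
type option struct {
	index int
	name  string
}

// dynamicQuery builds the query the way the schema's concat call does:
// the constant "product/variant/option" followed by the index of the
// option with the given name. It reports false if no such option exists.
func dynamicQuery(options []option, name string) (string, bool) {
	for _, o := range options {
		if o.name == name {
			return fmt.Sprintf("product/variant/option%d", o.index), true
		}
	}
	return "", false
}

func main() {
	options := []option{
		{index: 2, name: "color/pattern"},
		{index: 1, name: "Size"},
	}
	// The variant's values, keyed by the path that would select them.
	variant := map[string]string{
		"product/variant/option1": "M",
		"product/variant/option2": "Blue",
	}
	for _, name := range []string{"color/pattern", "Size"} {
		if q, ok := dynamicQuery(options, name); ok {
			fmt.Printf("%s: query=%s value=%s\n", name, q, variant[q])
		}
	}
}
```

For this input, the color query resolves to `product/variant/option2` and the size query to `product/variant/option1`, which is why a static `xpath` cannot express the mapping.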
288+
289+
## XPath Query Result-set Cardinality
290+
291+
Every time an `xpath` or `xpath_dynamic` query is executed against an IDR node (and its subtree), the
292+
result is always a set of nodes: could be an empty set, or a set of one node, or a set of multiple nodes.
293+
Depending on which transform is in play, different outcomes, including error, can follow.
294+
295+
- `xpath`/`xpath_dynamic` used alone, aka data extraction transform:
296+
297+
- Example: `"field1": { "xpath": "PATH/TO/DATA" }`
298+
- The result set must be either empty or of a single node. When empty, `""` is used; when a single
299+
node is returned for the query, the node's text data will be used; when more than one node is returned,
300+
omniparser will return a transform error (non-fatal).
301+
302+
- `xpath` used in `FINAL_OUTPUT`:
303+
304+
- Example: `"FINAL_OUTPUT": { "xpath": "/publishers/*", "object": {`
305+
- The result set can be either empty, or of one node, or of multiple nodes.
306+
307+
- `xpath`/`xpath_dynamic` used in `object`, `custom_func`, `custom_parse`, `template` transform
308+
(other than `FINAL_OUTPUT` or directly under an `array` transform):
309+
310+
- Example: `"contact": { "xpath": "PATH/TO/CONTACT", "object": {`
311+
- Example: `"temperature": { "xpath": "PATH/TO/TEMPERATURE", "custom_func": {`
312+
- Example: `"wind_forecast": { "xpath": "PATH/TO/WIND", "template": {`
313+
- The result set can only be either empty or of one node. A multi-node result set causes a parser error.
314+
315+
- `xpath`/`xpath_dynamic` used in transform that is directly under `array` transform:
316+
317+
- Example: `"titles": { "array": [ { "xpath": "books/*/title" } ] }`
318+
- Example: `"titles": { "array": [ { "xpath": "books/*/title" }, { "xpath": "movies/*/title" } ] }`
319+
- The first example is the most commonly used scenario, that is, the `array` contains homogeneous element
320+
transforms. In this case, the `xpath` can return empty, or one node, or multiple nodes and results will
321+
be used as the array's elements.
322+
- The second example shows the flexibility of `array` transform, that it can contain different transforms:
323+
one set of titles is of book titles and another set of movie titles. All titles, books' or movies', are
324+
contained by the array. Similar to the first case, both `xpath` result sets can return empty, one node or
325+
multiple nodes. All are fine and accepted by the parser.
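
As a rough illustration of the `array` case, the Go sketch below collects the result sets of several child transforms into one output array. `buildArray` is a hypothetical helper, result sets are modeled as plain string slices, and the element ordering shown (child transforms in declaration order) is an assumption of the sketch rather than something this document specifies.

```go
package main

import "fmt"

// buildArray models an `array` transform: each argument is the result set
// of one child transform, and a set may hold zero, one, or many values.
// All values end up in a single output array.
func buildArray(resultSets ...[]string) []string {
	out := []string{}
	for _, rs := range resultSets {
		out = append(out, rs...)
	}
	return out
}

func main() {
	bookTitles := []string{
		"Harry Potter and the Philosopher's Stone",
		"Harry Potter and the Chamber of Secrets",
	}
	// An empty result set is accepted and simply contributes nothing.
	movieTitles := []string{}

	titles := buildArray(bookTitles, movieTitles)
	fmt.Println(len(titles), "titles:", titles)
}
```

This contrasts with a data extraction transform outside of an `array`, where a result set of more than one node is an error.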
326+
327+
## Supported XPath Features
328+
329+
Omniparser relies on https://github.com/antchfx/xpath (thank you!) for XPath query parsing and execution.
330+
Check its GitHub page for the full syntax and function support list.

extensions/omniv21/samples/customfileformats/jsonlog/sample_schema.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@
44
"file_format_type": "jsonlog"
55
},
66
"transform_declarations": {
7-
"FINAL_OUTPUT": { "xpath": ".[(severity='WARNING' or severity='ERROR' or severity='CRITICAL') and source='api']", "object": {
7+
"FINAL_OUTPUT": { "xpath": ".[matches(severity, '^(WARNING|ERROR|CRITICAL)$') and source='api']", "object": {
88
"timestamp": { "xpath": "timestamp" },
99
"source": { "const": "api" },
1010
"severity": { "custom_func": {

go.mod

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ go 1.14
44

55
require (
66
github.com/antchfx/xmlquery v1.3.1
7-
github.com/antchfx/xpath v1.1.10
7+
github.com/antchfx/xpath v1.1.11
88
github.com/bradleyjkemp/cupaloy v2.3.0+incompatible
99
github.com/dlclark/regexp2 v1.2.1 // indirect
1010
github.com/dop251/goja v0.0.0-20201002140143-8ce18d86df5f

go.sum

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,8 @@ github.com/antchfx/xmlquery v1.3.1 h1:nIKWdtnhrXtj0/IRUAAw2I7TfpHUa3zMnHvNmPXFg+
77
github.com/antchfx/xmlquery v1.3.1/go.mod h1:64w0Xesg2sTaawIdNqMB+7qaW/bSqkQm+ssPaCMWNnc=
88
github.com/antchfx/xpath v1.1.10 h1:cJ0pOvEdN/WvYXxvRrzQH9x5QWKpzHacYO8qzCcDYAg=
99
github.com/antchfx/xpath v1.1.10/go.mod h1:Yee4kTMuNiPYJ7nSNorELQMr1J33uOpXDMByNYhvtNk=
10+
github.com/antchfx/xpath v1.1.11 h1:WOFtK8TVAjLm3lbgqeP0arlHpvCEeTANeWZ/csPpJkQ=
11+
github.com/antchfx/xpath v1.1.11/go.mod h1:i54GszH55fYfBmoZXapTHN8T8tkcHfRgLyVwwqzXNcs=
1012
github.com/armon/consul-api v0.0.0-20180202201655-eb2c6b5be1b6/go.mod h1:grANhF5doyWs3UAsr3K4I6qtAmlQcZDesFNEHPZAzj8=
1113
github.com/beorn7/perks v0.0.0-20180321164747-3a771d992973/go.mod h1:Dwedo/Wpr24TaqPxmxbtue+5NUziq4I4S80YR8gNf3Q=
1214
github.com/beorn7/perks v1.0.0/go.mod h1:KWe93zE9D1o94FZ5RNwFwVgaQK1VOXiVxmqh+CedLV8=
