Mix-Ins

In this section we cover some of the included mix-ins and detail the functions that they add.

Core

Core is the set of helpers that can be added to content nodes to support searching and tagging.

tag(self, tag_name, type_re=None, content_re=None, use_all_content=False, node_only=False, include_children=False, fixed_position=None, data=None)

This will tag (see Feature Tagging) the expression groups identified by the regular expression.

>>> document.content_node.find(content_re='.*Cheese.*').tag('is_cheese')
Parameters:
  • tag_name – the name of the tag to apply
  • type_re – regular expression to match the type (default .*)
  • content_re – the regular expression to tag with; a tag is created for each matching group
  • use_all_content – apply the regular expression to all_content (which includes content from child nodes)
  • node_only – Ignore the matching groups and tag the whole node
  • include_children – Recurse into children and tag where matching
  • fixed_position – use a fixed position, supplied as a tuple, e.g. (4,10) tags from position 4 to 10 (default None)
  • data – Attach a dictionary of data to the given tag
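
To make the "one tag per matching group" idea concrete, here is a minimal sketch that uses only Python's re module. The helper tag_spans is hypothetical and is not part of the mix-in. It shows one way [start, end, value] entries could be derived from match groups, and it assumes the whole match is used when the pattern has no groups. It is not how tag is actually implemented.

```python
import re


def tag_spans(content, content_re):
    """Hypothetical helper: one [start, end, value] entry per matching group."""
    entries = []
    for match in re.finditer(content_re, content):
        # Use the capture groups when the pattern defines any, else the whole match.
        group_ids = range(1, match.re.groups + 1) if match.re.groups else [0]
        for group_id in group_ids:
            if match.group(group_id) is not None:
                entries.append(
                    [match.start(group_id), match.end(group_id), match.group(group_id)]
                )
    return entries


spans = tag_spans("Brie and Cheddar", r"(Brie|Cheddar)")
print(spans)
```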
tag_range(self, start_content_re, end_content_re, tag_name, type_re='.*', use_all_content=False)

This will tag all the child nodes between the start and end content regular expressions

>>> document.content_node.tag_range(start_content_re='.*Cheese.*', end_content_re='.*Fish.*', tag_name='foo')
Parameters:
  • start_content_re – The regular expression to match the starting child
  • end_content_re – The regular expression to match the ending child
  • tag_name – The tag to be applied to the nodes in range
  • type_re – The type to match (default is all)
  • use_all_content – Use full content (including child nodes, default is False)
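
The selection of a range of children can be sketched as below. The helper range_between is hypothetical and works on plain strings rather than content nodes. Whether the real method includes the two matching children themselves is an assumption here (the sketch includes both), as is the use of full-string matching.

```python
import re


def range_between(contents, start_content_re, end_content_re):
    """Hypothetical helper: indexes from the first start match to the next end match."""
    start = next(
        (i for i, text in enumerate(contents) if re.fullmatch(start_content_re, text)),
        None,
    )
    if start is None:
        return []
    end = next(
        (
            i
            for i in range(start + 1, len(contents))
            if re.fullmatch(end_content_re, contents[i])
        ),
        None,
    )
    if end is None:
        return []
    # Assumption: both the start and the end child are part of the range.
    return list(range(start, end + 1))


children = ["Intro", "Cheese plate", "Wine", "Fish course", "Dessert"]
in_range = range_between(children, ".*Cheese.*", ".*Fish.*")
```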
get_all_content(self, separator=' ')

This will build the complete content, including the content of children.

>>> document.content_node.get_all_content()
"This string is made up of multiple nodes"
Parameters:separator – the separator used to join the content together (default is " ")
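
As an illustration of how content can be assembled from a node and its children, the sketch below uses a hypothetical stand-in Node class (not the kodexa ContentNode) and a free function all_content. It assumes that nodes without content are simply skipped.

```python
class Node:
    """Hypothetical stand-in for a content node: optional content plus children."""

    def __init__(self, content=None, children=None):
        self.content = content
        self.children = children or []


def all_content(node, separator=" "):
    # Start with this node's own content, then append each child's full content.
    parts = [node.content] if node.content else []
    for child in node.children:
        child_text = all_content(child, separator)
        if child_text:
            parts.append(child_text)
    return separator.join(parts)


line = Node(children=[Node("This string"), Node("is made up of"), Node("multiple nodes")])
text = all_content(line)
```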
get_all_tags(self)

Returns a list of the names of the tags on the given node and all its children

>>> document.content_node.find(content_re='.*Cheese.*').get_all_tags()
['is_cheese']
Returns:A list of the tag names
get_tag(self, tag_name)

Returns the value of a tag. This is either a single list [start,end,value] or, if multiple parts of this node's content match, a list of lists, e.g. [[start1,end1,value1],[start2,end2,value2]]

>>> document.content_node.find(content_re='.*Cheese.*').get_tag('is_cheese')
[0,10,'The Cheese Moved']
Parameters:tag_name – The name of the tag
Returns:The tagged location and value (or a list if more than one)
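
Because the result can be either a single entry or a list of entries, calling code may want to normalise it before looping. The helper as_tag_list below is hypothetical and not part of the mix-in. It also assumes that a missing tag is reported as None, which the text above does not confirm.

```python
def as_tag_list(tag_value):
    """Hypothetical helper: always return a list of [start, end, value] entries."""
    # Assumption: a missing tag is represented by None (or an empty list).
    if not tag_value:
        return []
    # Several matches arrive as a list of lists; a single match as a flat list.
    if isinstance(tag_value[0], list):
        return tag_value
    return [tag_value]


single = as_tag_list([0, 10, "The Cheese Moved"])
several = as_tag_list([[0, 4, "Brie"], [9, 16, "Cheddar"]])
```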
find(self, content_re='.*', type_re='.*', direction=<FindDirection.CHILDREN: 1>, tag_name=None, instance=0, tag_name_re=None, use_all_content=False)

Search for a node that matches on the value and/or type using regular expressions

>>> document.content_node.find(content_re='.*Cheese.*',instance=2)
<kodexa.model.model.ContentNode object at 0x7f80605e53c8>
Parameters:
  • content_re – the regular expression to match the node's content (default ‘.*’)
  • type_re – the regular expression to match the node's type (default ‘.*’)
  • direction – the direction to search, either FindDirection.CHILDREN or FindDirection.PARENT (default CHILDREN)
  • tag_name – the tag name that must exist
  • instance – the instance to return (0=first)
  • tag_name_re – the regular expression to match for the tag_name
  • use_all_content – also match against the content of all child nodes (default False)
Returns:

either an instance of ContentNode (if found), or None

findall(self, content_re='.*', type_re='.*', direction=<FindDirection.CHILDREN: 1>, tag_name=None, tag_name_re=None, use_all_content=False)

Search for nodes that match on the value and/or type using regular expressions

>>> document.content_node.findall(content_re='.*Cheese.*')
[<kodexa.model.model.ContentNode object at 0x7f80605e53c8>,
<kodexa.model.model.ContentNode object at 0x7f80605e53c8>]
Parameters:
  • content_re – the regular expression to match the node's content (default ‘.*’)
  • type_re – the regular expression to match the node's type (default ‘.*’)
  • direction – the direction to search, either FindDirection.CHILDREN or FindDirection.PARENT (default CHILDREN)
  • tag_name – the tag name that must exist
  • tag_name_re – the regular expression to match for the tag_name
  • use_all_content – also match against the content of all child nodes (default False)
Returns:

list of matching content nodes
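
The relationship between find and findall can be sketched with a small stand-in tree. The Node class and both functions below are hypothetical and only mirror the parameter names above. The sketch assumes a depth-first search over descendants, full-string regular expression matching, and that instance picks the n-th match (0 = first).

```python
import re


class Node:
    """Hypothetical stand-in for a content node with a type, content and children."""

    def __init__(self, node_type, content="", children=None):
        self.type = node_type
        self.content = content
        self.children = children or []


def findall(node, content_re=".*", type_re=".*"):
    """Depth-first list of descendants whose content and type both match."""
    matches = []
    for child in node.children:
        if re.fullmatch(content_re, child.content) and re.fullmatch(type_re, child.type):
            matches.append(child)
        matches.extend(findall(child, content_re, type_re))
    return matches


def find(node, content_re=".*", type_re=".*", instance=0):
    """Return the match at position `instance`, or None if there are too few matches."""
    matches = findall(node, content_re, type_re)
    return matches[instance] if 0 <= instance < len(matches) else None


root = Node(
    "page",
    children=[
        Node("line", "Blue Cheese"),
        Node("line", "Bread"),
        Node("line", "Cheese board"),
    ],
)
second = find(root, content_re=".*Cheese.*", instance=1)
```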

find_with_feature_value(self, feature_type, feature, value, direction=<FindDirection.CHILDREN: 1>, instance=1)

Search for a node with a specific feature type, name and value

>>> document.content_node.find_with_feature_value(feature_type='tag',feature='is_cheese',value=[1,10,'The Cheese has moved'])
<kodexa.model.model.ContentNode object at 0x7f80605e53c8>
Parameters:
  • feature_type – the feature type
  • feature – the feature name
  • value – the feature value
  • direction – the direction to search, either FindDirection.CHILDREN or FindDirection.PARENT (default CHILDREN)
  • instance – the instance to get (default 1)
Returns:

either an instance of ContentNode (if found), or None

findall_with_feature_value(self, feature_type, feature_name, value, direction=<FindDirection.CHILDREN: 1>)

Search for all nodes with a specific feature type, name and value

>>> document.content_node.findall_with_feature_value(feature_type='tag', feature_name='is_cheese', value=[1,10,'The Cheese has moved'])
[<kodexa.model.model.ContentNode object at 0x7f80605e53c8>]
Parameters:
  • feature_type – the feature type
  • feature_name – the feature name
  • value – the feature value
  • direction – the direction to search, either FindDirection.CHILDREN or FindDirection.PARENT (default CHILDREN)
Returns:

list of the matching content nodes

get_tags(self)

Returns a list of the names of the tags on the given node

>>> document.content_node.find(content_re='.*Cheese.*').get_tags()
['is_cheese']
Returns:A list of the tag names
move_child_to_parent(self, target_child, target_parent)

This will move the target_child, which must be a child of the node, to a new parent.

It is added to the end of the new parent's children

>>> document.content_root.move_child_to_parent(document.content_root.find(type_re='line'), document.content_root)
Parameters:
  • target_child – the child node that needs to be moved
  • target_parent – the parent to attach this node to
adopt_children(self, children, replace=False)

This will take a list of content nodes and adopt them under this node, ensuring they are re-parented.

The adopted nodes are added to the end of this node's children

>>> document.content_root.adopt_children(document.content_root.findall(type_re='line'), replace=True)
Parameters:
  • children – a list of the children to adopt
  • replace – if True the node will remove all current children and replace them with the new list
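
Re-parenting can be illustrated with a hypothetical stand-in Node class and a free function adopt_children, neither of which is the kodexa implementation. The sketch assumes that adopting a node detaches it from its previous parent and that replace=True first detaches the current children.

```python
class Node:
    """Hypothetical stand-in for a content node with a parent and children."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []


def adopt_children(node, children, replace=False):
    if replace:
        # Detach the current children before adopting the new list.
        for child in node.children:
            child.parent = None
        node.children = []
    for child in list(children):
        # Remove the child from its previous parent so it has one parent only.
        if child.parent is not None and child in child.parent.children:
            child.parent.children.remove(child)
        child.parent = node
        node.children.append(child)


old_parent, new_parent, line = Node("old"), Node("new"), Node("line")
adopt_children(old_parent, [line])
adopt_children(new_parent, [line])
```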
remove_tag(self, tag_name)

This will remove a tag from the given node

>>> document.content_node.remove_tag(tag_name='foo')
Parameters:tag_name – The name of the tag to remove
collect_nodes_to(self, end_node)

Return a list of the sibling nodes between the current node and the end node

>>> document.content_node.children[0].collect_nodes_to(end_node=document.content_node.children[5])
Parameters:end_node – The node to end at
tag_nodes_to(self, end_node, tag_name)

Tag all the nodes from this node to the end node with the given tag name

>>> document.content_node.children[0].tag_nodes_to(document.content_node.children[5], tag_name='foo')
Parameters:
  • end_node – The node to end with
  • tag_name – The tag name
get_node_at_index(self, index)

Returns the node at a specific index. If the index is outside the range from the first index (0) to the last index, it returns None. Otherwise it returns the node at that index; if there isn’t a node at that index, it returns a ‘virtual’ node that represents the node to the side of this node, with an index but no features or content.

Parameters:index – The index (zero-based) for the child node
Returns:Node at index, or None if the index is outside the boundaries of child nodes
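
The three outcomes described above (None, a real node, or a virtual node) can be sketched as follows. The Node class, its virtual flag and the function node_at_index are hypothetical stand-ins chosen for illustration and do not reflect the actual implementation.

```python
class Node:
    """Hypothetical stand-in: a child node identified by its index."""

    def __init__(self, index, content=None, virtual=False):
        self.index = index
        self.content = content
        self.virtual = virtual


def node_at_index(children, index):
    """children holds the real child nodes; their indexes may have gaps."""
    if not children:
        return None
    last_index = max(child.index for child in children)
    if index < 0 or index > last_index:
        return None  # outside the boundaries of the child nodes
    for child in children:
        if child.index == index:
            return child  # a real node exists at this index
    # Inside the boundaries but nothing stored: stand in with a virtual node.
    return Node(index, virtual=True)


children = [Node(0, "first"), Node(2, "third")]
gap = node_at_index(children, 1)
```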
has_next_node(self, type_re='.*', skip_virtual=False)

Returns True if the node has a next node

Parameters:
  • type_re – Type name (regular expression)
  • skip_virtual – True to skip any virtual nodes and only return the next real node (default False)
Returns:

True if there is a next sibling node

has_previous_node(self, type_re='.*', skip_virtual=False)

Returns True if the node has a previous node

Parameters:
  • type_re – Type name (regular expression)
  • skip_virtual – True to skip any virtual nodes and only return the previous real node (default False)
Returns:

True if there is a previous sibling node

next_node(self, type_re='.*', skip_virtual=False, has_no_content=False, traverse=<Traverse.SIBLING: 1>)

Returns the next sibling content node. This logic is based on the index, so the next node might actually be a virtual node created to fill a gap in the document, since the index allows for sparse documents.

Parameters:
  • type_re – the regular expression for the type of node (default ‘.*’)
  • skip_virtual – True to skip any virtual nodes and only return the next real node (default False)
  • has_no_content – True if you only want to return a node that has no content
  • traverse – By default we traverse siblings, however you can include CHILDREN, PARENT or ALL
Returns:

the next node or None if no node exists
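
The effect of skip_virtual on index-based traversal can be sketched with plain integers. The function next_index is hypothetical. It treats any index below the last real index that holds no real node as a virtual node, which is an assumption drawn from the description above.

```python
def next_index(real_indexes, current, skip_virtual=False):
    """Return the index of the next sibling, or None if there is none.

    real_indexes: indexes that hold a real node. Gaps below the last real
    index are treated as virtual nodes.
    """
    if not real_indexes or current >= max(real_indexes):
        return None
    if not skip_virtual:
        # The very next index, whether it holds a real or a virtual node.
        return current + 1
    # Jump over any gaps to the next index that holds a real node.
    return min(i for i in real_indexes if i > current)


real = [0, 3, 4]
with_virtual = next_index(real, 0)
real_only = next_index(real, 0, skip_virtual=True)
```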

previous_node(self, type_re='.*', skip_virtual=False, has_no_content=False, traverse=<Traverse.SIBLING: 1>)

Returns the previous sibling content node. This logic is based on the index, so the previous node might actually be a virtual node created to fill a gap in the document, since the index allows for sparse documents.

Parameters:
  • type_re – the regular expression for the type of node (default ‘.*’)
  • skip_virtual – True to skip any virtual nodes and only return the previous real node (default False)
  • has_no_content – True if you only want to return a node that has no content
  • traverse – By default we traverse siblings, however you can include CHILDREN, PARENT or ALL
Returns:

the previous node or None if no node exists

get_last_child_index(self)

Returns the maximum index value for the children of this node. If the node has no children it returns None

Returns:The max index of the children of this node, or None if there are no children
is_first_child(self)

Returns True if this is the first child, also True if it has no parent

Returns:True if this is the first child
is_last_child(self)

Returns True if this is the last child, also True if it has no parent

Returns:True if this is the last child

Spatial

One of the core mix-ins is Spatial. It is based on the concept of holding spatial information about the content nodes.

The mix-in’s methods can then use this spatial information, allowing you both to retrieve it and to query it.

set_statistics(self, statistics)

Set the spatial statistics for this node

>>> document.content_node.find(type_re='page').set_statistics(NodeStatistics())
Parameters:statistics – the statistics object
get_statistics(self)

Get the spatial statistics for this node

>>> document.content_node.find(type_re='page').get_statistics()
<kodexa.spatial.NodeStatistics object at 0x7f80605e53c8>
Returns:the statistics object (or None if not set)
set_bbox(self, bbox)

Set the bounding box for the node. This is structured as:

[x1,y1,x2,y2]

>>> document.content_node.find(type_re='page').set_bbox([10,20,50,100])
Parameters:bbox – the bounding box array
get_bbox(self)

Get the bounding box for the node. This is structured as:

[x1,y1,x2,y2]

>>> document.content_node.find(type_re='page').get_bbox()
[10,20,50,100]
Returns:the bounding box array
set_rotate(self, rotate)

Set the rotation of the node

>>> document.content_node.find(type_re='page').set_rotate(90)
Parameters:rotate – the rotation of the node
get_rotate(self)

Get the rotation of the node

>>> document.content_node.find(type_re='page').get_rotate()
90
Returns:the rotation of the node
get_x(self)

Get the X position of the node

>>> document.content_node.find(type_re='page').get_x()
10
Returns:the X position of the node
get_y(self)

Get the Y position of the node

>>> document.content_node.find(type_re='page').get_y()
90
Returns:the Y position of the node
get_width(self)

Get the width of the node

>>> document.content_node.find(type_re='page').get_width()
70
Returns:the width of the node
get_height(self)

Get the height of the node

>>> document.content_node.find(type_re='page').get_height()
40
Returns:the height of the node
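
One plausible relationship between a bounding box and the position and size getters is sketched below. The function bbox_metrics is hypothetical. It assumes that the X and Y positions are the first corner (x1, y1) and that width and height are the differences between the two corners. The text above does not confirm this.

```python
def bbox_metrics(bbox):
    """Hypothetical helper: derive position and size from [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = bbox
    # Assumption: position is the first corner, size is the corner difference.
    return {"x": x1, "y": y1, "width": x2 - x1, "height": y2 - y1}


metrics = bbox_metrics([10, 20, 50, 100])
```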
set_bbox_from_children(self)

Set the bounding box for this node based on its children
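
A natural reading of this method is that the node receives the smallest box enclosing all of its children. The sketch below shows that union on plain [x1,y1,x2,y2] lists. The function bbox_from_children is hypothetical, and both the union behaviour and the skipping of children without a bounding box are assumptions.

```python
def bbox_from_children(child_bboxes):
    """Hypothetical helper: smallest [x1, y1, x2, y2] enclosing every child box."""
    # Assumption: children without a bounding box (None) are ignored.
    boxes = [box for box in child_bboxes if box is not None]
    if not boxes:
        return None
    return [
        min(box[0] for box in boxes),
        min(box[1] for box in boxes),
        max(box[2] for box in boxes),
        max(box[3] for box in boxes),
    ]


page_bbox = bbox_from_children([[10, 20, 50, 40], None, [5, 30, 45, 100]])
```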

collapse(self, type_re)

Collapses the given type, removing it from the hierarchy.

Parameters:type_re – the type that you will collapse
Returns: