API Documentation

Kodexa is a Python framework to enable flexible data engineering with semi-structured and unstructured documents and data.

Core Model

The core model provides the object structure for Documents, ContentNodes and Features which is used as the foundation for working with unstructured data in the framework.

To create a new instance of a Document, provide a DocumentMetadata object:

>>> document = Document(DocumentMetadata())
class Document(metadata=None, content_node: kodexa.model.model.ContentNode = None, source=<kodexa.model.model.SourceMetadata object>)[source]

A Document is a collection of metadata and a set of content nodes.

add_mixin(mixin)[source]

Add the given mixin to this document. This applies the mixin to all the content nodes and also registers it with the document, so that future invocations of create_node will ensure the node has the mixin applied.

>>> document.add_mixin('spatial')
create_node(node_type: str, content: str = None, virtual: bool = False, parent: kodexa.model.model.ContentNode = None, index: int = 0)[source]

Creates a new node for the document. The new node is not added to the document, but any mixins that have been applied to the document will also be available on the new node.

>>> document.create_node(node_type='page')
<kodexa.model.model.ContentNode object at 0x7f80605e53c8>
Parameters:
  • node_type (str) – The type of node.
  • content (str) – The content for the node; defaults to None.
  • virtual (bool) – Indicates if this is a ‘real’ or ‘virtual’ node; default is False. ‘Real’ nodes contain document content. ‘Virtual’ nodes are synthesized as necessary to fill gaps between non-consecutively indexed siblings. Such indexing arises when document content is sparse.
  • parent (ContentNode) – The parent for this newly created node; default is None.
  • index (int) – The index property to be set on this node; default is 0.

Returns:This newly created node.
Return type:ContentNode
static from_dict(doc_dict)[source]

Build a new Document from a dictionary.

>>> Document.from_dict(doc_dict)
Parameters:doc_dict (dict) – A dictionary representation of a Kodexa Document.
Returns:A complete Kodexa Document
Return type:Document
classmethod from_file(file)[source]

Creates a Document that has a ‘file-handle’ connector to the specified file.

Parameters:file (file) – The file to which the new Document is connected.
Returns:A Document connected to the specified file.
Return type:Document
static from_json(json_string)[source]

Create an instance of a Document from a JSON string.

>>> Document.from_json(json_string)
Parameters:json_string (str) – A JSON string representation of a Kodexa Document
Returns:A complete Kodexa Document
Return type:Document
static from_kdxa(file_path)[source]

Read a .kdxa file from the given file_path.

>>> document = Document.from_kdxa('my-document.kdxa')
Parameters:file_path – The path to the .kdxa file
static from_msgpack(bytes)[source]

Create an instance of a Document from a message pack byte array.

>>> Document.from_msgpack(open('news-doc.kdxa', 'rb').read())
Parameters:bytes (bytes) – A message pack byte array.
Returns:A complete Kodexa Document
Return type:Document
classmethod from_url(url, headers=None)[source]

Creates a Document that has a ‘url’ connector for the specified url.

Parameters:
  • url (str) – The URL to which the new Document is connected.
  • headers (dict) – Headers that should be used when reading from the URL
Returns:

A Document connected to the specified URL with the specified headers (if any).

Return type:

Document

get_mixins()[source]

Get the list of mixins that have been enabled on this document.

>>> document.get_mixins()
['spatial','finders']
get_root()[source]

Get the root content node for the document (same as content_node)

>>> node = document.get_root()
select(selector, variables={})[source]

Execute a selector on the root node and then return a list of the matching nodes.

>>> document.select('.')
   [ContentNode]
Parameters:
  • selector (str) – The selector (e.g. //*)
  • variables (dict, optional) – A dictionary of variable name/value pairs to use in substitution; defaults to an empty dictionary. Dictionary keys should match a variable specified in the selector.
Returns:

A list of the matching ContentNodes. If no matches found, list is empty.

Return type:

list[ContentNode]

select_as_node(selector, variables={})[source]

Execute a selector on the root node and then return a new ContentNode with the results set as its children.

>>> document.select_as_node('//line')
   ContentNode
Parameters:
  • selector – The selector (e.g. //*)
  • variables (dict, optional) – A dictionary of variable name/value pairs to use in substitution; defaults to an empty dictionary. Dictionary keys should match a variable specified in the selector.
Returns:

A new ContentNode. All ContentNodes on this Document that match the selector value are added as the children for the returned ContentNode.

Return type:

ContentNode

to_dict()[source]

Create a dictionary representing this Document’s structure and content.

>>> document.to_dict()
Returns:A dictionary representation of this Document.
Return type:dict
to_html()[source]

Generate HTML and JavaScript necessary for rendering this Document.

Returns:HTML and JavaScript that will render this Document
Return type:str
to_json()[source]

Create a JSON string representation of this Document.

>>> document.to_json()
Returns:The JSON formatted string representation of this Document.
Return type:str
to_kdxa(file_path)[source]

Write the document to the .kdxa format (msgpack), which can be used with the Kodexa platform.

>>> document.to_kdxa('my-document.kdxa')
Parameters:file_path – The path to the .kdxa file you wish to create
to_msgpack()[source]

Convert this document object structure into a message pack

>>> document.to_msgpack()
class DocumentMetadata(*args, **kwargs)[source]

A flexible, dict-based approach to capturing metadata for the document.

class ContentNode(document, node_type: str, content='', content_parts=None)[source]

A Content Node identifies a section of the document containing a logical grouping of information.

The node will have content and can include any number of features.

You should always create a node using the Document’s create_node method to ensure that the correct mixins are applied.

>>> new_page = document.create_node(node_type='page')
<kodexa.model.model.ContentNode object at 0x7f80605e53c8>
>>> current_content_node.add_child(new_page)

or

>>> new_page = document.create_node(node_type='page', content='This is page 1')
<kodexa.model.model.ContentNode object at 0x7f80605e53c8>
>>> current_content_node.add_child(new_page)
add_child(child, index=None)[source]

Add a ContentNode as a child of this ContentNode

>>> new_page = document.create_node(node_type='page')
<kodexa.model.model.ContentNode object at 0x7f80605e53c8>
>>> current_content_node.add_child(new_page)
Parameters:
  • child (ContentNode) – The node that will be added as a child of this node
  • index (int, optional) – The index at which this child node should be added; defaults to None. If None, index is set as the count of child node elements.
add_child_content(node_type, content, index=None)[source]

Convenience method that allows you to quickly add a child node with a type and content.

Parameters:
  • node_type – the node type
  • content – the content
  • index – the index (optional)
Returns:

the new ContentNode

add_feature(feature_type, name, value, single=True, serialized=False)[source]

Add a new feature to this ContentNode.

Note: if a feature for this feature_type/name already exists, the new value will be added to the existing feature; therefore the feature value might become a list.

>>> new_page = document.create_node(node_type='page')
<kodexa.model.model.ContentNode object at 0x7f80605e53c8>
>>> new_page.add_feature('pagination','pageNum',1)
Parameters:
  • feature_type (str) – The type of feature to be added to the node.
  • name (str) – The name of the feature.
  • value (Any) – The value of the feature.
  • single (bool, optional) – Indicates that the value is singular, rather than a collection (ex: str vs list); defaults to True.
  • serialized (bool, optional) – Indicates that the value is/is not already serialized; defaults to False.
Returns:

The feature that was added to this ContentNode.

Return type:

ContentFeature
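
The difference between accumulating a value (add_feature) and replacing it (set_feature) can be illustrated with a small stand-alone sketch. It does not use Kodexa: FeatureStore and its internals are hypothetical stand-ins that only mirror the semantics described in this reference, and the promotion of a single value to a list is an assumption about how "might become a list" plays out.

```python
class FeatureStore:
    """Hypothetical stand-in for a node's features; not Kodexa code."""

    def __init__(self):
        # Values are keyed by (feature_type, name), e.g. ('pagination', 'pageNum').
        self._features = {}

    def add_feature(self, feature_type, name, value):
        key = (feature_type, name)
        if key not in self._features:
            # First value for this type/name is stored as-is.
            self._features[key] = value
        elif isinstance(self._features[key], list):
            # Already a collection: append the new value.
            self._features[key].append(value)
        else:
            # Second value: promote the single value to a list (assumption).
            self._features[key] = [self._features[key], value]
        return self._features[key]

    def set_feature(self, feature_type, name, value):
        # Replace whatever was stored before.
        self._features[(feature_type, name)] = value
        return value

    def get_feature_value(self, feature_type, name):
        # None when the feature does not exist.
        return self._features.get((feature_type, name))


store = FeatureStore()
store.add_feature('pagination', 'pageNum', 1)
store.add_feature('pagination', 'pageNum', 2)
after_add = store.get_feature_value('pagination', 'pageNum')

store.set_feature('pagination', 'pageNum', 7)
after_set = store.get_feature_value('pagination', 'pageNum')
```

Here after_add holds both values as a list, while after_set holds only the replacement value.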

adopt_children(children, replace=False)[source]

This will take a list of content nodes and adopt them under this node, ensuring they are re-parented.

>>> # select all nodes of type 'line', then the root node 'adopts' them
>>> # and replaces all its existing children with these 'line' nodes.
>>> document.get_root().adopt_children(document.select('//line'), replace=True)
Parameters:
  • children (list[ContentNode]) – A list of ContentNodes that will be added to the end of this node’s children collection
  • replace (bool) – If True, will remove all current children and replace them with the new list; defaults to False
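
Re-parenting can be pictured with a minimal tree sketch. The Node class below is a hypothetical stand-in, not the Kodexa ContentNode; detaching a child from its previous parent before adopting it is an assumption about what "re-parented" involves.

```python
class Node:
    """Hypothetical minimal tree node; not the Kodexa ContentNode."""

    def __init__(self, node_type):
        self.node_type = node_type
        self.parent = None
        self.children = []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)

    def adopt_children(self, children, replace=False):
        if replace:
            # Detach the current children before adopting the new list.
            for old in self.children:
                old.parent = None
            self.children = []
        for child in list(children):
            # Remove the child from its previous parent, if it has one.
            if child.parent is not None and child in child.parent.children:
                child.parent.children.remove(child)
            self.add_child(child)


root = Node('document')
page = Node('page')
root.add_child(page)
lines = [Node('line'), Node('line')]
for line in lines:
    page.add_child(line)

# The root adopts the 'line' nodes and drops its existing 'page' child.
root.adopt_children(lines, replace=True)
```

After the call, the lines hang directly off the root and the page no longer has children or a parent.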
collect_nodes_to(end_node)[source]

Get the sibling nodes between the current node and the end_node.

>>> document.content_node.children[0].collect_nodes_to(end_node=document.content_node.children[5])
Parameters:end_node (ContentNode) – The node to end at
Returns:A list of sibling nodes between this node and the end_node.
Return type:list[ContentNode]
find(content_re='.*', node_type_re='.*', direction=<FindDirection.CHILDREN: 1>, tag_name=None, instance=1, tag_name_re=None, use_all_content=False)[source]

Return a node related to this node (parent or child) that matches the content and/or node type specified by regular expressions.

>>> document.get_root().find(content_re='.*Cheese.*',instance=2)
<kodexa.model.model.ContentNode object at 0x7f80605e53c8>
Parameters:
  • content_re (str, optional) – The regular expression to match against the node’s content; default is ‘.*’.
  • node_type_re (str, optional) – The regular expression to match against the node’s type; default is ‘.*’.
  • direction (FindDirection(enum), optional) – The direction to search (CHILDREN or PARENT); default is FindDirection.CHILDREN.
  • tag_name (str, optional) – The tag name that must exist on the node; default is None.
  • instance (int, optional) – The instance of the matching node to return (may have multiple matches). Value must be greater than zero; default is 1.
  • tag_name_re (str, optional) – The regular expression that will match the tag_name that must exist on the node; default is None.
  • use_all_content (bool, optional) – Match content_re against the content of this node concatenated with the content of its child nodes; default is False.
Returns:

Matching node (if found), or None.

Return type:

ContentNode or None.
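
The role of the instance parameter can be sketched over plain strings. This is not Kodexa code: find_instance is a hypothetical helper, and using re.match (anchored at the start of the content) is an assumption rather than a statement about how the library matches.

```python
import re


def find_instance(contents, content_re='.*', instance=1):
    """Return the instance-th item whose content matches content_re, or None."""
    if instance < 1:
        raise ValueError('instance must be greater than zero')
    pattern = re.compile(content_re)
    seen = 0
    for content in contents:
        if pattern.match(content):
            seen += 1
            if seen == instance:
                # This is the requested occurrence among the matches.
                return content
    return None


children = ['Cheese board', 'Fish pie', 'Blue Cheese dip', 'More Cheese']
second = find_instance(children, content_re='.*Cheese.*', instance=2)
fourth = find_instance(children, content_re='.*Cheese.*', instance=4)
```

Only three items match, so asking for the fourth instance yields None.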

find_with_feature_value(feature_type, feature_name, value, direction=<FindDirection.CHILDREN: 1>, instance=1)[source]

Return a node related to this node (parent or child) that has a specific feature type, feature name, and feature value.

>>> document.content_node.find_with_feature_value(feature_type='tag',feature_name='is_cheese',value=[1,10,'The Cheese has moved'])
<kodexa.model.model.ContentNode object at 0x7f80605e53c8>
Parameters:
  • feature_type (str) – The feature type.
  • feature_name (str) – The feature name.
  • value (Any) – The feature value.
  • direction (FindDirection(enum), optional) – The direction to search (CHILDREN or PARENT); default is FindDirection.CHILDREN.
  • instance (int, optional) – The instance of the matching node to return (may have multiple matches). Value must be greater than zero; default is 1.
Returns:

Matching node (if found), or None.

Return type:

ContentNode or None

findall(content_re='.*', node_type_re='.*', direction=<FindDirection.CHILDREN: 1>, tag_name=None, tag_name_re=None, use_all_content=False)[source]

Search for related nodes (child or parent) that match the content and/or type specified by regular expressions.

>>> document.content_node.findall(content_re='.*Cheese.*')
[<kodexa.model.model.ContentNode object at 0x7f80605e53c8>,
<kodexa.model.model.ContentNode object at 0x7f80605e53c8>]
Parameters:
  • content_re (str, optional) – The regular expression to match against the node’s content; default is ‘.*’.
  • node_type_re (str, optional) – The regular expression to match against the node’s type; default is ‘.*’.
  • direction (FindDirection(enum), optional) – The direction to search (CHILDREN or PARENT); default is FindDirection.CHILDREN.
  • tag_name (str, optional) – The tag name that must exist on the node; default is None.
  • tag_name_re (str, optional) – The regular expression that will match the tag_name that must exist on the node; default is None.
  • use_all_content (bool, optional) – Match content_re against the content of this node concatenated with the content of its child nodes; default is False.
Returns:

List of matching content nodes

Return type:

list[ContentNode]

findall_compiled(value_re_compiled, node_type_re_compiled, direction, tag_name, tag_name_compiled, use_all_content)[source]

Search for nodes that match on the value and/or type, using precompiled regular expressions.

findall_with_feature_value(feature_type, feature_name, value, direction=<FindDirection.CHILDREN: 1>)[source]

Get all nodes related to this node (parents or children) that have a specific feature type, feature name, and feature value.

>>> document.content_node.findall_with_feature_value(feature_type='tag',feature_name='is_cheese', value=[1,10,'The Cheese has moved'])
[<kodexa.model.model.ContentNode object at 0x7f80605e53c8>]
Parameters:
  • feature_type (str) – The feature type.
  • feature_name (str) – The feature name.
  • value (Any) – The feature value.
  • direction (FindDirection(enum), optional) – The direction to search (CHILDREN or PARENT); default is FindDirection.CHILDREN.
Returns:

list of the matching content nodes

Return type:

list[ContentNode]

static from_dict(document, content_node_dict: addict.addict.Dict)[source]

Build a new ContentNode from a dictionary representation.

>>> ContentNode.from_dict(document, content_node_dict)
Parameters:
  • document (Document) – The Kodexa document from which the new ContentNode will be created (not added).
  • content_node_dict (dict) – The dictionary-structured representation of a ContentNode. This value will be unpacked into a ContentNode.
Returns:

A ContentNode containing the unpacked values from the content_node_dict parameter.

Return type:

ContentNode

get_all_content(separator=' ')[source]

Get this node’s content, concatenated with all of its children’s content.

>>> document.content_node.get_all_content()
"This string is made up of multiple nodes"
Parameters:separator (str, optional) – The separator to use in joining content together; defaults to ' ' (a single space).
Returns:The complete content for this node concatenated with the content of all child nodes.
Return type:str
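
A minimal sketch of this kind of concatenation is shown below. The Node class is a hypothetical stand-in, not the Kodexa ContentNode; the depth-first order and the skipping of empty content are assumptions.

```python
class Node:
    """Hypothetical minimal node; not the Kodexa ContentNode."""

    def __init__(self, content=None, children=None):
        self.content = content
        self.children = children or []

    def get_all_content(self, separator=' '):
        parts = []
        if self.content:
            parts.append(self.content)
        for child in self.children:
            # Recurse so that grandchildren are included as well.
            child_text = child.get_all_content(separator)
            if child_text:
                parts.append(child_text)
        return separator.join(parts)


root = Node(children=[
    Node('This string'),
    Node('is made up', [Node('of multiple nodes')]),
])
joined = root.get_all_content()
piped = root.get_all_content(separator='|')
```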
get_all_tags()[source]

Get the names of all tags that have been applied to this node or to its children.

>>> document.content_node.find(content_re='.*Cheese.*').get_all_tags()
['is_cheese']
Returns:A list of the tag names belonging to this node and/or its children.
Return type:list[str]
get_children()[source]

Returns a list of the children of this node.

>>> node.get_children()
Returns:The list of child nodes for this ContentNode.
Return type:list[ContentNode]
get_content()[source]

Get the content of this node.

>>> new_page.get_content()
"This is page one"
Returns:The content of this ContentNode.
Return type:str
get_feature(feature_type, name)[source]

Gets the feature with the given type and name.

>>> new_page.get_feature('pagination','pageNum')
1
Parameters:
  • feature_type (str) – The type of the feature.
  • name (str) – The name of the feature.
Returns:

The feature with the specified type & name. If no feature is found, None is returned.

Return type:

ContentFeature or None

get_feature_value(feature_type, name)[source]

Get the value for a feature with the given name and type on this ContentNode.

>>> new_page.get_feature_value('pagination','pageNum')
1
Parameters:
  • feature_type (str) – The type of the feature.
  • name (str) – The name of the feature.
Returns:

The value of the feature if it exists on this ContentNode; otherwise, None.

Return type:

Any or None

get_features()[source]

Get all features on this ContentNode.

Returns:A list of the features on this ContentNode.
Return type:list[ContentFeature]
get_features_of_type(feature_type)[source]

Get all features of a specific type.

>>> new_page.get_features_of_type('my_type')
[]
Parameters:feature_type (str) – The type of the feature.
Returns:A list of features with the specified type. If no features are found, an empty list is returned.
Return type:list[ContentFeature]
get_last_child_index()[source]

Returns the max index value for the children of this node. If the node has no children, returns None.

Returns:The max index of the children of this node, or None if there are no children.
Return type:int or None
get_node_at_index(index)[source]

Returns the child node at the specified index. If the specified index is outside the first (0), or last child’s index, None is returned.

Note: documents allow for sparse representation and child nodes may not have consecutive index numbers. If there isn’t a child node at the specified index, a ‘virtual’ node will be returned. This ‘virtual’ node will have the node type of its nearest sibling and will have an index value, but will have no features or content.

Parameters:index (int) – The index (zero-based) for the child node.
Returns:Node at index, or None if the index is outside the boundaries of child nodes.
Return type:ContentNode or None
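
The sparse-index behaviour described above can be sketched as follows. The Node class is a hypothetical stand-in, not the Kodexa ContentNode, and the tie-breaking rule for "nearest sibling" is an assumption.

```python
class Node:
    """Hypothetical minimal node; not the Kodexa ContentNode."""

    def __init__(self, node_type, index, content=None, virtual=False):
        self.node_type = node_type
        self.index = index
        self.content = content
        self.virtual = virtual
        self.children = []

    def get_node_at_index(self, index):
        if not self.children:
            return None
        by_index = {child.index: child for child in self.children}
        if index < 0 or index > max(by_index):
            # Outside the first (0) or the last child's index.
            return None
        if index in by_index:
            return by_index[index]
        # Gap in a sparse document: synthesize a virtual node that borrows
        # the type of the nearest real sibling (on a tie, the sibling that
        # appears first in the children list wins; this is an assumption).
        nearest = min(self.children, key=lambda child: abs(child.index - index))
        return Node(nearest.node_type, index, virtual=True)


page = Node('page', 0)
page.children = [Node('line', 0, 'first line'), Node('line', 3, 'fourth line')]

real = page.get_node_at_index(3)      # an existing child
gap = page.get_node_at_index(1)       # no child here, so a virtual node
outside = page.get_node_at_index(4)   # beyond the last child's index
```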
get_node_type()[source]

Get the type of this node.

>>> new_page.get_node_type()
"page"
Returns:The type of this ContentNode.
Return type:str
get_tag(tag_name)[source]

Returns the value of a tag. This is either a single list [start,end,value] or, if multiple parts of this node’s content match, a list of lists, e.g. [[start1,end1,value1],[start2,end2,value2]].

>>> document.content_node.find(content_re='.*Cheese.*').get_tag('is_cheese')
[0,10,'The Cheese Moved']
Parameters:tag_name – The name of the tag
Returns:A list of the tagged locations and values for this tag in this node
get_tag_values(tag_name, include_children=False)[source]

Get the values for a specific tag name

Parameters:
  • tag_name – tag name
  • include_children – include the children of this node
Returns:

a list of the tag values

get_tags()[source]

Returns a list of the names of the tags on the given node

>>> document.content_node.select('*')[0].get_tags()
['is_cheese']
Returns:A list of the tag names
has_feature(feature_type, name)[source]

Determines if a feature with the given type and name exists on this content node.

>>> new_page.has_feature('pagination','pageNum')
True
Parameters:
  • feature_type (str) – The type of the feature.
  • name (str) – The name of the feature.
Returns:

True if the feature is present; else, False.

Return type:

bool

has_next_node(node_type_re='.*', skip_virtual=False)[source]

Determine if this node has a next sibling that matches the type specified by the node_type_re regex.

Parameters:
  • node_type_re (str, optional) – The regular expression to match against the next sibling node’s type; default is ‘.*’.
  • skip_virtual (bool, optional) – Skip virtual nodes and return the next real node; default is False.
Returns:

True if there is a next sibling node matching the specified type regex; else, False.

Return type:

bool

has_previous_node(node_type_re='.*', skip_virtual=False)[source]

Determine if this node has a previous sibling that matches the type specified by the node_type_re regex.

Parameters:
  • node_type_re (str, optional) – The regular expression to match against the previous sibling node’s type; default is ‘.*’.
  • skip_virtual (bool, optional) – Skip virtual nodes and consider the previous real node; default is False.
Returns:

True if there is a previous sibling node matching the specified type regex; else, False.

Return type:

bool

has_tag(tag)[source]

Determine if this node has a tag with the specified name.

>>> document.content_node.find(content_re='.*Cheese.*').has_tag('is_cheese')
True
>>> document.content_node.find(content_re='.*Cheese.*').has_tag('is_fish')
False
Parameters:tag (str) – The name of the tag.
Returns:True if node has a tag by the specified name; else, False.
Return type:bool
has_tags()[source]

Determines if this node has any tags at all.

>>> document.content_node.find(content_re='.*Cheese.*').has_tags()
True
Returns:True if node has any tags; else, False.
Return type:bool
is_first_child()[source]

Determines if this node is the first child of its parent or has no parent.

Returns:True if this node is the first child of its parent or if this node has no parent; else, False.
Return type:bool
is_last_child()[source]

Determines if this node is the last child of its parent or has no parent.

Returns:True if this node is the last child of its parent or if this node has no parent; else, False.
Return type:bool
move_child_to_parent(target_child, target_parent)[source]

This will move the target_child, which must be a child of this node, to a new parent.

It will be added to the end of the new parent’s children.

>>> # Get first node of type 'line' from the first page
>>> target_child = document.get_root().select('//page')[0].select('//line')[0]
>>> # Get sixth node of type 'page'
>>> target_parent = document.get_root().select('//page')[5]
>>> # Move target_child (line) to the target_parent (sixth page)
>>> document.get_root().move_child_to_parent(target_child, target_parent)
Parameters:
  • target_child (ContentNode) – The child node that will be moved to a new parent node (target_parent).
  • target_parent (ContentNode) – The parent node that the target_child will be added to. The target_child will be added at the end of the children collection.
next_node(node_type_re='.*', skip_virtual=False, has_no_content=True)[source]

Returns the next sibling content node.

Note: This logic relies on node indexes. Documents allow for sparse representation and child nodes may not have consecutive index numbers. Therefore, the next node might actually be a virtual node that is created to fill a gap in the document. You can skip virtual nodes by setting the skip_virtual parameter to True.

Parameters:
  • node_type_re (str, optional) – The regular expression to match against the next sibling node’s type; default is ‘.*’.
  • skip_virtual (bool, optional) – Skip virtual nodes and return the next real node; default is False.
  • has_no_content (bool, optional) – Allow a node that has no content to be returned; default is True.
Returns:

The next node or None, if no node exists

Return type:

ContentNode or None
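
The effect of skip_virtual on sparsely indexed siblings can be sketched without Kodexa. The next_node function below is a hypothetical helper working on a plain dictionary of real siblings; the returned tuples are illustrative only and do not reflect the library's return type.

```python
def next_node(siblings, current_index, skip_virtual=False):
    """siblings maps index -> content for the real nodes only."""
    later = sorted(index for index in siblings if index > current_index)
    if not later:
        # No sibling after the current one.
        return None
    if skip_virtual or later[0] == current_index + 1:
        # Either the very next index is real, or gaps are being skipped.
        return ('real', later[0], siblings[later[0]])
    # The next index is a gap, so a virtual node stands in for it.
    return ('virtual', current_index + 1, None)


siblings = {0: 'first line', 3: 'fourth line'}
with_virtual = next_node(siblings, 0)
without_virtual = next_node(siblings, 0, skip_virtual=True)
at_end = next_node(siblings, 3)
```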

previous_node(node_type_re='.*', skip_virtual=False, has_no_content=False, traverse=<Traverse.SIBLING: 1>)[source]

Returns the previous sibling content node.

Note: This logic relies on node indexes. Documents allow for sparse representation and child nodes may not have consecutive index numbers. Therefore, the previous node might actually be a virtual node that is created to fill a gap in the document. You can skip virtual nodes by setting the skip_virtual parameter to True.

Parameters:
  • node_type_re (str, optional) – The regular expression to match against the previous node’s type; default is ‘.*’.
  • skip_virtual (bool, optional) – Skip virtual nodes and return the previous real node; default is False.
  • has_no_content (bool, optional) – Allow a node that has no content to be returned; default is False.
  • traverse (Traverse(enum), optional) – The relationship you’d like to traverse (SIBLING, CHILDREN, PARENT, or ALL); default is Traverse.SIBLING.
Returns:

The previous node or None, if no node exists

Return type:

ContentNode or None

remove_feature(feature_type, name)[source]

Removes the feature with the given name and type from this node.

>>> new_page.remove_feature('pagination','pageNum')
Parameters:
  • feature_type (str) – The type of the feature.
  • name (str) – The name of the feature.
remove_tag(tag_name)[source]

Remove a tag from this content node.

>>> document.get_root().remove_tag('foo')
Parameters:tag_name (str) – The name of the tag that should be removed.
select(selector, variables=None)[source]

Select and return the child nodes of this node that match the selector value.

>>> document.get_root().select('.')
   [ContentNode]

or

>>> document.get_root().select('//*[hasTag($tagName)]', {"tagName": "div"})
   [ContentNode]
Parameters:
  • selector (str) – The selector (e.g. //*)
  • variables (dict, optional) – A dictionary of variable name/value pairs to use in substitution; defaults to None. Dictionary keys should match a variable specified in the selector.
Returns:

A list of the matching content nodes. If no matches are found, the list will be empty.

Return type:

list[ContentNode]

select_as_node(selector, variables=None)[source]

Select and return the child nodes of this content node that match the selector value. Matching nodes will be returned as the children of a new proxy content node.

Note this doesn’t impact this content node’s children. They are not adopted by the proxy node, therefore their parents remain intact.

>>> document.content_node.select_as_node('//line')
   ContentNode

or

>>> document.get_root().select_as_node('//*[hasTag($tagName)]', {"tagName": "div"})
   ContentNode
Parameters:
  • selector (str) – The selector (e.g. //*)
  • variables (dict, optional) – A dictionary of variable name/value pairs to use in substitution; defaults to None. Dictionary keys should match a variable specified in the selector.
Returns:

A new proxy ContentNode with the matching (selected) nodes as its children. If no matches are found, the list of children will be empty.

Return type:

ContentNode

set_feature(feature_type, name, value)[source]

Sets a feature for this ContentNode, replacing the value if a feature by this type and name already exists.

>>> new_page = document.create_node(node_type='page')
<kodexa.model.model.ContentNode object at 0x7f80605e53c8>
>>> new_page.set_feature('pagination','pageNum',1)
Parameters:
  • feature_type (str) – The type of feature to be added to the node.
  • name (str) – The name of the feature.
  • value (Any) – The value of the feature.
Returns:

The feature that was added to this ContentNode

Return type:

ContentFeature

tag(tag_to_apply, selector='.', content_re=None, use_all_content=False, node_only=False, fixed_position=None, data=None, separator=' ')[source]

This will tag (see Feature Tagging) the expression groups identified by the regular expression.

>>> document.content_node.tag('is_cheese')
Parameters:
  • tag_to_apply – The name of the tag that will be applied to the node
  • selector – The selector to identify the source nodes to work on (default . - the current node)
  • content_re – The regular expression that you wish to use to tag; note that a tag is created for each matching group
  • use_all_content – Apply the regular expression to the all_content (includes content from child nodes)
  • separator – Separator to use for use_all_content
  • node_only – Ignore the matching groups and tag the whole node
  • fixed_position – Use a fixed position, supplied as a tuple, e.g. (4,10) tags from position 4 to 10 (default None)
  • data – Attach a dictionary of data for the given tag
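
How a regular expression with groups could yield one [start, end, value] entry per matching group is sketched below. The tag_matching_groups function is a hypothetical helper, not part of Kodexa; it assumes Python's span convention (end is exclusive) and anchors the expression with re.match, neither of which is confirmed by this reference.

```python
import re


def tag_matching_groups(content, content_re):
    """Return a [start, end, value] list for each group that matched."""
    pattern = re.compile(content_re)
    match = pattern.match(content)
    if match is None:
        return []
    tags = []
    for group_number in range(1, pattern.groups + 1):
        value = match.group(group_number)
        if value is not None:
            # span() gives the character positions of this group.
            start, end = match.span(group_number)
            tags.append([start, end, value])
    return tags


tags = tag_matching_groups('The Cheese Moved to Paris', r'.*(Cheese).*(Paris)')
no_tags = tag_matching_groups('Nothing to see here', r'.*(Cheese).*')
```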
tag_nodes_to(end_node, tag_to_apply)[source]

Tag all the nodes from this node to the end_node with the given tag name

>>> document.content_node.children[0].tag_nodes_to(document.content_node.children[5], tag_to_apply='foo')
Parameters:
  • end_node (ContentNode) – The node to end with
  • tag_to_apply (str) – The tag name that will be applied to each node
tag_range(start_content_re, end_content_re, tag_to_apply, node_type_re='.*', use_all_content=False)[source]

This will tag all the child nodes between the start and end content regular expressions

>>> document.content_node.tag_range(start_content_re='.*Cheese.*', end_content_re='.*Fish.*', tag_to_apply='foo')
Parameters:
  • start_content_re – The regular expression to match the starting child
  • end_content_re – The regular expression to match the ending child
  • tag_to_apply – The tag name that will be applied to the nodes in range
  • node_type_re – The node type to match (default is all)
  • use_all_content – Use full content (including child nodes, default is False)
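
Range tagging between two content matches can be sketched with plain dictionaries. The tag_range function below is a hypothetical helper, not the Kodexa method; treating both the start and the end child as part of the range is an assumption.

```python
import re


def tag_range(children, start_content_re, end_content_re, tag_to_apply):
    """Tag each child from the first start match to the next end match."""
    start_re = re.compile(start_content_re)
    end_re = re.compile(end_content_re)
    in_range = False
    for child in children:
        if not in_range and start_re.match(child['content']):
            in_range = True
        if in_range:
            child['tags'].add(tag_to_apply)
            if end_re.match(child['content']):
                # The end child is tagged too, then the range closes.
                break


contents = ['Intro', 'Cheese course', 'Salad', 'Fish course', 'Dessert']
children = [{'content': content, 'tags': set()} for content in contents]
tag_range(children, '.*Cheese.*', '.*Fish.*', 'foo')
tagged = [child['content'] for child in children if 'foo' in child['tags']]
```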
tag_text_tree()[source]

Return a text tree.

to_dict()[source]

Create a dictionary representing this ContentNode’s structure and content.

>>> node.to_dict()
Returns:The properties of this ContentNode and all of its children structured as a dictionary.
Return type:dict
to_html()[source]

Generate HTML and javascript necessary for rendering this ContentNode.

Returns:HTML and javascript that will render this ContentNode
Return type:str
to_json()[source]

Create a JSON string representation of this ContentNode.

>>> node.to_json()
Returns:The JSON formatted string representation of this ContentNode.
Return type:str
class ContentFeature(feature_type, name, value, description=None, single=True)[source]

A feature that has been added to a ContentNode

to_dict()[source]

Create a dictionary representing this ContentFeature’s structure and content.

>>> feature.to_dict()
Returns:The properties of this ContentFeature structured as a dictionary.
Return type:dict

Kodexa Platform

Provides out-of-the-box integration with the Kodexa platform, enabling the universe of content services that are available.

class KodexaPlatform[source]
class RemotePipeline(slug, connector, version=None, attach_source=True, parameters=None, auth=None)[source]

Allows you to interact with a pipeline that has been deployed to an instance of Kodexa Platform

static from_file(slug: str, file_path: str) → kodexa.cloud.kodexa.RemotePipeline[source]

Create a new pipeline using a file path as a source

Parameters:
  • slug – The slug for the remote pipeline
  • file_path – The path to the file
Returns:

A new pipeline

Return type:

RemotePipeline

static from_folder(slug: str, folder_path: str, filename_filter: str = '*', recursive: bool = False, relative: bool = False, caller_path: str = '/home/docs/checkouts/readthedocs.org/user_builds/kodexa-kodexa/checkouts/latest/docs') → kodexa.cloud.kodexa.RemotePipeline[source]

Create a pipeline that will run against a set of local files from a folder

Parameters:
  • slug – The slug for the remote pipeline
  • folder_path – The folder path
  • filename_filter – The filter for filename (i.e. *.pdf)
  • recursive – Should we look recursively in sub-directories (default False)
  • relative – Is the folder path relative to the caller (default False)
  • caller_path – The caller path (defaults to trying to work this out from the stack)
Returns:

A new pipeline

Return type:

RemotePipeline
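
The interplay of filename_filter and recursive can be sketched with the standard library alone. The list_matching_files function is a hypothetical helper and does not describe how Kodexa walks folders; it assumes shell-style wildcard matching on the file name.

```python
import fnmatch
import tempfile
from pathlib import Path


def list_matching_files(folder_path, filename_filter='*', recursive=False):
    """Return files under folder_path whose names match filename_filter."""
    folder = Path(folder_path)
    # rglob descends into sub-directories; glob stays at the top level.
    candidates = folder.rglob('*') if recursive else folder.glob('*')
    return sorted(
        path for path in candidates
        if path.is_file() and fnmatch.fnmatch(path.name, filename_filter)
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'a.pdf').write_text('x')
    (root / 'notes.txt').write_text('x')
    (root / 'sub').mkdir()
    (root / 'sub' / 'b.pdf').write_text('x')

    top_level = [p.name for p in list_matching_files(root, '*.pdf')]
    everything = [p.name for p in list_matching_files(root, '*.pdf', recursive=True)]
```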

static from_text(slug: str, text: str, *args, **kwargs) → kodexa.cloud.kodexa.RemotePipeline[source]

Build a new pipeline and provide text as the basis to create a document

Parameters:
  • slug – The slug for the remote pipeline
  • text – Text to use to create document
Returns:

A new pipeline

Return type:

RemotePipeline

static from_url(slug: str, url, headers=None) → kodexa.cloud.kodexa.RemotePipeline[source]

Build a new pipeline with the input being a document created from the given URL

Parameters:
  • slug – The slug for the remote pipeline
  • url – The URL, e.g. https://www.google.com
  • headers – A dictionary of headers
Returns:

A new instance of a remote pipeline

set_sink(sink)[source]

Set the sink you wish to use. Note that it will replace any currently assigned sink.

>>> pipeline = Pipeline(FolderConnector(path='/tmp/', file_filter='example.pdf'))
>>> pipeline.set_sink(ExampleSink())
Parameters:sink – the sink for the pipeline
class RemoteAction(slug, version=None, attach_source=False, options=None, auth=None)[source]

Allows you to interact with an action that has been deployed in the Kodexa platform

to_configuration()[source]

Returns a dictionary representing the configuration information for the step

Returns:dictionary representing the configuration of the step

Connectors

Connectors provide a way to access documents (files or otherwise) from a source, and they form the starting point for Pipelines

class FileHandleConnector(file)[source]
class FolderConnector(path, file_filter='*', recursive=False, relative=False, caller_path='/home/docs/checkouts/readthedocs.org/user_builds/kodexa-kodexa/checkouts/latest/docs')[source]
class UrlConnector(url, headers=None)[source]

Pipeline

A Pipeline brings together a Connector, a set of steps, and a sink to perform data cleansing, normalization, analysis and more.

class Pipeline(connector, name='Default', stop_on_exception=True, logging_level=20)[source]

A pipeline represents a way to bring together parts of the kodexa framework to solve a specific problem.

When you create a Pipeline you must provide the connector that will be used to source the documents.

>>> pipeline = Pipeline(FolderConnector(path='/tmp/', file_filter='example.pdf'))
Parameters:
  • connector – the connector that will be the starting point for the pipeline
  • name – the name of the pipeline (default ‘Default’)
  • stop_on_exception – Should the pipeline raise exceptions and stop (default True)
  • logging_level – The logging level of the pipeline (default INFO)
add_step(step, name=None, enabled=True, condition=None, options=None, attach_source=False, parameterized=False)[source]

Add the given step to the current pipeline

>>> pipeline = Pipeline(FolderConnector(path='/tmp/', file_filter='example.pdf'))
>>> pipeline.add_step(ExampleStep())

Note that it is also possible to add a function as a step, for example

>>> def my_function(doc):
...     doc.metadata.fishstick = 'foo'
...     return doc
...
>>> pipeline.add_step(my_function)

If you are using remote actions on a server, or deploying to a remote pipeline, you can also use a shorthand

>>> pipeline.add_step('kodexa/html-parser', options={'summarize': False})
Parameters:
  • step – the step to add
  • name – the name to use to describe the step (default None)
  • enabled – is the step enabled (default True)
  • condition – condition to evaluate before executing the step (default None)
  • options – options to be passed to the step if it is a simplified remote action
  • attach_source – if step is simplified remote action this determines if we need to add the source
  • parameterized – apply the pipeline’s parameters to the options
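The way enabled and condition gate a function step can be pictured with a small stand-in. This is only a sketch and not Kodexa's implementation: MiniPipeline is a hypothetical class, documents are plain dictionaries, and it assumes a condition is a callable that receives the document.

```python
class MiniPipeline:
    """Hypothetical stand-in for a pipeline; it only models function steps."""

    def __init__(self, documents):
        self.documents = list(documents)
        self.steps = []

    def add_step(self, step, name=None, enabled=True, condition=None):
        self.steps.append(
            {
                "step": step,
                "name": name or step.__name__,
                "enabled": enabled,
                "condition": condition,
            }
        )
        return self  # allow chaining

    def run(self):
        results = []
        for doc in self.documents:
            for entry in self.steps:
                if not entry["enabled"]:
                    continue  # disabled steps are skipped entirely
                # Assumption: a condition is a callable that receives the document.
                if entry["condition"] is not None and not entry["condition"](doc):
                    continue
                doc = entry["step"](doc)
            results.append(doc)
        return results


def add_flag(doc):
    doc["flagged"] = True
    return doc


def upper_case(doc):
    doc["text"] = doc["text"].upper()
    return doc


docs = [{"text": "short"}, {"text": "a much longer document"}]
pipeline = MiniPipeline(docs)
pipeline.add_step(upper_case)
pipeline.add_step(add_flag, condition=lambda d: len(d["text"]) > 10)
pipeline.add_step(add_flag, name="disabled-step", enabled=False)
output = pipeline.run()
```

Only the second document satisfies the condition, so only it receives the flag; the disabled step never runs.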
add_store(name, store)[source]

Add the store to the pipeline so that it is available to the pipeline

>>> pipeline = Pipeline(FolderConnector(path='/tmp/', file_filter='example.pdf'))
>>> pipeline.add_store("test-store", InMemoryObjectStore())
Parameters:
  • name – the name of the store (to refer to it)
  • store – the store that should be added
static from_file(file_path: str) → kodexa.pipeline.pipeline.Pipeline[source]

Create a new pipeline using a file path as a source

Parameters:file_path – The path to the file
Returns:A new pipeline
Return type:Pipeline

static from_folder(folder_path: str, filename_filter: str = '*', recursive: bool = False, relative: bool = False, caller_path: str = '/home/docs/checkouts/readthedocs.org/user_builds/kodexa-kodexa/checkouts/latest/docs', *args, **kwargs) → kodexa.pipeline.pipeline.Pipeline[source]

Create a pipeline that will run against a set of local files from a folder

Parameters:
  • folder_path – The folder path
  • filename_filter – The filter for filenames (e.g. *.pdf)
  • recursive – Should we look recursively in sub-directories (default False)
  • relative – Is the folder path relative to the caller (default False)
  • caller_path – The caller path (defaults to trying to work this out from the stack)
Returns:

A new pipeline

Return type:

Pipeline
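How filename_filter and recursive interact can be illustrated with the standard library alone. The sketch below is not how Kodexa selects files; list_matching_files is a hypothetical helper that assumes the filter is a shell-style pattern matched against file names.

```python
import fnmatch
import os
import tempfile


def list_matching_files(folder_path, filename_filter="*", recursive=False):
    """Hypothetical helper: return sorted paths under folder_path whose
    file name matches the shell-style pattern filename_filter."""
    matches = []
    for root, _dirs, files in os.walk(folder_path):
        for name in files:
            if fnmatch.fnmatch(name, filename_filter):
                matches.append(os.path.join(root, name))
        if not recursive:
            break  # only inspect the top-level folder
    return sorted(matches)


# Build a throwaway folder tree to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.pdf", "b.txt", os.path.join("sub", "c.pdf")):
        open(os.path.join(tmp, rel), "w").close()

    top_only = [
        os.path.relpath(p, tmp) for p in list_matching_files(tmp, "*.pdf")
    ]
    everything = [
        os.path.relpath(p, tmp)
        for p in list_matching_files(tmp, "*.pdf", recursive=True)
    ]
```

With recursive left at False only the top-level PDF is found; setting it to True also picks up the file in the sub-directory.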

static from_store(org_slug: str, slug: str, query: str = '*') → kodexa.pipeline.pipeline.Pipeline[source]

Create a pipeline that will run against the documents in a Kodexa platform store

Parameters:
  • org_slug – The organization’s slug
  • slug – The store slug
  • query – A query to be applied (defaults to *)
Returns:

A new pipeline

Return type:

Pipeline

static from_text(text: str) → kodexa.pipeline.pipeline.Pipeline[source]

Build a new pipeline and provide text as the basis to create a document

Parameters:text – Text to use to create document
Returns:A new pipeline
Return type:Pipeline
static from_url(url, headers=None)[source]

Build a new pipeline with the input being a document created from the given URL

Parameters:
  • url – The URL, e.g. https://www.google.com
  • headers – A dictionary of headers
Returns:

A new instance of a pipeline

run(parameters=None)[source]

Run the current pipeline. Note that you must have a sink in place to allow the pipeline to run.

>>> pipeline = Pipeline(FolderConnector(path='/tmp/', file_filter='example.pdf'))
>>> pipeline.set_sink(ExampleSink())
>>> pipeline.run()
Returns:The context from the run
set_sink(sink)[source]

Set the sink you wish to use. Note that it will replace any currently assigned sink.

>>> pipeline = Pipeline(FolderConnector(path='/tmp/', file_filter='example.pdf'))
>>> pipeline.set_sink(ExampleSink())
Parameters:sink – the sink for the pipeline
to_yaml()[source]

Will return the YAML representation of any actions that support conversion to YAML

The YAML representation for RemoteActions can be used for metadata-only pipelines in the Kodexa Platform

Returns:YAML representation
class PipelineContext(content_provider=None, store_provider=None, existing_content_objects=None, context=None)[source]

Pipeline context is created when you create a pipeline and it provides a way to access information about the pipeline that is running. It can be made available to steps/functions so they can interact with it.

It also provides access to the ‘stores’ that have been added to the pipeline

add_store(name: str, store)[source]

Add a store with given name to the context

Parameters:
  • name – the name to refer to the store with
  • store – the instance of the store
get_current_document() → kodexa.model.model.Document[source]

Get the current document that is being processed in the pipeline

Returns:The current document, or None
get_store(name: str, default: Optional[kodexa.model.model.Store] = None) → kodexa.model.model.Store[source]

Get a store with given name from the context

Parameters:
  • name – the name to refer to the store with
  • default – optionally, the store to create under this name if it isn’t there
Returns:

the store, or None if not available

get_store_names() → collections.abc.KeysView[source]

Return the list of store names in context

Returns:the list of store names
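The store-related methods of the context can be pictured as a thin wrapper around a dictionary. This is a sketch under stated assumptions, not the library's code: MiniContext is hypothetical, and it assumes that a supplied default is registered under the requested name when no store exists yet.

```python
class MiniContext:
    """Hypothetical context holding named stores in a plain dictionary."""

    def __init__(self):
        self._stores = {}

    def add_store(self, name, store):
        self._stores[name] = store

    def get_store(self, name, default=None):
        # Assumption: a supplied default is registered under the name when
        # no store exists yet, so later lookups return the same object.
        if name not in self._stores and default is not None:
            self._stores[name] = default
        return self._stores.get(name)

    def get_store_names(self):
        return self._stores.keys()


context = MiniContext()
missing = context.get_store("prices")  # nothing registered, no default
created = context.get_store("prices", default=[])  # default becomes the store
created.append({"sku": "A1", "price": 3})
same = context.get_store("prices")
names = list(context.get_store_names())
```

A step that calls get_store with a default can therefore rely on the store being present without checking whether an earlier step already created it.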
set_current_document(current_document: kodexa.model.model.Document)[source]

Set the Document that is currently being processed in the pipeline

Parameters:current_document – The current document
set_output_document(output_document: kodexa.model.model.Document)[source]

Set the output document from the pipeline

Parameters:output_document – the final output document from the pipeline
Returns:the final output document
class PipelineStatistics[source]

A set of statistics for the processed document

documents_processed document_exceptions

processed_document(document)[source]

Update statistics based on this document completing processing

Parameters:document – the document that has been processed
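A rough picture of how such counters could be maintained, and how they relate to stop_on_exception, is sketched below. MiniStatistics and run_all are hypothetical names, documents are plain dictionaries, and the rule that a document counts as failed when it carries an 'error' key is an assumption made for this example only.

```python
class MiniStatistics:
    """Hypothetical counters mirroring the two attribute names listed above."""

    def __init__(self):
        self.documents_processed = 0
        self.document_exceptions = 0

    def processed_document(self, document):
        # Assumption: a document counts as failed when it carries an 'error' key.
        self.documents_processed += 1
        if document.get("error") is not None:
            self.document_exceptions += 1


def run_all(documents, step, stop_on_exception=True):
    """Apply one step to every document and collect statistics."""
    stats = MiniStatistics()
    for doc in documents:
        try:
            doc = step(doc)
        except Exception as exc:
            if stop_on_exception:
                raise  # abort the whole run on the first failure
            doc = dict(doc, error=str(exc))  # record the failure and continue
        stats.processed_document(doc)
    return stats


def fragile_step(doc):
    if not doc["text"]:
        raise ValueError("empty document")
    return doc


stats = run_all(
    [{"text": "ok"}, {"text": ""}, {"text": "fine"}],
    fragile_step,
    stop_on_exception=False,
)
```

With stop_on_exception=False all three documents are counted and one of them is recorded as an exception; with the default of True the same input would raise on the second document.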

Sinks

Sinks are the end-point of a Pipeline and allow for the final output of the pipeline to be either stored or written out

class InMemoryDocumentSink[source]

An in-memory document sink can be used for testing, where you want to capture a set of documents as a basic in-memory list and then access them

get_document(index)[source]

Get document at given index

Parameters:index – index to get the document at
sink(document)[source]

Adds the document to the sink

Parameters:document – document to add
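The behaviour described here amounts to appending to a list and reading back by position. A minimal stand-in, with the hypothetical name ListSink and plain dictionaries in place of documents, might look like this:

```python
class ListSink:
    """Hypothetical sink that keeps every received document in a list."""

    def __init__(self):
        self.documents = []

    def sink(self, document):
        # Documents are stored in arrival order.
        self.documents.append(document)

    def get_document(self, index):
        return self.documents[index]


sink = ListSink()
for text in ("first", "second"):
    sink.sink({"text": text})

second = sink.get_document(1)
```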

Stores

Stores are persistence components for Documents. Typically, they can act as either a Connector or a Sink

class JsonDocumentStore(store_path: str, force_initialize: bool = False)[source]

An implementation of a document store that uses JSON files to store the documents and maintains an index.idx file containing basic information about the documents

add(document: kodexa.model.model.Document)[source]

Add a new document and return the index position

Returns:The index of the document added
count()[source]

The number of documents in the store

Returns:The number of documents
delete(idx: int)[source]

Delete the document at the given index

Returns:The Document that was removed
get(idx: int)[source]

Load the document at the given index

Returns:Document at given index
get_document(index: int)[source]

Gets the document from the specific index

Parameters:index – index of document to get
Returns:the document
load(document_id: str)[source]

Loads the document with the given document ID

Returns:the document

read_index()[source]

Method to read the document index from the store path

reset_connector()[source]

Reset the index back to the beginning

save_index()[source]

Method to write the JSON store index back to store path
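The general idea of a JSON-file store with a separate index can be sketched as follows. TinyJsonStore is hypothetical and stores plain dictionaries; in particular, the index layout used here (a JSON list of document IDs) is an assumption and may differ from what the real index.idx contains.

```python
import json
import os
import tempfile
import uuid


class TinyJsonStore:
    """Hypothetical store: one JSON file per document plus an index file."""

    def __init__(self, store_path):
        self.store_path = store_path
        self.index_path = os.path.join(store_path, "index.idx")
        os.makedirs(store_path, exist_ok=True)
        self.index = []
        self.read_index()

    def read_index(self):
        if os.path.exists(self.index_path):
            with open(self.index_path, encoding="utf-8") as handle:
                self.index = json.load(handle)

    def save_index(self):
        with open(self.index_path, "w", encoding="utf-8") as handle:
            json.dump(self.index, handle)

    def _document_path(self, document_id):
        return os.path.join(self.store_path, document_id + ".json")

    def add(self, document):
        document_id = str(uuid.uuid4())
        with open(self._document_path(document_id), "w", encoding="utf-8") as handle:
            json.dump(document, handle)
        self.index.append(document_id)
        self.save_index()
        return len(self.index) - 1  # index position of the new document

    def load(self, document_id):
        with open(self._document_path(document_id), encoding="utf-8") as handle:
            return json.load(handle)

    def get(self, idx):
        return self.load(self.index[idx])

    def count(self):
        return len(self.index)


with tempfile.TemporaryDirectory() as tmp:
    store = TinyJsonStore(tmp)
    position = store.add({"text": "hello"})
    reopened = TinyJsonStore(tmp)  # a fresh instance reads the saved index
    reloaded = reopened.get(position)
    total = reopened.count()
```

Keeping the index in its own file lets a new instance find documents by position without opening every JSON file.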

class TableDataStore(columns=None, rows=None, source_documents=None)[source]

Stores data as a list of lists that can represent a table.

This is a good store when you are capturing nested or tabular data.

Parameters:
  • columns – a list of the column names (default to dynamic)
  • rows – initial set of rows (default to empty)
  • source_documents – initial dictionary of document UUID to row links (default to empty)
add(row)[source]

Writes a row to the Data Store

Parameters:row – the row (as a list) to add
count()[source]

Returns the number of rows in the store

Returns:number of rows
merge(other_store)[source]

Merge another table store into this store

Parameters:other_store – the store to merge into this one
Returns:the other store
to_dict()[source]

Create a dictionary representing this TableDataStore’s structure and content.

>>> table_data_store.to_dict()
Returns:The properties of this TableDataStore structured as a dictionary.
Return type:dict
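A list-of-lists table with add, count, merge and to_dict can be sketched in a few lines. MiniTable is a hypothetical stand-in; the dictionary layout it produces (a 'type' key plus a 'data' section) and the merge behaviour (rows appended, identical columns assumed) are assumptions for illustration.

```python
class MiniTable:
    """Hypothetical table store: column names plus a list of row lists."""

    def __init__(self, columns=None, rows=None):
        self.columns = list(columns) if columns is not None else []
        self.rows = [list(row) for row in rows] if rows is not None else []

    def add(self, row):
        self.rows.append(list(row))

    def count(self):
        return len(self.rows)

    def merge(self, other_store):
        # Assumption: both stores share the same column layout.
        self.rows.extend(other_store.rows)

    def to_dict(self):
        # Assumed layout; the real to_dict() output may be structured differently.
        return {
            "type": "table",
            "data": {"columns": self.columns, "rows": self.rows},
        }


table = MiniTable(columns=["name", "qty"])
table.add(["apple", 3])

other = MiniTable(columns=["name", "qty"], rows=[["pear", 5]])
table.merge(other)

snapshot = table.to_dict()
```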
class DictDataStore(dicts=None)[source]

Stores data as a list of dictionaries that can be any structure

This is a good store when you are capturing nested or semi-structured data

add(dict)[source]

Writes a dict to the Data Store

Parameters:dict – the dict to add to the store
count()[source]

Returns the number of dictionaries in the store

Returns:number of dictionaries
merge(other_store)[source]

Merge another store into this store

Parameters:other_store – the store to merge into this one
Returns:the other store
to_dict()[source]

Create a dictionary representing this DictDataStore’s structure and content.

>>> this_dictionary.to_dict()
Returns:The properties of this DictDataStore structured as a dictionary.
Return type:dict
class DataStoreHelper[source]

A small helper that can convert a dictionary back into a store type

static from_dict(dict)[source]

Build a new TableDataStore or DictDataStore from a dictionary.

>>> DataStoreHelper.from_dict(doc_dict)
Parameters:doc_dict (dict) – A dictionary representation of a TableDataStore or DictDataStore.
Returns:A TableDataStore or DictDataStore, driven by ‘type’ in doc_dict. If ‘type’ is not present or does not align with one of these two types, None is returned.
Return type:TableDataStore, DictDataStore, or None
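Dispatching on a 'type' key, with None as the fallback, can be sketched as below. The stand-in classes, the function name store_from_dict, the type values 'table' and 'dictionary', and the 'data' layout are all assumptions chosen for the example rather than the values Kodexa uses.

```python
from dataclasses import dataclass, field


@dataclass
class TableStandIn:
    """Hypothetical stand-in for a table-shaped store."""
    columns: list = field(default_factory=list)
    rows: list = field(default_factory=list)


@dataclass
class DictStandIn:
    """Hypothetical stand-in for a list-of-dictionaries store."""
    dicts: list = field(default_factory=list)


def store_from_dict(store_dict):
    """Pick a store class from the 'type' key; return None if unrecognized."""
    store_type = store_dict.get("type")
    data = store_dict.get("data", {})
    if store_type == "table":  # assumed type value
        return TableStandIn(data.get("columns", []), data.get("rows", []))
    if store_type == "dictionary":  # assumed type value
        return DictStandIn(data.get("dicts", []))
    return None


table = store_from_dict(
    {"type": "table", "data": {"columns": ["a"], "rows": [[1]]}}
)
records = store_from_dict({"type": "dictionary", "data": {"dicts": [{"a": 1}]}})
unknown = store_from_dict({"type": "something-else"})
untyped = store_from_dict({})
```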

Steps

Common and reusable steps

class NodeTagger(selector, tag_to_apply, content_re='.*', use_all_content=True, node_only=False)[source]

A node tagger allows you to provide a type and content regular expression and then tag content in all matching nodes.

It allows multiple matching groups to be defined. It can also use all of a node’s content, or tag only the node itself (ignoring the matching groups).
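The difference between tagging all of a node's content and tagging individual matching groups can be shown with Python's re module. find_tag_spans is a hypothetical helper, and the semantics it encodes (whole content tagged when the expression matches from the start, otherwise one span per matching group) are an assumption about the behaviour described above.

```python
import re


def find_tag_spans(content, content_re=".*", use_all_content=True):
    """Hypothetical helper: return (start, end, text) spans that would be tagged."""
    pattern = re.compile(content_re)
    if use_all_content:
        # Assumption: the whole content is tagged when the expression matches.
        if pattern.match(content):
            return [(0, len(content), content)]
        return []
    spans = []
    for match in pattern.finditer(content):
        # Each capturing group that took part in the match yields one span.
        for group_index in range(1, pattern.groups + 1):
            if match.group(group_index) is not None:
                spans.append(
                    (
                        match.start(group_index),
                        match.end(group_index),
                        match.group(group_index),
                    )
                )
    return spans


content = "Invoice 1234 due 2020-01-31"
whole = find_tag_spans(content, r"Invoice.*")
groups = find_tag_spans(
    content, r"Invoice (\d+) due (\S+)", use_all_content=False
)
```

Here whole holds a single span covering the full string, while groups holds one span for the invoice number and one for the date.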

class TextParser(decode=False, encoding='utf-8')[source]

The text parser loads a source file as a text document and creates a single content node with the text

class RollupTransformer(collapse_type_res=None, reindex: bool = True, selector: str = '.', separator_character: str = None, get_all_content: bool = False)[source]

The rollup step allows you to decide how you want to collapse content in a document by removing nodes while maintaining content and features as needed
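Collapsing nodes while keeping their content can be pictured on a small tree of dictionaries. The rollup function below is a hypothetical illustration, not the transformer's algorithm: it assumes that nodes whose type matches one of the given regular expressions are removed, that their content is joined onto the parent with a separator, and that their children move up a level.

```python
import re


def rollup(node, collapse_type_res, separator=" "):
    """Hypothetical rollup: remove children whose type matches one of the
    patterns and fold their content into the parent."""
    kept = []
    parts = [node["content"]] if node.get("content") else []
    for child in node.get("children", []):
        child = rollup(child, collapse_type_res, separator)
        if any(re.fullmatch(p, child["type"]) for p in collapse_type_res):
            if child["content"]:
                parts.append(child["content"])
            kept.extend(child["children"])  # grandchildren move up a level
        else:
            kept.append(child)
    return {
        "type": node["type"],
        "content": separator.join(parts),
        "children": kept,
    }


tree = {
    "type": "page",
    "content": "",
    "children": [
        {
            "type": "line",
            "content": "Hello",
            "children": [{"type": "word", "content": "big", "children": []}],
        },
        {"type": "line", "content": "world", "children": []},
    ],
}

words_only = rollup(tree, ["word"])
everything = rollup(tree, ["line", "word"])
```

Collapsing only word nodes leaves two line nodes with merged text; collapsing both types leaves a single page node holding all of the content.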

class TagsToKeyValuePairExtractor(store_name, include=[], exclude=[], include_node_content=True)[source]

Extract all the tags from a document into a key/value pair table store