CSV

Module `csv-enricher`

This plugin applies glossary terms, tags, owners, and domains at the entity level. It can also apply tags and glossary terms at the column level. The values are read from a CSV file and can either overwrite the existing aspects on an entity or be appended to them.

The CSV must use the following format, shown here with a few example rows.

resource	subresource	glossary_terms	tags	owners	ownership_type	description	domain
urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)		[urn:li:glossaryTerm:AccountBalance]	[urn:li:tag:Legacy]	[urn:li:corpuser:datahub|urn:li:corpuser:jdoe]	TECHNICAL_OWNER	new description	urn:li:domain:Engineering
urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)	field_foo	[urn:li:glossaryTerm:AccountBalance]				field_foo!
urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)	field_bar		[urn:li:tag:Legacy]			field_bar?

Note that the first row does not have a subresource populated, so its glossary terms, tags, and owners are applied at the entity level. If a subresource is populated (as in the second and third rows), glossary terms and tags are applied to that subresource. Every row MUST have a resource. Owners can only be applied at the resource level and are ignored on any row that has a subresource.
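The row rules above can be illustrated with a small Python sketch. This is not the plugin's actual implementation: `split_array` and `plan_rows` are hypothetical helper names, and the sample text assumes the default `,` delimiter, which means the dataset URN (it contains commas) has to be quoted.

```python
import csv
import io

# Two of the example rows, written with the default "," delimiter.
# The dataset URN contains commas, so it is wrapped in double quotes.
SAMPLE = (
    "resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain\n"
    '"urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)",,'
    "[urn:li:glossaryTerm:AccountBalance],[urn:li:tag:Legacy],"
    "[urn:li:corpuser:datahub|urn:li:corpuser:jdoe],TECHNICAL_OWNER,"
    "new description,urn:li:domain:Engineering\n"
    '"urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)",field_foo,'
    "[urn:li:glossaryTerm:AccountBalance],,,,field_foo!,\n"
)


def split_array(cell, array_delimiter="|"):
    """Turn "[a|b]" into ["a", "b"]; an empty cell becomes []."""
    inner = cell.strip().strip("[]")
    return [item.strip() for item in inner.split(array_delimiter) if item.strip()]


def plan_rows(text, delimiter=",", array_delimiter="|"):
    """Hypothetical helper: describe what each CSV row would apply, and where."""
    plans = []
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        if not row["resource"]:
            raise ValueError("every row must have a resource")
        column_level = bool(row["subresource"])
        plans.append({
            "resource": row["resource"],
            "level": "column" if column_level else "entity",
            "subresource": row["subresource"] or None,
            "glossary_terms": split_array(row["glossary_terms"], array_delimiter),
            "tags": split_array(row["tags"], array_delimiter),
            # Owners are ignored on rows that have a subresource.
            "owners": [] if column_level
            else split_array(row["owners"], array_delimiter),
        })
    return plans


for plan in plan_rows(SAMPLE):
    print(plan["level"], plan["subresource"], plan["tags"], plan["owners"])
```

The first row resolves to an entity-level plan with two owners; the second resolves to a column-level plan for `field_foo` with no owners.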

CLI based Ingestion

Install the Plugin

The csv-enricher source works out of the box with acryl-datahub.

Config Details


Note that a `.` is used to denote nested fields in the YAML recipe.


Field [Required]	Type	Description	Default
array_delimiter	string	Delimiter to use when parsing array fields (tags, terms and owners)	|
delimiter	string	Delimiter to use when parsing CSV	,
filename [✅]	string	Path to CSV file to ingest. It can also be in the form of a URL.	None
write_semantics	string	Whether the new tags, terms and owners to be added will override the existing ones added only by this source or not. Value for this config can be "PATCH" or "OVERRIDE"	PATCH

The JSONSchema for this configuration is inlined below.

{
  "title": "CSVEnricherConfig",
  "type": "object",
  "properties": {
    "filename": {
      "title": "Filename",
      "description": "Path to CSV file to ingest. It can also be in the form of a URL.",
      "type": "string"
    },
    "write_semantics": {
      "title": "Write Semantics",
      "description": "Whether the new tags, terms and owners to be added will override the existing ones added only by this source or not. Value for this config can be \"PATCH\" or \"OVERRIDE\"",
      "default": "PATCH",
      "type": "string"
    },
    "delimiter": {
      "title": "Delimiter",
      "description": "Delimiter to use when parsing CSV",
      "default": ",",
      "type": "string"
    },
    "array_delimiter": {
      "title": "Array Delimiter",
      "description": "Delimiter to use when parsing array fields (tags, terms and owners)",
      "default": "|",
      "type": "string"
    }
  },
  "required": [
    "filename"
  ],
  "additionalProperties": false
}
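As a rough illustration of how these options fit together, the sketch below assembles a recipe as a Python dictionary that mirrors the YAML recipe layout. `build_recipe` is a hypothetical helper, the file path is a placeholder, and the `datahub-rest` sink pointing at `http://localhost:8080` is an assumption; substitute whatever sink your deployment uses.

```python
def build_recipe(filename, write_semantics="PATCH", delimiter=",", array_delimiter="|"):
    """Hypothetical helper: build a csv-enricher recipe using the schema defaults."""
    if write_semantics not in ("PATCH", "OVERRIDE"):
        raise ValueError('write_semantics must be "PATCH" or "OVERRIDE"')
    return {
        "source": {
            "type": "csv-enricher",
            "config": {
                # The only required option.
                "filename": filename,
                # Optional options, shown with their documented defaults.
                "write_semantics": write_semantics,
                "delimiter": delimiter,
                "array_delimiter": array_delimiter,
            },
        },
        # Assumption: a DataHub REST sink on localhost; replace with your own sink.
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }


# Placeholder path; point this at your own CSV file or URL.
recipe = build_recipe("./path/to/your/file.csv")
print(recipe["source"]["config"])
```

With `acryl-datahub` installed, a dictionary of this shape can typically be handed to `Pipeline.create(recipe)` from `datahub.ingestion.run.pipeline` and executed with `run()`; the same structure written as YAML is what the CLI consumes.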

Code Coordinates

Class Name: datahub.ingestion.source.csv_enricher.CSVEnricherSource
Browse on GitHub

Questions

If you have any questions about configuring ingestion for CSV, feel free to ping us on our Slack.
