## Data Profiler

The Data Profiler workflow uses the `orm-profiler` processor.

After running a Metadata Ingestion workflow, we can run the Data Profiler workflow.
The `serviceName` will be the same as the one used in Metadata Ingestion, so the ingestion bot can get the `serviceConnection` details from the server.

### 1. Define the YAML Config

This is a sample config for the profiler:

{% codePreview %}

{% codeInfoContainer %}

{% codeInfo srNumber=13 %}

#### Source Configuration - Source Config

You can find all the definitions and types for the `sourceConfig` [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceProfilerPipeline.json).

**generateSampleData**: Option to turn on/off generating sample data.

{% /codeInfo %}

{% codeInfo srNumber=14 %}

**profileSample**: Percentage of data or number of rows on which to execute the profiler and tests.

{% /codeInfo %}

{% codeInfo srNumber=15 %}

**threadCount**: Number of threads to use during metric computations.

{% /codeInfo %}

{% codeInfo srNumber=16 %}

**processPiiSensitive**: Optional configuration to automatically tag columns that might contain sensitive information.

{% /codeInfo %}

{% codeInfo srNumber=17 %}

**confidence**: Confidence value at which you want a column to be marked as sensitive.

{% /codeInfo %}

{% codeInfo srNumber=18 %}

**timeoutSeconds**: Profiler timeout in seconds.

{% /codeInfo %}

{% codeInfo srNumber=19 %}

**databaseFilterPattern**: Regex to only fetch databases that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=20 %}

**schemaFilterPattern**: Regex to only fetch schemas that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=21 %}

**tableFilterPattern**: Regex to only fetch tables that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=22 %}

#### Processor Configuration

Choose the `orm-profiler`.
Its config can also be updated to define tests from the YAML itself instead of the UI:

**tableConfig**: `tableConfig` allows you to set up some configuration at the table level.

{% /codeInfo %}

{% codeInfo srNumber=23 %}

#### Sink Configuration

To send the metadata to OpenMetadata, it needs to be specified as `type: metadata-rest`.

{% /codeInfo %}

{% partial file="/v1.5/connectors/yaml/workflow-config-def.md" /%}

{% /codeInfoContainer %}

{% codeBlock fileName="filename.yaml" %}

```yaml {% isCodeBlock=true %}
source:
  type: {% $connector %}
  serviceName: local_athena
  sourceConfig:
    config:
      type: Profiler
```
```yaml {% srNumber=13 %}
      generateSampleData: true
```
```yaml {% srNumber=14 %}
      # profileSample: 85
```
```yaml {% srNumber=15 %}
      # threadCount: 5
```
```yaml {% srNumber=16 %}
      processPiiSensitive: false
```
```yaml {% srNumber=17 %}
      # confidence: 80
```
```yaml {% srNumber=18 %}
      # timeoutSeconds: 43200
```
```yaml {% srNumber=19 %}
      # databaseFilterPattern:
      #   includes:
      #     - database1
      #     - database2
      #   excludes:
      #     - database3
      #     - database4
```
```yaml {% srNumber=20 %}
      # schemaFilterPattern:
      #   includes:
      #     - schema1
      #     - schema2
      #   excludes:
      #     - schema3
      #     - schema4
```
```yaml {% srNumber=21 %}
      # tableFilterPattern:
      #   includes:
      #     - table1
      #     - table2
      #   excludes:
      #     - table3
      #     - table4
```

```yaml {% srNumber=22 %}
processor:
  type: orm-profiler
  config: {}  # Remove braces if adding properties
    # tableConfig:
    #   - fullyQualifiedName:
    #     profileSample: # default will be 100 if omitted
    #     profileQuery:
    #     columnConfig:
    #       excludeColumns:
    #         -
    #       includeColumns:
    #         - columnName:
    #           metrics:
    #             - MEAN
    #             - MEDIAN
    #             - ...
    #     partitionConfig:
    #       enablePartitioning:
    #       partitionColumnName:
    #       partitionIntervalType:
    #       Pick one of the variations shown below
    #       ----'TIME-UNIT' or 'INGESTION-TIME'-------
    #       partitionInterval:
    #       partitionIntervalUnit:
    #       ------------'INTEGER-RANGE'---------------
    #       partitionIntegerRangeStart:
    #       partitionIntegerRangeEnd:
    #       -----------'COLUMN-VALUE'----------------
    #       partitionValues:
    #         -
    #         -
```

```yaml {% srNumber=23 %}
sink:
  type: metadata-rest
  config: {}
```

{% partial file="/v1.5/connectors/yaml/workflow-config.md" /%}

{% /codeBlock %}

{% /codePreview %}

- You can learn more about how to configure and run the Profiler Workflow to extract Profiler data and execute Data Quality tests [here](/how-to-guides/data-quality-observability/profiler/workflow).

### 2. Run with the CLI

After saving the YAML config, we will run the command the same way we did for the metadata ingestion, passing the path to the saved YAML file to `-c`:

```bash
metadata profile -c
```

Note that instead of running `ingest`, we now use the `profile` command to select the Profiler workflow.

{% tilesContainer %}

{% tile title="Data Profiler" description="Find more information about the Data Profiler here" link="/how-to-guides/data-quality-observability/profiler/workflow" / %}

{% /tilesContainer %}
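
Before running the `profile` command, it can help to confirm that the config carries the three values step 1 calls out: `type: Profiler` in the `sourceConfig`, the `orm-profiler` processor, and the `metadata-rest` sink. The following is a minimal sketch of such a check, not part of OpenMetadata. The function name `check_profiler_config` is hypothetical, the config is assumed to be already parsed into a Python dict, and `athena` is only an assumed value for the connector type.

```python
# Hypothetical pre-flight check for a profiler workflow config.
# This helper is illustrative only; it is not an OpenMetadata API.

def check_profiler_config(config: dict) -> list:
    """Return a list of problems found in a parsed profiler workflow config."""
    problems = []

    source = config.get("source", {})
    # The serviceName links the profiler run to the already ingested service.
    if not source.get("serviceName"):
        problems.append("source.serviceName is missing")

    source_type = source.get("sourceConfig", {}).get("config", {}).get("type")
    if source_type != "Profiler":
        problems.append(
            f"source.sourceConfig.config.type should be 'Profiler', got {source_type!r}"
        )

    processor_type = config.get("processor", {}).get("type")
    if processor_type != "orm-profiler":
        problems.append(
            f"processor.type should be 'orm-profiler', got {processor_type!r}"
        )

    sink_type = config.get("sink", {}).get("type")
    if sink_type != "metadata-rest":
        problems.append(f"sink.type should be 'metadata-rest', got {sink_type!r}")

    return problems


# Mirrors the sample YAML from step 1 as a YAML parser would load it.
sample = {
    "source": {
        "type": "athena",  # assumed connector type, for illustration
        "serviceName": "local_athena",
        "sourceConfig": {
            "config": {"type": "Profiler", "generateSampleData": True}
        },
    },
    "processor": {"type": "orm-profiler", "config": {}},
    "sink": {"type": "metadata-rest", "config": {}},
}

print(check_profiler_config(sample))  # prints []
```

An empty list means the three values are in place. The check does not validate the rest of the config against the JSON schema linked in step 1.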