Purpose
Generate a reference data source to be used by a normalization task. The task analyzes any attribute and groups similar values together. The output is a reference Data Source with these attributes:
- Normalized value
- Alias values
Field Description
- Generate normalization reference data for this attribute – Select the attribute for which we need to generate reference data.
- Advanced configuration – These parameters control how similar values are identified:
- Degree of fuzziness – Controls how loosely values are fuzzy matched. The range is 0.1 to 1, where 0.1 is the maximum fuzziness.
- Percentage of leading text that must match – The percentage of text at the start of two values that must match for them to be treated as similar.
- Ignore if characters less than – Values with fewer characters than the number specified in this field are not checked for similarity.
Tips
- The cleaner the source data, the better the result. When generating company name reference data, use the Company Name Clean Up task to pre-clean the data first.
- Identification of similar values is based on a combination of these algorithms:
- Fuzzy matching based on your parameters
- Values that begin with identical words
- Values that are over 90% similar
- The matching algorithm is not case sensitive.
- The matching algorithm ignores short words. The threshold is configurable with a default of 3.
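The matching rules in these tips can be illustrated with a short Python sketch. It is not the product's actual implementation. The function names, the use of `difflib.SequenceMatcher` as the similarity measure, and the reading of "short words" as words with fewer than 3 characters are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def significant_words(value, min_word_len=3):
    """Lowercase the value and drop words shorter than min_word_len (assumed rule)."""
    return [w for w in value.lower().split() if len(w) >= min_word_len]


def is_similar(a, b, min_word_len=3, min_ratio=0.9):
    """Hypothetical check: same first word, or overall similarity above min_ratio."""
    words_a = significant_words(a, min_word_len)
    words_b = significant_words(b, min_word_len)
    if not words_a or not words_b:
        # Nothing left to compare once short words are ignored.
        return False
    if words_a[0] == words_b[0]:
        # Values that begin with an identical word.
        return True
    # Case-insensitive similarity over the remaining words.
    ratio = SequenceMatcher(None, " ".join(words_a), " ".join(words_b)).ratio()
    return ratio > min_ratio


print(is_similar("Toyota", "TOYOTA Motor Sales"))
print(is_similar("Toyota", "Honda"))
```

Because values are lowercased before comparison, "Toyota" and "TOYOTA Motor Sales" are treated as similar, while "Toyota" and "Honda" are not.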
Examples
- Generate reference data for normalizing company names.
- Primary value = Toyota
- Aliases = Toyota motor, Toyota motor sales, Toyota usa, Toyota financial services
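To show how reference data of this shape could be used, the following Python sketch builds a lookup from the Toyota example and applies it to incoming values. The dictionary layout and the `build_lookup` and `normalize` helpers are hypothetical. They only illustrate the idea of mapping alias values to a normalized value and do not describe how the normalization task works internally.

```python
# Reference data from the example: normalized value -> alias values.
reference = {
    "Toyota": [
        "Toyota motor",
        "Toyota motor sales",
        "Toyota usa",
        "Toyota financial services",
    ],
}


def build_lookup(reference):
    """Map every alias (and the normalized value itself) to the normalized value."""
    lookup = {}
    for normalized, aliases in reference.items():
        lookup[normalized.lower()] = normalized
        for alias in aliases:
            lookup[alias.lower()] = normalized
    return lookup


def normalize(value, lookup):
    """Return the normalized value if known; otherwise return the input unchanged."""
    return lookup.get(value.strip().lower(), value)


lookup = build_lookup(reference)
print(normalize("TOYOTA USA", lookup))
print(normalize("Honda", lookup))
```

In this sketch, values that are not in the reference data pass through unchanged, so only known aliases are rewritten.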
Support Contacts
If you have any additional questions, please feel free to contact us at help@openprisetech.com.