1C:Enterprise 8.3. Developer Guide. Contents
DATA ANALYSIS AND FORECASTING
The data analysis and forecasting mechanism is intended for implementing tools that discover dependencies usually hidden behind large amounts of data. For instance, one can analyze sales data and identify groups of goods that are usually purchased together. This information can then be used, among other things, to arrange goods in a retail store. Goods may be placed together (a buyer who comes to the store sees a barbecue next to lighter fluid, charcoal, meat, fishing tackle, and a rubber boat, and purchases all of it) or in different areas of the store (a buyer who comes for milk will have walked through half of the store by the time he or she reaches the bread section).
Another use of the data analysis mechanism is forecasting a contractor’s behavior based on the existing data. By completing such an analysis one can find out to what degree their purchasing volumes depend on territorial distribution, the size of the company, period of cooperation, and other factors. Based on these dependencies, a new contractor’s behavior may be forecast, and a strategy of cooperation developed.
Use the forecasting functionality to plan your purchasing campaign. For instance: last month, a pet store sold 100 guinea pigs. The store must plan purchasing volumes for the next month. One of the most widely used ways to do this is to apply an adjustment factor to past sales periods. For instance, the adjustment factor (demand increase factor) is 1.5. Therefore, it is reasonable to plan to purchase 150 guinea pigs for the next month. However, if we analyze what customers buy after they buy such a pet, other conclusions can be drawn. By using data analysis and forecasting features one can see that pet food, litter, hay, and other "accessories" should be purchased.
It should be noted that this chapter mostly reviews 1C:Enterprise mechanisms and only briefly mentions ways to use the information obtained through simple examples.
14.1. OVERVIEW
A general overview of the data analysis and forecasting mechanism can be presented as follows:
Fig. 279. Interaction of the data analysis mechanism elements
The mechanism works with infobase data as well as with data from other sources that has been preloaded into a value table or a spreadsheet document.
Apply one of the analysis types to the source data to obtain an analysis result. The analysis result is a model of the data’s behavior; it can be displayed in a final document or stored for further use.
The analysis results can be used to create a forecast model to forecast new data behavior, in line with the existing model.
For instance, one can analyze which goods are purchased together (on a single invoice) and store the analysis-based forecasting model in the database. In the future, when an invoice is created, a previously stored forecasting model can be retrieved from the database. One can input new data from an invoice and receive a forecast as an output, i.e. a list of goods that the next customer will probably buy (with a certain degree of likelihood) if these goods are offered at a store.
14.1.1. Main Mechanism Objects
Interaction between the main objects of the data analysis and forecasting mechanism can be shown as follows:
Fig. 280. Interconnection of main objects
Data analysis is the object that performs the analysis itself. A data source is set for this object, and analysis parameters and column settings are specified. Executing the analysis produces a data analysis result, and each type of analysis has its own object type for working with the result:
DataAnalysisSummaryStatisticsResult
DataAnalysisAssociationRulesResult
DataAnalysisSequentialPatternsResult
DataAnalysisDecisionTreeResult
DataAnalysisClusterizationResult
Data analysis column settings – a collection of data analysis input columns. A data type, a column role, and additional settings that depend on the type of analysis performed are specified for each column.
Data analysis parameters – a set of parameters used in the analysis. The range of parameters depends on the selected type of analysis. For instance, cluster analysis requires the number of clusters to divide the source objects into, the measure of distance between objects, and so on.
Data source – source data for the analysis. A query result, spreadsheet document cell area, or a value table may function as a data source.
Data analysis result – a special object that contains information on the result of the analysis. Each type of analysis provides its own result. For instance, the result of a Decision tree analysis is an object of the DataAnalysisDecisionTreeResult type. The result can later be output to a spreadsheet document with the help of the data analysis report builder, accessed programmatically, or used to create a forecast model. Any data analysis result may be stored for further use.
Forecast model – a special object that executes forecasting on the basis of input data (a forecast selection, selection column settings, result settings, and an analysis result). The type of the forecast model depends on the type of the data analysis result. For instance, a model created for Association rules has the PredictionModelAssociationRules type. Such a model outputs forecasts of the following kind: since this customer has purchased a specified set of goods, he or she will buy another set of goods with a certain probability. The forecasting data source is the input of the forecast model. The result is a value table that contains forecast values.
Input column setup – a set of special objects that define the correspondence between forecast model columns and forecast selection columns. For instance, a forecast model column named Goods may match a selection column named Nomenclature.
Result columns setup – controls which columns are included in the resulting forecast table. For instance, for association rules the output can include the nomenclature item that the customer will probably buy and the probability of the purchase.
Result columns – a table of results that includes the columns specified in the result column settings and contains forecast data. The specific content depends on the type of analysis.
14.1.2. Types of Data Analysis
The data analysis and forecasting mechanism implements a number of data analysis types:
Summary statistics
Association rules
Sequential patterns
Cluster analysis
Decision tree
14.1.2.1. Summary statistics
The Summary statistics analysis type is a mechanism for gathering general information on the data from the obtained data source. This type of analysis is used to pre-analyze the information.
This analysis shows a number of characteristics of categorical and continuous fields. When a report is output to a spreadsheet document, pie charts are built to display the contents of the fields.
14.1.2.2. Association rules
This type of analysis searches for the groups of objects or characteristic values that usually go together, and searches for association rules. Association rules can be used to determine what goods and services are usually purchased together.
This type of analysis can be used with hierarchical data to find association rules both for specific goods and for groups of goods. An important feature of this type of analysis is its ability to work both with object-based data sources (where each column contains a certain characteristic of an object) and with event-based sources (where object characteristics are placed together in a single column).
14.1.2.3. Sequential patterns
The Sequential patterns analysis type helps reveal event sequences in a data source. For instance, this may be a chain of goods or services that are usually bought in sequence.
This type of analysis can search through a hierarchy, which makes it possible to track sequences not only of specific items but also of their parent groups.
14.1.2.4. Cluster analysis
Cluster analysis divides the source set of objects being analyzed into groups, so that each object resembles the objects in its own group more closely than the objects of other groups. By analyzing the resulting groups (called clusters), one can determine what characterizes a specific group and decide how to work with objects from different groups. For instance, you can use cluster analysis to divide customers into groups and apply a different customer relationship strategy to each group.
Cluster analysis parameters let you configure the clustering algorithm, dynamically change the set of characteristics taken into account by the analysis, and specify weighting factors for them.
Clusterization results may be output as a dendrogram, a special type of diagram for graphically representing cluster analysis results.
14.1.2.5. Decision tree
The Decision tree analysis type is a way to create a hierarchical structure of classifying rules presented as a tree.
To build a decision tree, select a target attribute to base the classifier on, as well as a number of input attributes to be used to create the rules. A target attribute may contain information on whether a customer switched to another service provider, whether a transaction was successful, whether the work was completed successfully, and so on. Possible input attributes include the age of an employee, his or her period of employment, the material wealth of the customer, the number of employees in the company, and so on.
Analysis results are presented as a tree, each node of which contains a condition. In order to decide which class a certain new object should be referred to, it is necessary to answer questions in the nodes and complete a chain of steps from the root to a leaf of the tree. A positive answer takes the user to the next sub-node, while a negative answer takes the user to a neighboring node.
A set of analysis parameters can be used to control the accuracy of the resulting tree.
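The walk from the root to a leaf described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the platform's implementation: the `Node`, `Leaf`, and `classify` names, the binary yes/no layout, and the tree itself are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    label: str  # class assigned to every object that reaches this leaf


@dataclass
class Node:
    question: Callable[[dict], bool]  # condition stored in the node
    if_yes: Union["Node", Leaf]       # where a positive answer leads
    if_no: Union["Node", Leaf]        # where a negative answer leads


def classify(tree, obj):
    """Answer the question in each node until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.if_yes if node.question(obj) else node.if_no
    return node.label


# Hypothetical tree, invented for illustration only.
tree = Node(
    question=lambda c: c["agreement_type"] == "Long-term partner",
    if_yes=Leaf("Not terminated"),
    if_no=Node(
        question=lambda c: c["retail_spots"] > 1,
        if_yes=Leaf("Terminated by contractor"),
        if_no=Leaf("Not terminated"),
    ),
)

new_customer = {"agreement_type": "Dealer", "retail_spots": 3}
predicted = classify(tree, new_customer)
```

Each new object answers one question per level, so the number of steps never exceeds the depth of the tree.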
14.1.3. Forecasting Models
Forecasting models created by the mechanism are special objects generated from data analysis results. They can later be used to automatically forecast new data.
For instance, an association rules forecasting model created when the customer purchases were analyzed can be used to work with a customer at a store, so as to offer the goods that he or she will probably buy in conjunction with the main purchase.
14.2. THE SUMMARY STATISTICS ANALYSIS TYPE
The Summary statistics analysis type can be used, among other things, for preliminary data analysis before any other type of analysis is performed.
A query result, a value table, or a cell area of a spreadsheet document can be used as a source of data for analysis.
From the point of view of this analysis, source data may be continuous or discrete. The Number and Date types are continuous. All other types are discrete.
Different information can be obtained for the columns of different types.
Discrete data:
Number of values – the number of values in a data source column (NULL is not considered a value).
Number of unique values – the number of distinct values (repeated values are counted once).
Mode – the most frequent value in the data source. If several values occur with the same frequency, the mode is the first such value found.
Frequency – the number of occurrences of a value in the data selection.
Relative frequency – the ratio of the number of occurrences of the value to the total number of values.
Accumulated frequency – the sum of the frequency of the value and the frequencies of all previous values in the data selection.
Accumulated relative frequency – the sum of the relative frequency of the value and the relative frequencies of all previous values.
Continuous data:
Number of values.
Minimum value.
Maximum value.
Average.
Range is the difference between the maximum and the minimum values.
Standard deviation (root-mean-square deviation).
Median – the value in the middle of the ordered selection.
It should be noted that if several fields of different types are analyzed simultaneously, they are analyzed independently (without any mutual correlation).
Let us review these characteristics using an example.
The data selection (the source of the analysis) looks as follows:
Nomenclature item | Quantity | Nomenclature item | Quantity
Folding dining table | 1 | Folding dining table | 1
Round stool | 2 | Square stool | 3
"Coziness" sofa | 1 | "COMFORT" armchair | 2
"Jeans" sofa | 1 | "COMFORT" armchair | 2
"Jeans" armchair | 2 | "Wardrobe" closet | 1
Kitchen table 0.9x1.7 | 1 | Folding dining table | 1
"COMFORT" sofa | 1 | Square stool | 2
Kitchen table 0.9x1.7 | 1 | Dining table | 1
"Summer" chair | 4 | "Summer" chair | 2
"COMFORT" sofa | 1 | Round stool | 2
The following characteristics are calculated for the Count field (the Quantity column above, which has the Continuous analysis data type):
Characteristic | Value
Number of values | 20
Minimum | 1
Maximum | 4
Average | 1.6
Range | 3
Standard deviation | 0.8208
Median | 1
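The values in the table above can be reproduced outside the platform as a cross-check. The following Python sketch is not 1C:Enterprise script; it simply recomputes the characteristics from the twenty quantities of the data selection using the standard `statistics` module. Note that 0.8208 corresponds to the sample standard deviation (divisor n − 1); the population formula would give 0.8.

```python
import statistics

# Quantities from the data selection: left column pair, then right column pair.
quantities = [1, 2, 1, 1, 2, 1, 1, 1, 4, 1,
              1, 3, 2, 2, 1, 1, 2, 1, 2, 2]

summary = {
    "Number of values": len(quantities),
    "Minimum": min(quantities),
    "Maximum": max(quantities),
    "Average": statistics.mean(quantities),
    "Range": max(quantities) - min(quantities),
    # Sample standard deviation (n - 1 in the denominator).
    "Standard deviation": round(statistics.stdev(quantities), 4),
    "Median": statistics.median(quantities),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```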
The following characteristics will be obtained for the Nomenclature field:
Characteristic | Value
Number of values | 20
Number of unique values | 12
Mode | Folding dining table
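The discrete characteristics can be recomputed in the same way. The sketch below is plain Python rather than platform code and rests on two assumptions: every folding table row of the selection refers to the single item Folding dining table (which is what the stated 12 unique values imply), and values are listed in order of first appearance, which affects only the accumulated columns.

```python
from collections import Counter

# Nomenclature values of the data selection, in document order.
items = [
    "Folding dining table", "Round stool", "Coziness sofa", "Jeans sofa",
    "Jeans armchair", "Kitchen table 0.9x1.7", "COMFORT sofa",
    "Kitchen table 0.9x1.7", "Summer chair", "COMFORT sofa",
    "Folding dining table", "Square stool", "COMFORT armchair",
    "COMFORT armchair", "Wardrobe closet", "Folding dining table",
    "Square stool", "Dining table", "Summer chair", "Round stool",
]

counts = Counter(items)            # keeps the order of first appearance
number_of_values = len(items)
number_of_unique_values = len(counts)
# max() returns the first value with the highest frequency.
mode = max(counts, key=counts.get)

# Rows: value, frequency, relative, accumulated, accumulated relative.
frequency_table = []
accumulated = 0
for value, frequency in counts.items():
    accumulated += frequency
    frequency_table.append((
        value,
        frequency,
        frequency / number_of_values,
        accumulated,
        accumulated / number_of_values,
    ))
```

The last row of `frequency_table` always has an accumulated frequency equal to the number of values and an accumulated relative frequency of 1.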
A frequency table for the nomenclature values will look as follows:
Fig. 281. Frequency table
The relative frequency is shown in the diagram below.
Fig. 282. Frequency diagram
To perform this analysis, use a code fragment similar to the one below:
&AtClient
Procedure SummaryStatistics(Command)
    Result = AnalysisSummaryStatistics();
EndProcedure

&AtServerNoContext
Function AnalysisSummaryStatistics()
    // Create the analysis object and select the analysis type.
    Analysis = New DataAnalysis;
    Analysis.AnalysisType = Type("DataAnalysisSummaryStatistics");

    // The query result is the data source of the analysis.
    Query = New Query;
    Query.Text = "
    |SELECT
    |    Sales.Nomenclature,
    |    Sales.Count
    |FROM
    |    AccumulationRegister.Sales AS Sales";
    Analysis.DataSource = Query.Execute();

    AnalysisResult = Analysis.Execute();

    // Output the analysis result to a spreadsheet document.
    Builder = New DataAnalysisReportBuilder();
    Builder.Template = Undefined;
    Builder.AnalysisType = Type("DataAnalysisSummaryStatistics");
    Spreadsheet = New SpreadsheetDocument;
    Builder.Put(AnalysisResult, Spreadsheet);

    Return Spreadsheet;
EndFunction
Data analysis is performed in a server function executed without context, which returns a spreadsheet document containing the analysis results to the client. First, the DataAnalysis object is created. Next, the type of analysis to be performed is selected.
A query is then created and its text is defined. The query result is set as the source of the analysis data. The analysis itself is performed when the Execute() method of the DataAnalysis object is called. The DataAnalysis object has no tools for visualizing analysis results; the DataAnalysisReportBuilder object is used for this purpose. After this object is created, the type of analysis is specified again. Then the analysis result is passed as the first parameter of the Put() method, and the SpreadsheetDocument object created earlier is passed as the second parameter.
Finally, the spreadsheet document containing the analysis result is returned to the client and assigned to the Result data processor attribute of the SpreadsheetDocument type.
As a result, data similar to that analyzed above will be obtained.
14.3. THE ASSOCIATION RULES ANALYSIS TYPE
As already mentioned, this analysis type searches for combinations of objects or characteristic values that frequently occur together. It can be used to determine groups of goods that are usually purchased together, to identify the most attractive information sources (and optimize the costs related to these sources), and so on.
A schematic view of the Association rules analysis type is as follows:
Fig. 283. The association rules analysis type: execution diagram
A query result, spreadsheet document cell area, or a value table may function as a data source. From the point of view of this type of analysis, source columns may be divided as follows:
NotUsed – ignored by the analysis.
Object – data from this column is used as the objects (or events) of the analysis. Values of the Item column that share the same value in this column are treated as one associated group.
Item – data from this column is used to obtain stable groups of values and create association rules.
The following analysis parameters impact the analysis result, together with column types settings:
MinSupport – determines the minimum percentage of cases when a certain combination of elements should occur. The groups where this value is less than the specified value are not included into the analysis result.
MinConfidence – shows the minimum percentage of cases when the rule is followed.
MinImportance – groups with a value less than the specified value are not included into the analysis result.
PruneRulesType – one of the variants of the AssociationRulesPruneType system enumeration:
○ Redundant – redundant rules are pruned.
○ Covered – the rules covered by other rules are pruned.
The result of analysis is as follows:
Information on the data (number of objects, number of items, average number of items in an object, number of groups found, number of association rules found).
Groups of items found – the contents of the group, the number of cases, and the percentage of cases where this group occurs is specified.
Association rules detected – a source set of elements, consequent (structure of elements), percentage of cases, confidence, and importance of the rule are specified.
Let us review the peculiarities of this type of analysis with the following data selection (we will try to determine a standard set of goods usually purchased together):
Recorder | Nomenclature
Sales invoice No. 000000001 | Folding dining table, Round stool
Sales invoice No. 000000002 | "Coziness" sofa
Sales invoice No. 000000003 | "Jeans" sofa, "Jeans" armchair
Sales invoice No. 000000005 | Kitchen table 0.9x1.7, "COMFORT" sofa
Sales invoice No. 000000004 | Kitchen table 0.9x1.7, "Summer" chair, "COMFORT" sofa
Sales invoice No. 000000006 | Folding dining table, Square stool
Sales invoice No. 000000007 | "COMFORT" armchair
Sales invoice No. 000000008 | "COMFORT" armchair
Sales invoice No. 000000009 | "Wardrobe" closet
Sales invoice No. 000000010 | Folding dining table, Square stool, Dining table
Sales invoice No. 000000011 | "Summer" chair, Round stool
The attribute that relates data to one group is the recorder value (nomenclature items specified in one document are considered to have been purchased together). This means that the Recorder column is the object of the analysis and the Nomenclature column is the item of the analysis.
The following code fragment will be used for analysis:
&AtClient
Procedure AssociationRules(Command)
    Result = AnalysisAssociationRules();
EndProcedure

&AtServerNoContext
Function AnalysisAssociationRules()
    Analysis = New DataAnalysis;
    Analysis.AnalysisType = Type("DataAnalysisAssociationRules");

    Query = New Query;
    Query.Text = "
    |SELECT
    |    Sales.Recorder,
    |    Sales.Nomenclature
    |FROM
    |    AccumulationRegister.Sales AS Sales";
    Analysis.DataSource = Query.Execute();

    // This line is shown as an example only:
    // Item is the default column type.
    Analysis.ColumnSetting.Nomenclature.ColumnType = DataAnalysisColumnTypeAssociationRules.Item;

    // This line is shown as an example only:
    // Redundant is the default prune type.
    Analysis.Parameters.RulesPruneType.Value = AssociationRulesPruneType.Redundant;

    AnalysisResult = Analysis.Execute();

    // Output the analysis result to a spreadsheet document.
    Builder = New DataAnalysisReportBuilder();
    Builder.Template = Undefined;
    Builder.AnalysisType = Type("DataAnalysisAssociationRules");
    Spreadsheet = New SpreadsheetDocument;
    Builder.Put(AnalysisResult, Spreadsheet);

    Return Spreadsheet;
EndFunction
The analysis results will look as follows:
Fig. 284. Association rules analysis result
The selection uses data from 11 documents (references in the Recorder field); the number of different nomenclature items is twelve:
Nomenclature
Folding dining table
Round stool
"Coziness" sofa
"Jeans" sofa
"Jeans" armchair
Kitchen table 0.9x1.7
"COMFORT" sofa
"Summer" chair
Square stool
"COMFORT" armchair
"Wardrobe" closet
Dining table
The following group of goods has been found:
Fig. 285. Group of goods found
The whole group occurs in the document in only two cases out of eleven (which is shown in columns Number of cases and Percentage of cases).
The following association rules have been received:
Fig. 286. Association rules
Let us review the second one. The Square stool position occurred together with the Folding dining table position in two documents out of eleven. Based on this, the support is calculated: 2/11*100 = 18.18%.
Confidence is calculated as follows: both nomenclature items were purchased together in two cases, while the Folding dining table position occurred three times. Therefore, confidence is equal to 2/3*100 = 66.67%.
Importance is evaluated as the ratio between the rule’s confidence and the support of the Square stool position among the goods purchased. This position occurs in two documents out of eleven (18.18%). Importance is equal to 66.67%/18.18% = 3.67.
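The three figures above can be checked with a short calculation. The following Python sketch is not platform code; it applies the definitions of support, confidence, and importance given in the text to the eleven invoices of the data selection. The function names are illustrative, and invoice 000000002 is taken to contain the "Coziness" sofa, in line with the list of twelve items (this does not affect the rule being checked).

```python
# One set of nomenclature items per sales invoice (the Recorder value).
invoices = [
    {"Folding dining table", "Round stool"},                      # 000000001
    {"Coziness sofa"},                                            # 000000002
    {"Jeans sofa", "Jeans armchair"},                             # 000000003
    {"Kitchen table 0.9x1.7", "COMFORT sofa"},                    # 000000005
    {"Kitchen table 0.9x1.7", "Summer chair", "COMFORT sofa"},    # 000000004
    {"Folding dining table", "Square stool"},                     # 000000006
    {"COMFORT armchair"},                                         # 000000007
    {"COMFORT armchair"},                                         # 000000008
    {"Wardrobe closet"},                                          # 000000009
    {"Folding dining table", "Square stool", "Dining table"},     # 000000010
    {"Summer chair", "Round stool"},                              # 000000011
]


def cases(items):
    """Number of invoices that contain every item of the set."""
    return sum(1 for basket in invoices if items <= basket)


def support(items):
    return cases(items) / len(invoices)


def confidence(antecedent, consequent):
    return cases(antecedent | consequent) / cases(antecedent)


def importance(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)


antecedent = {"Folding dining table"}
consequent = {"Square stool"}

rule_support = round(100 * support(antecedent | consequent), 2)
rule_confidence = round(100 * confidence(antecedent, consequent), 2)
rule_importance = round(importance(antecedent, consequent), 2)
```

An importance above 1 means the consequent is bought more often together with the antecedent than it is bought overall.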
14.3.1. Rules Prune Types
Let us review an important parameter of this analysis type, i.e. PruneRulesType. System enumeration AssociationRulesPruneType contains the following prune values:
Covered
Redundant
Before proceeding to review prune variants, we will examine a number of general principles applied to association rules.
Any rule contains an antecedent and a consequent. For example:
Antecedent: If Product 1 has been purchased.
Consequent: Then Product 2 will also be purchased.
Please bear in mind that the consequent holds with a certain degree of confidence. When rules are pruned, these probability characteristics may be taken into account or ignored (in the latter case only the content of the rule matters).
14.3.1.1. Covered rules pruning
Let us review the Covered pruning option.
A rule may be covered by an antecedent or by a consequent. For example:
Rule 1. If products 1 and 3 have been purchased, then product 2 will also be purchased.
Rule 2. If product 1 has been purchased, then product 2 will also be purchased.
In this case rule 1 is considered covered, as the antecedent of the first rule is redundant in respect of the antecedent of the second rule.
An example of coverage by a consequent:
Rule 1. If product 1 has been purchased, then products 2 and 3 will also be purchased.
Rule 2. If product 1 has been purchased, then product 3 will also be purchased.
Rule 2 is covered by a consequent as the consequent of rule 1 is broader.
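Both coverage checks can be expressed as set comparisons. The sketch below is a Python illustration of the definitions above, not the algorithm the platform actually uses; the `Rule` type and the function names are assumptions introduced for this example.

```python
from typing import FrozenSet, NamedTuple


class Rule(NamedTuple):
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]


def covered_by_antecedent(rule, other):
    """Same consequent, but `rule` needs extra conditions that `other` does not."""
    return rule.consequent == other.consequent and other.antecedent < rule.antecedent


def covered_by_consequent(rule, other):
    """Same antecedent, but the consequent of `other` is broader."""
    return rule.antecedent == other.antecedent and rule.consequent < other.consequent


# Coverage by antecedent: rule 1 is covered by rule 2.
a1 = Rule(frozenset({"product 1", "product 3"}), frozenset({"product 2"}))
a2 = Rule(frozenset({"product 1"}), frozenset({"product 2"}))

# Coverage by consequent: rule 2 is covered by rule 1.
c1 = Rule(frozenset({"product 1"}), frozenset({"product 2", "product 3"}))
c2 = Rule(frozenset({"product 1"}), frozenset({"product 3"}))
```

The `<` operator on sets tests for a proper subset, so a rule is never reported as covering itself.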
14.3.1.2. Pruning redundant rules
Coverage does not take into account the probability characteristics of the rules. They are only considered if the Redundant pruning type is used.
A rule is considered redundant by antecedent if it is covered by an antecedent and its confidence is equal to the confidence of the covering rule. For example:
Rule 1. If products 1 and 3 have been purchased, then product 2 will be purchased with 75% confidence.
Rule 2. If product 1 has been purchased, then product 2 will be purchased with 75% confidence.
Rule 1 is redundant towards rule 2 (it contains an additional condition that does not "disturb" the confidence characteristics of the rule).
A rule is considered redundant by consequent if it is covered by a consequent and its number of cases is equal to the number of cases of the covering rule. For example:
Rule 1. If product 1 has been purchased, then products 2 and 3 will be purchased in three cases.
Rule 2. If product 1 has been purchased, then product 3 will be purchased in three cases.
Rule 2 is redundant towards rule 1, as it contains a simpler consequent with the same confidence characteristics.
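Redundancy adds one numeric comparison to each coverage check. The following Python sketch restates the two definitions above; it is an illustration under stated assumptions, not the platform's algorithm. The `Rule` type and function names are invented for this example, and the values marked as hypothetical are not given in the text.

```python
from typing import FrozenSet, NamedTuple


class Rule(NamedTuple):
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]
    confidence: float  # share of cases in which the rule holds
    cases: int         # number of cases in which the rule was observed


def redundant_by_antecedent(rule, other):
    """Covered by antecedent and the extra condition does not change confidence."""
    return (rule.consequent == other.consequent
            and other.antecedent < rule.antecedent
            and rule.confidence == other.confidence)


def redundant_by_consequent(rule, other):
    """Covered by consequent and observed in the same number of cases."""
    return (rule.antecedent == other.antecedent
            and rule.consequent < other.consequent
            and rule.cases == other.cases)


# Antecedent example (the case counts are hypothetical).
a1 = Rule(frozenset({"product 1", "product 3"}), frozenset({"product 2"}), 0.75, 3)
a2 = Rule(frozenset({"product 1"}), frozenset({"product 2"}), 0.75, 6)

# Consequent example (the confidence values are hypothetical).
c1 = Rule(frozenset({"product 1"}), frozenset({"product 2", "product 3"}), 0.5, 3)
c2 = Rule(frozenset({"product 1"}), frozenset({"product 3"}), 0.5, 3)
```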
14.4. THE SEQUENTIAL PATTERNS ANALYSIS TYPE
This type of analysis reveals sequences of events (sequence templates). It can be used when the order of events over a period is one of the important indicators being analyzed. For instance, it can identify a pattern of goods purchased one after another within a certain period of time.
A schema of the Sequential patterns analysis procedure is shown in Fig. 287.
A query result, spreadsheet document cell area, or a value table may function as a data source. From the point of view of this type of analysis, source columns may be divided into the following:
NotUsed – ignored in the analysis.
Sequence – data from this column is used in the analysis as an object of a sequence of events. The analysis uses a value in this column to associate data with a certain sequence of events.
Item – data from this column is used as sequence elements.
Time – this column is used to determine the time of event. This is a mandatory column for this type of analysis.
Fig. 287. A sequential patterns analysis execution diagram
The following analysis parameters impact the result of the analysis together with column type settings:
MinSupport – the minimum percentage of sequences where the sequence template is observed.
MinInterval – the minimum interval between events of a sequence (both the interval unit and its multiplicity must be defined).
MaxInterval – the maximum interval between events of a sequence (both the interval unit and its multiplicity must be defined).
TimeSliceWindow – the time slice window (both the window unit and its multiplicity must be defined).
MinLength – the minimum length of sequences searched.
FindInHierarchy – a flag of hierarchy search (covers columns of the Item type).
A number of properties use DataAnalysisTimeIntervalUnitType. This system enumeration contains the following values:
Second |
Minute | CurrentMinute
Hour | CurrentHour
Day | CurrentDay
Week | CurrentWeek
TenDays | CurrentTenDays
Month | CurrentMonth
Quarter | CurrentQuarter
HalfYear | CurrentHalfYear
Year | CurrentYear
The sequence templates found are the main result of the analysis. These templates contain the following information:
contents of a sequence template
number of cases when this sequence has been observed
maximum intervals between events (if there are only 2 events, there is one interval)
minimum intervals between events (if there are only 2 events, there is one interval)
percentage of cases when the sequence has been executed
average intervals between events (if there are only 2 events, there is one interval)
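The characteristics listed above can be illustrated for a template of two events. The Python sketch below is a simplified illustration and not the platform's algorithm: the event data is hypothetical, time is measured in whole days, and the interval is taken between the earliest occurrence of the first item and the first later occurrence of the second item within the same sequence.

```python
from statistics import mean

# Hypothetical events: (sequence id, item, day number).
events = [
    ("A", "table", 0), ("A", "chair", 10),
    ("B", "table", 5), ("B", "chair", 35),
    ("C", "table", 2),
    ("D", "sofa", 1), ("D", "table", 20),
    ("E", "chair", 3),
]


def pattern_stats(events, first, second):
    """Characteristics of the template 'first, then second'."""
    by_sequence = {}
    for sequence, item, day in events:
        by_sequence.setdefault(sequence, []).append((day, item))

    intervals = []
    for sequence_events in by_sequence.values():
        first_days = [day for day, item in sequence_events if item == first]
        if not first_days:
            continue
        start = min(first_days)
        later = [day for day, item in sequence_events
                 if item == second and day > start]
        if later:
            intervals.append(min(later) - start)

    number_of_cases = len(intervals)
    return {
        "cases": number_of_cases,
        "support_percent": 100 * number_of_cases / len(by_sequence),
        "min_interval": min(intervals) if intervals else None,
        "max_interval": max(intervals) if intervals else None,
        "avg_interval": mean(intervals) if intervals else None,
    }


stats = pattern_stats(events, "table", "chair")
```

With this made-up data the template is observed in two sequences out of five, so its support is 40%.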
Let us review how this type of analysis is executed using the following data selection:
Contractor | First purchase | Second purchase | Third purchase | Interval
V.I. Bondarev | Folding dining table, Round stool | COMFORT sofa | COMFORT armchair | 25 days, 31 days
I.P. Ivanov | Jeans sofa, Jeans armchair | | |
B.S. Petrov | Kitchen table 0.9x1.7, "Summer" chair, COMFORT sofa | COMFORT armchair | | 43 days
G.O. Sidorov | Folding dining table, Square stool | | |
V.K. Stepanov | Folding dining table, Square stool, Dining table | | |
D.E. Fedorov | Kitchen table 0.9x1.7, COMFORT sofa | Wardrobe closet | "Summer" chair, Round stool | 58 days, 29 days
Data from the Contractor column defines which chain of events a record belongs to, i.e. this column defines the sequence. The Nomenclature column supplies the elements of the sequence.
To perform this analysis, use a code fragment similar to the one below:
&AtClient
Procedure SequentialPatterns(Command)
    Result = AnalysisSequentialPatterns();
EndProcedure

&AtServerNoContext
Function AnalysisSequentialPatterns()
    Analysis = New DataAnalysis;
    Analysis.AnalysisType = Type("DataAnalysisSequentialPatterns");

    Query = New Query;
    Query.Text = "
    |SELECT
    |    Sales.Contractor,
    |    Sales.Nomenclature,
    |    Sales.Period
    |FROM
    |    AccumulationRegister.Sales AS Sales";
    Analysis.DataSource = Query.Execute();

    // The Period column supplies the time of each event.
    Analysis.ColumnsSetting.Period.ColumnType = DataAnalysisColumnTypeSequentialPatterns.Time;

    AnalysisResult = Analysis.Execute();

    // Output the analysis result to a spreadsheet document.
    Builder = New DataAnalysisReportBuilder();
    Builder.Template = Undefined;
    Builder.AnalysisType = Type("DataAnalysisSequentialPatterns");
    Spreadsheet = New SpreadsheetDocument;
    Builder.Put(AnalysisResult, Spreadsheet);

    Return Spreadsheet;
EndFunction
The Period column is assigned the Time type explicitly in the code (it is not determined analytically).
Analysis parameters set by default:
Fig. 288. Analysis parameters
The following data has been obtained in the analysis:
Fig. 289. General analysis data
The number of elements is 12. This is also the number of nomenclature positions that occur in the data selection.
Two sequences have been found:
Fig. 290. Sequences found
The first sequence occurs in two cases out of five. Therefore, support is 40%. Since the sequence depth is 2, each of these intervals contains one value.
14.5. THE DECISION TREE ANALYSIS TYPE
This type of analysis can be used to obtain a cause-and-effect hierarchy of conditions that facilitates decision making. For instance, you can obtain a condition tree that helps (with a certain degree of probability) to understand the reasons why agreements with customers are terminated and to define the conditions that determine whether an agreement will be signed. Company managers may also be profile-oriented so as to serve different groups of customers, and so on.
A schema of the Decision tree analysis procedure is shown in Fig. 291.
From the point of view of this type of analysis, source columns may be divided into the following:
NotUsed
Input
Predictable
Analysis parameters used:
MinCaseCount – minimum number of items in a node
MaxDepth – maximum depth of the tree
SimplificationType – simplification type of the decision tree
The result of the analysis is as follows:
decision tree
classification errors
Fig. 291. Decision tree analysis execution diagram
Let us review how this type of analysis is executed with the following data selection:
Contractor | Number of retail spots | Number of cars | Company age | Agreement date | Type of agreement | Status of relations
ZAO Igor | 1 | 0 | Less than a year | Less than a year | Dealer | Infringement of contract
ZAO TorgMebel | 15 | 4 | From three to ten years | Less than a year | Distributor | Terminated by contractor
ZAO TorgMebel | 1 | 10 | From three to ten years | From one to three years | Distributor | Terminated by contractor
ICP Dubrava | 1 | 1 | From one to three years | Less than a year | Dealer | Terminated by contractor
Store 15 | 1 | 1 | Over 10 years | From three to ten years | Long-term partner | Not terminated
OOO Gross | 3 | 2 | Less than a year | Less than a year | Long-term partner | Not terminated
OOO Intaris | 7 | 3 | From three to ten years | From one to three years | Long-term partner | Terminated by contractor
OOO TorgTrest | 2 | 2 | Over 10 years | From three to ten years | Long-term partner | Not terminated
PBOUL Kurochkin | 0 | 1 | Less than a year | Less than a year | Dealer | Not terminated
To perform this analysis, use a code fragment similar to the one below:
&AtClient
Procedure DecisionTree(Command)
    Result = AnalysisDecisionTree();
EndProcedure

&AtServerNoContext
Function AnalysisDecisionTree()
    Analysis = New DataAnalysis;
    Analysis.AnalysisType = Type("DataAnalysisDecisionTree");

    // Only contractors from the "Legal entities" group are analyzed.
    Group = Catalogs.Contractors.FindByDescription("Legal entities");

    Query = New Query;
    Query.Text = "
    |SELECT
    |    Contractors.Reference,
    |    Contractors.RetailCount,
    |    Contractors.VehicleCount,
    |    Contractors.OrganizationPeriodOfWork,
    |    Contractors.AgreementDate,
    |    Contractors.ContractKind,
    |    Contractors.TerminationOfRelations
    |FROM
    |    Catalog.Contractors AS Contractors
    |WHERE
    |    (NOT Contractors.IsFolder AND Contractors.Parent = &Parent)";
    Query.SetParameter("Parent", Group);
    Analysis.DataSource = Query.Execute();

    // Build the full tree without simplification.
    Analysis.Parameters.SimplificationType.Value = DecisionTreeSimplificationType.DontSimplify;

    AnalysisResult = Analysis.Execute();

    // Output the analysis result to a spreadsheet document.
    Builder = New DataAnalysisReportBuilder();
    Builder.Template = Undefined;
    Builder.AnalysisType = Type("DataAnalysisDecisionTree");
    Spreadsheet = New SpreadsheetDocument;
    Builder.Put(AnalysisResult, Spreadsheet);

    Return Spreadsheet;
EndFunction
The following decision tree is the result of the analysis:
Fig. 292. Decision tree
This tree can be presented as the following schema:
Fig. 293. Schema view of the decision tree
Classification errors occur when the derived rules do not match the actual source data selection:
Fig. 294. Classification errors
For the data specified, the resulting classification contains no errors, that is, the data in the actual selection matches the classification.
The example above uses the DontSimplify value of the SimplificationType analysis parameter, which is set programmatically. If the parameter is set to Simplify, the decision tree looks as follows:
Fig. 295. Decision tree
Tree simplification means that tree nodes are turned into leaves (redundant branching is pruned) using specific rules (or formulas, see below).
The following values are taken into account when deciding whether a node should be turned into a leaf:
Errors – number of errors in a node
ChildErrors – number of errors in child nodes
Leaves – number of leaves in a node
Cases – number of cases
The following condition must be met to turn a node into a leaf:
In this example, the condition is satisfied for nodes Company age (0.5 < 1).
Fig. 296. Classification errors
For instance, in one case the actual sample contains the value Terminated by contractor, whereas according to the classification it should be Not terminated, and so on.
14.6. THE CLUSTERIZATION ANALYSIS TYPE
Cluster analysis is a multidimensional mathematical analysis that uses multiple indicators characterizing a set of objects to group them into clusters, so that objects within one cluster are more similar to each other than to objects of other clusters.
This analysis is based on calculating the distances between objects: objects are grouped into clusters according to these distances. The distance can be measured in several ways (different metrics can be used). The following metrics are supported:
Euclidean
Squared Euclidean
City block
Maximum
Once the distances between objects are measured, one of several algorithms can be used to distribute the objects among clusters. The following clusterization methods are supported:
Nearest neighbor
Furthest neighbor
K-means
Centroid
The cluster analysis mechanism can be schematically presented as follows:
Fig. 297. Cluster analysis execution diagram
A data source is provided as input to the DataAnalysis object. A query result, a spreadsheet document, a cell area, or a value table can serve as the data source. Each source column is marked either as input or as not used. The possible column type values are defined in the DataAnalysisColumnTypeClusterization system enumeration. Besides the not used and input values, this enumeration includes other values, but they are used only for building forecasts.
Analysis is performed in accordance with the analysis parameters set.
Let us use the following code fragment as an example to illustrate how cluster analysis can be performed:
&OnClient
Procedure ClusterAnalysis(Command)
    Result = AnalysisClusterization();
EndProcedure

&OnServerWithoutContext
Function AnalysisClusterization()
    Analysis = New DataAnalysis;
    Analysis.AnalysisType = Type("DataAnalysisClusterization");
    Group = Catalogs.Contractors.FindByDescription("Legal entities");
    // Select catalog items (not folders) from the Legal entities group.
    Query = New Query;
    Query.Text = "
    |SELECT
    |    Contractors.Reference,
    |    Contractors.RetailCount,
    |    Contractors.VehicleCount,
    |    Contractors.OrganizationPeriodOfWork,
    |    Contractors.AgreementDate,
    |    Contractors.ContractKind,
    |    Contractors.TerminationOfRelations
    |FROM
    |    Catalog.Contractors AS Contractors
    |WHERE
    |    (NOT Contractors.IsFolder AND Contractors.Parent = &Parent)";
    Query.SetParameter("Parent", Group);
    Analysis.DataSource = Query.Execute();
    // Metrics selection.
    Analysis.Parameters.DistanceMetric.Value = DataAnalysisDistanceMetricType.SquaredEuclidean;
    // Clusterization method selection.
    Analysis.Parameters.ClusterizationMethod.Value = ClusterizationMethod.KMean;
    AnalysisResult = Analysis.Execute();
    // Output the analysis result to a spreadsheet document.
    Builder = New DataAnalysisReportBuilder();
    Builder.Template = Undefined;
    Builder.AnalysisType = Type("DataAnalysisClusterization");
    Spreadsheet = New SpreadsheetDocument;
    Builder.Output(AnalysisResult, Spreadsheet);
    Return Spreadsheet;
EndFunction
The query runs against the Contractors catalog. Its conditions select only catalog items (not folders) that belong to the Legal entities group.
Executing this code will define the following values as initial analysis settings (some of them are set explicitly, others are set by default).
Fig. 298. Analysis parameters
The set of columns is derived from the query selection fields. By default, all columns have equal weight. Columns of the Number and Date types are assigned the Continuous data type, and columns of all other types are assigned the Discrete data type. Column parameters can be changed as follows:
Analysis.ColumnsSetting.VehicleCount.AdditionalParameters.Weight = 2;
This line increases the weight of the VehicleCount column.
The data selection which will be analyzed is as follows:
Contractor | Number of retail spots | Number of cars | Company age | Agreement date | Type of agreement | Status of relations
ZAO Igor | 1 | 0 | Less than a year | Less than a year | Dealer | Infringement of contract
ZAO TorgMebel | 15 | 4 | From three to ten years | Less than a year | Distributor | Terminated by contractor
ZAO TorgMebel | 1 | 10 | From three to ten years | From one to three years | Distributor | Terminated by contractor
ICP Dubrava | 1 | 1 | From one to three years | Less than a year | Dealer | Terminated by contractor
Store 15 | 1 | 1 | Over 10 years | From three to ten years | Long-term partner | Not terminated
OOO Gross | 3 | 2 | Less than a year | Less than a year | Long-term partner | Not terminated
OOO Intaris | 7 | 3 | From three to ten years | From one to three years | Long-term partner | Terminated by contractor
OOO TorgTrest | 2 | 2 | Over 10 years | From three to ten years | Long-term partner | Not terminated
PBOUL Kurochkin | 0 | 1 | Less than a year | Less than a year | Dealer | Not terminated
The analysis result will be as follows:
Fig. 299. Cluster analysis result
Note that the analysis outputs data about the clusters found (their number, their centers, and the distances between them) but does not report which objects (in our case, contractors) belong to which cluster.
This is the behavior when the analysis parameters (the TableFillType parameter, in particular) are not set explicitly.
To see how the objects are distributed among the clusters, add the following line before running the analysis (but after setting its type):
Analysis.Parameters.TableFillType.Value = DataAnalysisResultTableFillType.UsedFields;
14.6.1. Metrics used
First, note the following: even though the input columns in the example above were of the continuous type (for which the notion of distance is obvious), columns of the discrete type (references to catalogs, enumeration values, and so on) can also be used in the analysis.
We now review the metrics that can be used in a cluster analysis.
14.6.1.1. Euclidean
This metric calculates the distance between two objects using the following formula:
Where:
Xi, Yi – attribute values of two objects (the distance between which we need to determine)
Wi – weighting factor of an attribute (set in the analysis column)
i – attribute number, from 1 to n
n – number of attributes
For instance, objects are characterized by a property that is 9 for one object and 5 for another one. The weighting factor of this attribute is 1. The distance between the objects is as follows:
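The arithmetic of this example can be checked with a short Python sketch. It assumes the weighted form d = sqrt(sum over i of Wi * (Xi - Yi)^2), with each weight applied to the squared difference. That placement of Wi and the function name weighted_euclidean are assumptions made for illustration and are not part of the platform API. With a weight of 1, the placement does not affect the result.

```python
import math


def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance between attribute vectors x and y.

    Assumption: each squared difference is multiplied by its weight.
    """
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))


# One attribute: 9 for the first object, 5 for the second, weight 1.
print(weighted_euclidean([9], [5], [1]))  # 4.0
```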
14.6.1.2. Squared Euclidean
This metric calculates the distance between two objects using the following formula:
Where:
Xi, Yi – attribute values of two objects (the distance between which we need to determine)
Wi – weighting factor of an attribute (set in the analysis column)
i – attribute number, from 1 to n
n – number of attributes
For instance, objects are characterized by a property that is 5 for one object and 3 for another one. The weighting factor of this attribute is 2. The distance between the objects is as follows:
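A minimal Python sketch of this example follows. It assumes the form d = sum over i of Wi * (Xi - Yi)^2, where the weight multiplies the squared difference. If the platform instead squares the weighted difference, the result would differ, so the value below holds only under this assumption. The function name weighted_squared_euclidean is hypothetical.

```python
def weighted_squared_euclidean(x, y, w):
    """Weighted squared Euclidean distance between attribute vectors.

    Assumption: each squared difference is multiplied by its weight.
    """
    return sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w))


# One attribute: 5 for the first object, 3 for the second, weight 2.
print(weighted_squared_euclidean([5], [3], [2]))  # 8 under the stated assumption
```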
14.6.1.3. City block
This metric calculates the distance between two objects using the following formula:
Where:
Xi, Yi – attribute values of two objects (the distance between which we need to determine)
Wi – weighting factor of an attribute (set in the analysis column)
i – attribute number, from 1 to n
n – number of attributes
For instance, objects are characterized by two attributes that have values of 3 and 5, 7 and 3. The weighting factor of the first one is 2, of the second – 1:
Fig. 300. Object characteristics
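The city block example can be sketched in Python as shown below. Two assumptions are made. First, the distance is taken as d = sum over i of Wi * |Xi - Yi|. Second, the example is read as the first attribute being 3 and 5 for the two objects and the second attribute being 7 and 3. The function name weighted_city_block is hypothetical.

```python
def weighted_city_block(x, y, w):
    """Weighted city block (Manhattan) distance between attribute vectors.

    Assumption: each absolute difference is multiplied by its weight.
    """
    return sum(wi * abs(xi - yi) for xi, yi, wi in zip(x, y, w))


# Assumed reading of the example: object A = (3, 7), object B = (5, 3).
object_a = [3, 7]
object_b = [5, 3]
weights = [2, 1]
print(weighted_city_block(object_a, object_b, weights))  # 2*2 + 1*4 = 8
```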
14.6.1.4. Maximum
This metric calculates the distance between two objects using the following formula:
Where:
Xi, Yi – attribute values of two objects (the distance between which we need to determine)
Wi – weighting factor of an attribute (set in the analysis column)
i – attribute number, from 1 to n
n – number of attributes
For instance, objects are characterized by two attributes that have values of 3 and 5, 7 and 3. The weighting factor of the first one is 2, of the second – 1 (see fig. 300):
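A short Python sketch of the maximum metric for this example follows. It assumes d = maximum over i of Wi * |Xi - Yi| and the same reading of the data as for the city block metric: the first attribute is 3 and 5 for the two objects, the second is 7 and 3. The function name weighted_maximum is hypothetical.

```python
def weighted_maximum(x, y, w):
    """Weighted maximum (Chebyshev) distance between attribute vectors.

    Assumption: each absolute difference is multiplied by its weight.
    """
    return max(wi * abs(xi - yi) for xi, yi, wi in zip(x, y, w))


# Assumed reading of the example: object A = (3, 7), object B = (5, 3).
print(weighted_maximum([3, 7], [5, 3], [2, 1]))  # max(2*2, 1*4) = 4
```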
14.6.2. Clusterization methods
The clusterization method determines the principle by which an object is assigned to a group and the algorithm used to form clusters.
Any clusterization algorithm is aimed at the following:
Minimizing variability inside clusters,
Maximizing variability between clusters.
The differences between these methods are illustrated using the objects shown in fig. 301.
Let us imagine that objects are distributed into two groups. The first one includes objects 1, 2, and 3. The second group consists of objects 4, 5, and 6.
Fig. 301. Groups of objects
14.6.2.1. Nearest neighbor
A clusterization method that assigns an object to the group whose nearest member is at the minimal distance from it.
In this example, object 7 joins the group that includes object 4. The nearest objects of the two groups are objects 3 and 4, and the distance to object 4 is the smaller of the two.
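The nearest neighbor rule can be sketched in Python as follows. The coordinates are hypothetical, since the layout in fig. 301 is not given numerically, and plain Euclidean distance is assumed. The function name assign_nearest_neighbor is introduced here only for illustration.

```python
import math


def assign_nearest_neighbor(obj, groups):
    """Pick the group whose closest member lies nearest to obj."""
    return min(groups, key=lambda name: min(math.dist(obj, m) for m in groups[name]))


# Hypothetical coordinates for objects 1-3 and 4-6.
groups = {
    "first": [(1, 1), (2, 1), (3, 2)],
    "second": [(5, 2), (7, 3), (6, 4)],
}
object_7 = (4.5, 3)
print(assign_nearest_neighbor(object_7, groups))  # second
```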
14.6.2.2. Furthest neighbor
A clusterization method that assigns an object to the group whose furthest member is at the minimal distance from it.
In this example, object 7 joins the group that includes object 5. The furthest objects of the two groups are objects 1 and 5, and the distance to object 5 is the smaller of the two.
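A minimal Python sketch of the furthest neighbor rule is shown below. As before, the coordinates are hypothetical (fig. 301 gives no numbers), Euclidean distance is assumed, and the function name assign_furthest_neighbor is illustrative only.

```python
import math


def assign_furthest_neighbor(obj, groups):
    """Pick the group whose most distant member lies nearest to obj."""
    return min(groups, key=lambda name: max(math.dist(obj, m) for m in groups[name]))


# Hypothetical coordinates for objects 1-3 and 4-6.
groups = {
    "first": [(1, 1), (2, 1), (3, 2)],
    "second": [(5, 2), (7, 3), (6, 4)],
}
object_7 = (4.5, 3)
# Furthest members are (1, 1) and (7, 3); the latter is closer to object 7.
print(assign_furthest_neighbor(object_7, groups))  # second
```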
14.6.2.3. Centroid
A clusterization method that assigns an object to the group whose centroid is at the minimal distance from it.
Fig. 302. Groups of objects
In the example shown in the figure, object 7 is added to the group that contains objects 4, 5, and 6, because the distance to that group's centroid (an imaginary object with average attribute values) is minimal.
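The centroid rule can be sketched in Python as follows. The coordinates are hypothetical (the figure gives no numbers), Euclidean distance is assumed, and the names centroid and assign_by_centroid are introduced only for this illustration.

```python
import math


def centroid(members):
    """Imaginary object whose attributes are the averages over the members."""
    return tuple(sum(values) / len(members) for values in zip(*members))


def assign_by_centroid(obj, groups):
    """Pick the group whose centroid lies nearest to obj."""
    return min(groups, key=lambda name: math.dist(obj, centroid(groups[name])))


# Hypothetical coordinates for objects 1-3 and 4-6.
groups = {
    "first": [(1, 1), (2, 1), (3, 2)],
    "second": [(5, 2), (7, 3), (6, 4)],
}
object_7 = (4.5, 3)
print(centroid(groups["second"]))            # (6.0, 3.0)
print(assign_by_centroid(object_7, groups))  # second
```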
14.6.2.4. K-means
This method takes the objects that come first in the selection as the initial cluster centers. Then the next object is selected and assigned to one of the clusters based on its distance to the cluster centers. The center of the cluster to which the object has been added is recalculated.
The procedure continues until all objects have been analyzed. Then the objects are traversed again, starting from the first one. The procedure is repeated as long as the cluster centers keep changing.
Fig. 303. Sample objects layout
For instance, objects 1 and 2 have been arbitrarily selected as cluster centers. Object 3 is added to the cluster centered on object 1, and the center of the first cluster is recalculated (it now lies between objects 1 and 3). Object 4 is added to the second cluster (its center is also recalculated).
After all objects under analysis have been processed, objects 1 and 3 belong to cluster 1, while the second cluster includes all other objects (with its center presumably inside the triangle formed by objects 4, 7, and 6).
Then the objects are traversed and distributed between the clusters again (cluster centers are constantly recalculated).
In the third pass, object 2, which was initially the center of the second cluster, will probably move to the first cluster.
At the end of algorithm execution, cluster 1 will include objects 1, 2, and 3.
The second cluster will include objects 4, 5, 6, and 7.
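The procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the platform's implementation: the coordinates are hypothetical (the layout in fig. 303 is not given numerically), squared Euclidean distance is used, a cluster center is taken as the mean of its current members, and a max_passes safety cap is added. The names sq_dist and k_means_sequential are introduced only for this sketch.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means_sequential(points, k, max_passes=100):
    """Sequential K-means: the first k objects seed the cluster centers."""
    assignment = [None] * len(points)
    centers = [list(points[i]) for i in range(k)]
    for i in range(k):
        assignment[i] = i

    def recompute(cluster):
        members = [p for p, a in zip(points, assignment) if a == cluster]
        if members:  # an emptied cluster keeps its previous center
            centers[cluster] = [sum(c) / len(members) for c in zip(*members)]

    for _ in range(max_passes):  # safety cap, an assumption of this sketch
        changed = False
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            if assignment[i] != nearest:
                previous = assignment[i]
                assignment[i] = nearest
                recompute(nearest)
                if previous is not None:
                    recompute(previous)
                changed = True
        if not changed:  # no object moved, so the centers are stable
            break
    return assignment, centers


# Hypothetical coordinates for objects 1..7; objects 1 and 2 seed the clusters.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (9, 9)]
assignment, centers = k_means_sequential(points, 2)
print(assignment)  # [0, 0, 0, 1, 1, 1, 1]
```

With these coordinates, object 2 starts as the center of the second cluster and moves to the first cluster on the second pass, which mirrors the walkthrough above.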
14.6.2.5. Data output to a dendrogram
If a clusterization method other than K-means is used, the results of the cluster analysis are output as a dendrogram (the analysis algorithm must support outputting the distribution of the analyzed objects among clusters):
Fig. 304. Dendrogram