1C:Enterprise 8.3. Developer Guide. Chapter 14. Data Analysis and Forecasting



DATA ANALYSIS AND FORECASTING

The data analysis and forecasting mechanism is intended for implementing tools that discover dependencies usually hidden behind large amounts of data. For instance, one can analyze sales data and identify groups of goods that are usually purchased together. Later, this information can be used (among other options) to arrange goods in a retail store. Goods may be placed together (a buyer who comes to the store sees a barbecue next to lighter fluid, charcoal, meat, fishing tackle, and a rubber boat, and purchases all of it) or in different areas of the store (a buyer who comes for milk will have traversed half of the store by the time he or she gets to the bread section).

Another use of the data analysis mechanism is forecasting a contractor’s behavior based on the existing data. By completing such an analysis one can find out to what degree their purchasing volumes depend on territorial distribution, the size of the company, period of cooperation, and other factors. Based on these dependencies, a new contractor’s behavior may be forecast, and a strategy of cooperation developed.

Use the forecasting functionality to plan your purchasing campaign. For instance: last month, a pet store sold 100 guinea pigs. The store must plan purchasing volumes for the next month. One of the most widely used ways to do this is to apply an adjustment factor to past sales periods. For instance, the adjustment factor (demand increase factor) is 1.5. Therefore, it is reasonable to plan to purchase 150 guinea pigs for the next month. However, if we analyze what customers buy after they buy such a pet, other conclusions can be drawn. By using data analysis and forecasting features one can see that pet food, litter, hay, and other "accessories" should be purchased.

It should be noted that this chapter mostly reviews 1C:Enterprise mechanisms and only briefly mentions ways to use the information obtained through simple examples.

14.1. OVERVIEW

A general overview of the data analysis and forecasting mechanism can be presented as follows:

Fig. 279. Interaction of the data analysis mechanism elements

This is the mechanism to be used to work with infobase data and data from other sources, preloaded into a value table or a spreadsheet document.

Apply one of the types of analysis to the source data to obtain the analysis results. The analysis results are a given data behavior model; it can be displayed in the final document or stored for further use.

The analysis results can be used to create a forecast model to forecast new data behavior, in line with the existing model.

For instance, one can analyze which goods are purchased together (on a single invoice) and store the analysis-based forecasting model in the database. In the future, when an invoice is created, a previously stored forecasting model can be retrieved from the database. One can input new data from an invoice and receive a forecast as an output, i.e. a list of goods that the next customer will probably buy (with a certain degree of likelihood) if these goods are offered at a store.

14.1.1. Main Mechanism Objects

Interaction between the main objects of the data analysis and forecasting mechanism can be shown as follows:

Fig. 280. Interconnection of main objects

Data analysis is the object that performs the analysis. A data source is assigned to this object, and the analysis parameters are specified. Executing the analysis produces a data analysis result; each type of analysis has its own object for working with the analysis results:

„ DataAnalysisSummaryStatisticsResult

„ DataAnalysisAssociationRulesResult

„ DataAnalysisSequentialPatternsResult

„ DataAnalysisDecisionTreeResult

„ DataAnalysisClusterizationResult

Setting data analysis columns – a collection of data analysis input columns. A data type, column role, and additional settings depending on the type of analysis performed are specified for each column.

Data analysis parameters – a set of parameters to be used in data analysis. The range of parameters depends on the selected type of analysis. For instance, the number of clusters to divide source objects into, a measurement of distance between objects, etc. should be specified for cluster analysis.

Data source – source data for the analysis. A query result, spreadsheet document cell area, or a value table may function as a data source.

Data analysis result – a special object that contains information on the result of analysis. Each type of analysis provides its own result. For instance, Decision tree data analysis result will be an object of type DataAnalysisDecisionTreeResult. In the future, the result may be output to a spreadsheet document with the help of the data analysis result builder. It can be output through a programmatic access to its content, and can also be used to create a forecast model. Any data analysis result may be stored for further use.

Forecast model – a special object that performs forecasting on the basis of input data (a forecast selection, selection column settings, result settings, and an analysis result). The type of the forecast model depends on the type of data analysis result. For instance, a model created for Association rules will have the type PredictionModelAssociationRules. This model outputs forecasts of the following kind: since this customer has purchased a specified set of goods, he or she will buy another set of goods with a certain probability. The forecasting data source is the input to the forecast model. The result is a value table that contains forecast values.

Input column setup – a set of special objects that define the correspondence between forecast model columns and forecast selection columns. For instance, a forecast model column named Goods may match a Nomenclature selection column.

Result columns setup – controls which columns will be included in the resulting forecast table. For instance, a nomenclature item that the customer will probably buy and the probability of purchase can be output as an association rules result.

Result columns – a table of results that includes the columns specified in the result column settings and contains forecast data. The specific content depends on the type of analysis.

14.1.2. Types of Data Analysis

The data analysis and forecasting mechanism implements a number of data analysis types:

„ Summary statistics

„ Association rules

„ Sequential patterns

„ Cluster analysis

„ Decision tree

14.1.2.1. Summary statistics

The Summary statistics analysis type is a mechanism for gathering general information on the data from the obtained data source. This type of analysis is used to pre-analyze the information.

This analysis shows a number of characteristics of categorical and continuous fields. When a report is output to a spreadsheet document, pie charts are built to display the contents of the fields.

14.1.2.2. Association rules

This type of analysis searches for the groups of objects or characteristic values that usually go together, and searches for association rules. Association rules can be used to determine what goods and services are usually purchased together.

This type of analysis can be used with hierarchical data to find association rules for specific goods and groups of goods. An important feature of this type of analysis is its capability to work with both object based data sources (where each column contains a certain characteristic of an object) and with event based sources where object characteristics are placed together in a single column.

14.1.2.3. Sequential patterns

The Sequential patterns analysis type helps reveal event sequences in a data source. For instance, this may be a chain of goods or services that are usually bought in sequence.

This type of analysis can search through a hierarchy, which makes it possible to track sequences not only of specific items but also of their parent groups.

14.1.2.4. Cluster analysis

A cluster analysis is a way to divide a source set of objects being analyzed into groups of objects, so that each object more closely resembles objects in the same group rather than the objects of other groups. In the future, when the received groups (called clusters) are analyzed, one can define what characterizes a specific group, and make decisions on different methods used for working with objects from different groups. For instance, you can use cluster analysis to divide customers into groups and use different customer relationship strategies.

Cluster analysis parameters let you configure the partitioning algorithm, dynamically change the set of characteristics taken into account in the analysis, and specify weighting factors for these characteristics.

Clusterization results may be output as a dendrogram, a special type of diagram for graphically representing cluster analysis results.

14.1.2.5. Decision tree

The Decision tree analysis type is a way to create a hierarchical structure of classifying rules presented as a tree.

To build a decision tree, select a target attribute to base the classifier on, as well as a number of input attributes to be used to create the rules. A target attribute may contain information on whether a customer started using another service provider, whether the transaction has been successful and the work has been completed successfully, etc. Possible input attributes are as follows: age of an employee, his or her period of employment, the material wealth of the customer, the number of employees in the company, etc.

Analysis results are presented as a tree, each node of which contains a condition. In order to decide which class a certain new object should be referred to, it is necessary to answer questions in the nodes and complete a chain of steps from the root to a leaf of the tree. A positive answer takes the user to the next sub-node, while a negative answer takes the user to a neighboring node.
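The traversal described above can be sketched in a few lines. This is only an illustration in Python, not the platform's implementation: the Node and Leaf classes, the attribute names, and the thresholds are all hypothetical, and the tree is simplified to two branches per node.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    label: str  # class assigned to objects that reach this leaf


@dataclass
class Node:
    question: str                  # human-readable condition
    test: Callable[[dict], bool]   # returns True for a positive answer
    yes: Union["Node", Leaf]       # followed on a positive answer
    no: Union["Node", Leaf]        # followed on a negative answer


def classify(tree, obj):
    """Walk from the root to a leaf, answering the question in each node."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.test(obj) else node.no
    return node.label


# Hypothetical tree: attributes and thresholds are invented for illustration.
tree = Node(
    question="Company age is less than a year?",
    test=lambda c: c["company_age_years"] < 1,
    yes=Leaf("Not terminated"),
    no=Node(
        question="More than 5 retail spots?",
        test=lambda c: c["retail_spots"] > 5,
        yes=Leaf("Terminated by contractor"),
        no=Leaf("Not terminated"),
    ),
)

print(classify(tree, {"company_age_years": 4, "retail_spots": 7}))
```

Each call follows exactly one path from the root to a leaf, so the cost of classifying a new object is bounded by the depth of the tree.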

A set of analysis parameters can be used to control the accuracy of the resulting tree.

14.1.3. Forecasting Models

Forecasting models created by the mechanism are special objects generated from the data analysis results. In future they can be used to automatically forecast new data.

For instance, an association rules forecasting model created when the customer purchases were analyzed can be used to work with a customer at a store, so as to offer the goods that he or she will probably buy in conjunction with the main purchase.

14.2. THE SUMMARY STATISTICS ANALYSIS TYPE

The Summary statistics analysis type can be used for preliminary data analysis, for example before any other type of analysis is performed.

A query result, a value table, or a cell area of a spreadsheet document can be used as a source of data for analysis.

Source data (from the point of view of this analysis) may be continuous or discrete. Data of the Number and Date types is continuous. Data of other types is discrete.

Different information can be obtained for the columns of different types.

Discrete data:

„ Number of values – is the number of values in a data source column (NULL is not considered a value).

„ Number of unique values (except for repeated values).

„ Mode – is the most frequent value in the data source. If the data contains several values that occur with the same frequency, the mode is the first such value found.

„ Frequency is the number of occurrences of a value in the data selection.

„ Relative frequency is defined as a relation between the number of occurrences of the value to the total number of values.

„ Accumulated frequency is the total of the value frequency and the total of frequencies of previous data selection values.

„ Accumulated relative frequency is the total of the relative frequency of the value and the relative frequencies of previous values.

Continuous data:

„ Number of values.

„ Minimum value.

„ Maximum value.

„ Average.

„ Range is the difference between the maximum and the minimum values.

„ Standard deviation (root-mean-square deviation).

„ Median is a value in the middle of the selection.

It should be noted that if several fields of different types are analyzed simultaneously, they are analyzed independently (without any mutual correlation).

Let us review these characteristics using an example.

The data selection (the source for the analysis) looks as follows:

Nomenclature item      | Quantity | Nomenclature item     | Quantity
Folding dining table   | 1        | Folding dining table  | 1
Round stool            | 2        | Square stool          | 3
"Coziness" sofa        | 1        | "COMFORT" armchair    | 2
"Jeans" sofa           | 1        | "COMFORT" armchair    | 2
"Jeans" armchair       | 2        | "Wardrobe" closet     | 1
Kitchen table 0.9x1.7  | 1        | Folding dining table  | 1
"COMFORT" sofa         | 1        | Square stool          | 2
Kitchen table 0.9x1.7  | 1        | Dining table          | 1
"Summer" chair         | 4        | "Summer" chair        | 2
"COMFORT" sofa         | 1        | Round stool           | 2

The following characteristics will be calculated for the Count field (continuous analysis data type):

Characteristic      | Value
Number of values    | 20
Minimum             | 1
Maximum             | 4
Average             | 1.6
Range               | 3
Standard deviation  | 0.8208
Median              | 1
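The figures for the Count field can be checked by hand. The Python sketch below is an illustration only, not the platform's code; it takes the 20 quantities from the sample selection and computes the same characteristics. Note that the value 0.8208 corresponds to the sample standard deviation (n - 1 in the denominator); the population formula would give 0.8.

```python
import statistics

# Quantity values from the sample data selection (20 rows).
quantities = [1, 2, 1, 1, 2, 1, 1, 1, 4, 1,
              1, 3, 2, 2, 1, 1, 2, 1, 2, 2]

summary = {
    "Number of values": len(quantities),
    "Minimum": min(quantities),
    "Maximum": max(quantities),
    "Average": statistics.mean(quantities),
    "Range": max(quantities) - min(quantities),
    # Sample standard deviation (n - 1 in the denominator).
    "Standard deviation": round(statistics.stdev(quantities), 4),
    "Median": statistics.median(quantities),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```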

The following characteristics will be obtained for the Nomenclature field:

Characteristic           | Value
Number of values         | 20
Number of unique values  | 12
Mode                     | Folding dining table
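The discrete characteristics can be reproduced the same way. The following Python sketch is an illustration, not the platform's algorithm. Item names are normalized (quotation marks dropped, one spelling per item), and the order in which equally frequent values are printed is an assumption of this sketch.

```python
from collections import Counter

# Nomenclature values from the sample data selection (20 rows).
items = [
    "Folding dining table", "Round stool", "Coziness sofa", "Jeans sofa",
    "Jeans armchair", "Kitchen table 0.9x1.7", "COMFORT sofa",
    "Kitchen table 0.9x1.7", "Summer chair", "COMFORT sofa",
    "Folding dining table", "Square stool", "COMFORT armchair",
    "COMFORT armchair", "Wardrobe closet", "Folding dining table",
    "Square stool", "Dining table", "Summer chair", "Round stool",
]

counts = Counter(items)      # frequency of each value
total = len(items)           # number of values
mode, mode_frequency = counts.most_common(1)[0]

print("Number of values:", total)
print("Number of unique values:", len(counts))
print("Mode:", mode)

# Frequency table: frequency, relative frequency, and their running totals.
accumulated = 0
for value, frequency in counts.most_common():
    accumulated += frequency
    print(f"{value:24} {frequency:2} {frequency / total:6.1%} "
          f"{accumulated:3} {accumulated / total:6.1%}")
```

The last row of the printed table always ends with an accumulated frequency equal to the number of values and an accumulated relative frequency of 100%.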

A frequency table for the nomenclature values will look as follows:

Fig. 281. Frequency table

The relative frequency is shown in the diagram below.

Fig. 282. Frequency diagram

To perform this analysis, use a code fragment similar to the one below:

&AtClient
Procedure SummaryStatistics(Command)
    // Result is a form attribute of type SpreadsheetDocument.
    Result = AnalysisSummaryStatistics();
EndProcedure

&AtServerNoContext
Function AnalysisSummaryStatistics()
    // Create the analysis object and select the analysis type.
    Analysis = New DataAnalysis;
    Analysis.AnalysisType = Type("DataAnalysisSummaryStatistics");

    // The query result is the data source for the analysis.
    Query = New Query;
    Query.Text = "
    |SELECT
    |    Sales.Nomenclature,
    |    Sales.Count
    |FROM
    |    AccumulationRegister.Sales AS Sales";

    Analysis.DataSource = Query.Execute();
    AnalysisResult = Analysis.Execute();

    // The report builder outputs the result to a spreadsheet document.
    Builder = New DataAnalysisReportBuilder();
    Builder.Template = Undefined;
    Builder.AnalysisType = Type("DataAnalysisSummaryStatistics");

    Spreadsheet = New SpreadsheetDocument;
    Builder.Put(AnalysisResult, Spreadsheet);
    Return Spreadsheet;
EndFunction

Data analysis operations are performed in a server function executed without context, which returns a spreadsheet document containing the analysis results to the client. First, the DataAnalysis object is created. Next, the type of analysis to be performed is selected.

A query is then created and its text is set. The query result is set as the analysis data source. The analysis itself is performed when the Execute() method of the DataAnalysis object is called. The DataAnalysis object has no tools for visualizing analysis results; the DataAnalysisReportBuilder object is used for this purpose. When this object is created, the type of analysis is specified again. Then the analysis result is passed as the first parameter of the Put() method, and the previously created SpreadsheetDocument object is passed as the second parameter.

At the end of the algorithm, the spreadsheet document containing the analysis result is returned to the client and assigned to the Result data processor attribute of type SpreadsheetDocument.

As a result, data similar to that analyzed above will be obtained.

14.3. THE ASSOCIATION RULES ANALYSIS TYPE

As already mentioned, this analysis type searches for combinations of objects or characteristic values that frequently occur together. It can be used to determine groups of goods that are usually purchased together, to identify the most attractive information sources (and optimize the costs of these sources), and so on.

A schematic view of the Association rules analysis type is as follows:

Fig. 283. The association rules analysis type: execution diagram

A query result, spreadsheet document cell area, or a value table may function as a data source. From the point of view of this type of analysis, source columns may be divided as follows:

„ NotUsed – ignored by the analysis.

„ Object – data from this column is used as objects (or events) of the executed analysis. Based on the values of this column, values of another column (Item) refer to one associated group.

„ Item – data from this column is used to obtain stable groups of values and create association rules.

The following analysis parameters impact the analysis result, together with column types settings:

„ MinSupport – determines the minimum percentage of cases when a certain combination of elements should occur. The groups where this value is less than the specified value are not included into the analysis result.

„ MinConfidence – shows the minimum percentage of cases when the rule is followed.

„ MinImportance – groups with an importance value less than the specified value are not included in the analysis result.

„ PruneRulesType – one of the variants of the AssociationRulesPruneType system enumeration:

Redundant – redundant rules are pruned.

Covered – the rules covered by other rules are pruned.

The result of analysis is as follows:

„ Information on the data (number of objects, number of items, average number of items in an object, number of groups found, number of association rules found).

„ Groups of items found – the contents of the group, the number of cases, and the percentage of cases where this group occurs is specified.

„ Association rules detected – a source set of elements, consequent (structure of elements), percentage of cases, confidence, and importance of the rule are specified.

Let us review the peculiarities of this type of analysis with the following data selection (we will try to determine a standard set of goods usually purchased together):

Recorder                     | Nomenclature
Sales invoice No. 000000001  | Folding dining table; Round stool
Sales invoice No. 000000002  | COMFORT sofa
Sales invoice No. 000000003  | Jeans sofa; Jeans armchair
Sales invoice No. 000000005  | Kitchen table 0.9x1.7; COMFORT sofa
Sales invoice No. 000000004  | Kitchen table 0.9x1.7; "Summer" chair; COMFORT sofa
Sales invoice No. 000000006  | Folding dining table; Square stool
Sales invoice No. 000000007  | COMFORT armchair
Sales invoice No. 000000008  | COMFORT armchair
Sales invoice No. 000000009  | Wardrobe closet
Sales invoice No. 000000010  | Folding dining table; Square stool; Dining table
Sales invoice No. 000000011  | "Summer" chair; Round stool

The attribute that relates data to one group is the Recorder value (nomenclature items specified in one document are considered to be purchased together). That means that the Recorder will be an object of analysis, and the Nomenclature will be an item of analysis.

The following code fragment will be used for analysis:

&AtClient
Procedure AssociationRules(Command)
    Result = AnalysisAssociationRules();
EndProcedure

&AtServerNoContext
Function AnalysisAssociationRules()
    Analysis = New DataAnalysis;
    Analysis.AnalysisType = Type("DataAnalysisAssociationRules");

    Query = New Query;
    Query.Text = "
    |SELECT
    |    Sales.Recorder,
    |    Sales.Nomenclature
    |FROM
    |    AccumulationRegister.Sales AS Sales";

    Analysis.DataSource = Query.Execute();

    // This line is shown as an example only:
    // Item is the default column type.
    Analysis.ColumnSetting.Nomenclature.ColumnType = DataAnalysisColumnTypeAssociationRules.Item;

    // This line is shown as an example only:
    // Redundant is the default prune type.
    Analysis.Parameters.RulesPruneType.Value = AssociationRulesPruneType.Redundant;

    AnalysisResult = Analysis.Execute();

    // The report builder outputs the result to a spreadsheet document.
    Builder = New DataAnalysisReportBuilder();
    Builder.Template = Undefined;
    Builder.AnalysisType = Type("DataAnalysisAssociationRules");

    Spreadsheet = New SpreadsheetDocument;
    Builder.Put(AnalysisResult, Spreadsheet);

    Return Spreadsheet;
EndFunction

The analysis results will look as follows:

Fig. 284. Association rules analysis result

The selection uses data from 11 documents (a reference in the Recorder field), the number of different nomenclature items is twelve:

Nomenclature

Folding dining table

Round stool

"Coziness" sofa

Jeans sofa

Jeans armchair

Kitchen table 0.9x1.7

"COMFORT" sofa

"Summer" chair

Square stool

"COMFORT" armchair

Wardrobe closet

Dining table

The following group of goods has been found:

Fig. 285. Group of goods found

The whole group occurs in the document in only two cases out of eleven (which is shown in columns Number of cases and Percentage of cases).

The following association rules have been received:

Fig. 286. Association rules

Let us review the second one. The Square stool item occurred together with the Folding dining table item in two documents out of eleven. Based on this, the support has been calculated: 2/11*100 = 18.18%.

Confidence has been calculated as follows: both nomenclature items have been purchased in two cases, while position Folding dining table occurred 3 times. Based on this, confidence is equal to 2/3*100 = 66.67%.

Importance is evaluated as a ratio between the rule’s confidence and the support of Square stool position in the goods purchased. This position occurs in two documents out of eleven (18.18%). Importance is equal to 66.67%/18.18% = 3.67.
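The three figures can be recomputed from the invoice data. The Python sketch below is an illustration of the formulas described above, not the platform's implementation; the helper names share() and rule_metrics() are hypothetical, and item names are normalized (quotation marks dropped, one spelling per item).

```python
# Invoices from the sample selection: recorder number -> set of items.
invoices = {
    1: {"Folding dining table", "Round stool"},
    2: {"COMFORT sofa"},
    3: {"Jeans sofa", "Jeans armchair"},
    4: {"Kitchen table 0.9x1.7", "Summer chair", "COMFORT sofa"},
    5: {"Kitchen table 0.9x1.7", "COMFORT sofa"},
    6: {"Folding dining table", "Square stool"},
    7: {"COMFORT armchair"},
    8: {"COMFORT armchair"},
    9: {"Wardrobe closet"},
    10: {"Folding dining table", "Square stool", "Dining table"},
    11: {"Summer chair", "Round stool"},
}


def share(items):
    """Fraction of invoices that contain every item in `items`."""
    hits = sum(1 for basket in invoices.values() if items <= basket)
    return hits / len(invoices)


def rule_metrics(antecedent, consequent):
    """Support, confidence, and importance of the rule antecedent -> consequent."""
    support = share(antecedent | consequent)
    confidence = support / share(antecedent)
    importance = confidence / share(consequent)
    return support, confidence, importance


support, confidence, importance = rule_metrics(
    {"Folding dining table"}, {"Square stool"})
print(f"support    = {support:.2%}")     # 18.18%
print(f"confidence = {confidence:.2%}")  # 66.67%
print(f"importance = {importance:.2f}")  # 3.67
```

An importance above 1 means that buyers of the antecedent purchase the consequent more often than buyers in general do.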

14.3.1. Rules Prune Types

Let us review an important parameter of this analysis type, i.e. PruneRulesType. System enumeration AssociationRulesPruneType contains the following prune values:

„ Covered

„ Redundant

Before proceeding to review prune variants, we will examine a number of general principles applied to association rules.

Any rule contains an antecedent and a consequent. For example:

„ Antecedent: If Product 1 has been purchased.

„ Consequent: Then Product 2 will also be purchased.

Please bear in mind that the consequent holds with a certain degree of confidence. When rules are pruned, these probability characteristics may be taken into account or ignored (in the latter case, only the content of the rule matters).

14.3.1.1. Covered rules pruning

Let us review the Covered pruning option.

A rule may be covered by an antecedent or by a consequent. For example:

„ Rule 1. If products 1 and 3 have been purchased, then product 2 will also be purchased.

„ Rule 2. If product 1 has been purchased, then product 2 will also be purchased.

In this case rule 1 is considered covered, as the antecedent of the first rule is redundant in respect of the antecedent of the second rule.

An example of coverage by a consequent:

„ Rule 1. If product 1 has been purchased, then products 2 and 3 will also be purchased.

„ Rule 2. If product 1 has been purchased, then product 3 will also be purchased.

Rule 2 is covered by a consequent as the consequent of rule 1 is broader.

14.3.1.2. Pruning redundant rules

Coverage does not take into account the probability characteristics of the rules. They are only considered if the Redundant pruning type is used.

A rule is considered redundant by antecedent if it is covered by an antecedent and its confidence is equal to the confidence of the covering rule. For example:

„ Rule 1. If products 1 and 3 have been purchased, then product 2 will be purchased with 75% confidence.

„ Rule 2. If product 1 has been purchased, then product 2 will be purchased with 75% confidence.

Rule 1 is redundant towards rule 2 (it contains an additional condition that does not "disturb" the confidence characteristics of the rule).

A rule is considered redundant by consequent if it is covered by a consequent and its number of cases is equal to the number of cases of the covering rule. For example:

„ Rule 1. If product 1 has been purchased, then products 2 and 3 will be purchased in three cases.

„ Rule 2. If product 1 has been purchased, then product 3 will be purchased in three cases.

Rule 2 is redundant towards rule 1, as it contains a simpler consequent with the same confidence characteristics.
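The definitions of covered and redundant rules can be expressed as set comparisons. The Python sketch below is one possible reading of these definitions, not the platform's algorithm; the Rule class, the function names, and the confidence and case values not given in the examples above are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    antecedent: frozenset
    consequent: frozenset
    confidence: float   # share of cases where the rule holds
    cases: int          # number of cases where the rule is observed


def covered_by_antecedent(rule, other):
    # Same consequent, but `other` needs fewer conditions.
    return (rule.consequent == other.consequent
            and other.antecedent < rule.antecedent)


def covered_by_consequent(rule, other):
    # Same antecedent, but `other` predicts a broader set of items.
    return (rule.antecedent == other.antecedent
            and rule.consequent < other.consequent)


def is_covered(rule, rules):
    return any(covered_by_antecedent(rule, o) or covered_by_consequent(rule, o)
               for o in rules)


def is_redundant(rule, rules):
    # Coverage plus equal confidence (antecedent) or equal cases (consequent).
    return any(
        (covered_by_antecedent(rule, o) and rule.confidence == o.confidence)
        or (covered_by_consequent(rule, o) and rule.cases == o.cases)
        for o in rules)


fs = frozenset
# Redundancy by antecedent (the case counts are hypothetical).
a1 = Rule(fs({"P1", "P3"}), fs({"P2"}), 0.75, 3)
a2 = Rule(fs({"P1"}), fs({"P2"}), 0.75, 3)
# Redundancy by consequent (the confidence values are hypothetical).
b1 = Rule(fs({"P1"}), fs({"P2", "P3"}), 0.6, 3)
b2 = Rule(fs({"P1"}), fs({"P3"}), 0.6, 3)

print(is_redundant(a1, [a1, a2]), is_redundant(a2, [a1, a2]))  # True False
print(is_redundant(b2, [b1, b2]), is_redundant(b1, [b1, b2]))  # True False
```

Every redundant rule is also covered, so pruning covered rules removes at least as many rules as pruning redundant ones.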

14.4. THE SEQUENTIAL PATTERNS ANALYSIS TYPE

This type of analysis reveals sequences of events (sequence templates). It can be used when a sequence of events over a period is one of the important indicators being analyzed. For instance, a pattern of goods purchased in sequence within a certain period of time may be identified.

A schema of the Sequential patterns analysis procedure is shown in fig. 287.

A query result, spreadsheet document cell area, or a value table may function as a data source. From the point of view of this type of analysis, source columns may be divided into the following:

„ NotUsed – ignored in the analysis.

„ Sequence – data from this column is used in the analysis as an object of a sequence of events. The analysis uses a value in this column to associate data with a certain sequence of events.

„ Item – data from this column is used as sequence elements.

„ Time – this column is used to determine the time of event. This is a mandatory column for this type of analysis.

Fig. 287. A sequential patterns analysis execution diagram

The following analysis parameters impact the result of the analysis together with column type settings:

„ MinSupport – the minimum percentage of sequences where the sequence template is observed.

„ MinInterval – the minimum sequence interval (the interval measurement unit and its multiplicity must be specified).

„ MaxInterval – the maximum sequence interval (the interval measurement unit and its multiplicity must be specified).

„ TimeSliceWindow – the time slice window (the measurement unit of the window and its multiplicity must be specified).

„ MinLength – the minimum length of sequences searched.

„ FindInHierarchy – a flag of hierarchy search (covers columns of the Item type).

A number of properties use DataAnalysisTimeIntervalUnitType. This system enumeration contains the following values:

Second

 

Minute

CurrentMinute

Hour

CurrentHour

Day

CurrentDay

Week

CurrentWeek

TenDays

CurrentTenDays

Month

CurrentMonth

Quarter

CurrentQuarter

HalfYear

CurrentHalfYear

Year

CurrentYear

The sequence templates found are the main result of the analysis. These templates contain the following information:

„ contents of a sequence template

„ number of cases when this sequence has been observed

„ maximum intervals between events (if there are only 2 events, there is one interval)

„ minimum intervals between events (if there are only 2 events, there is one interval)

„ percentage of cases when the sequence has been executed

„ average intervals between events (if there are only 2 events, there is one interval)
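To make these result fields concrete, the Python sketch below computes them for a single two-event template. It is a simplified illustration, not the platform's search algorithm: the event data and the pattern_stats() helper are hypothetical, and the interval is measured from the earliest occurrence of the first item to the earliest later occurrence of the second.

```python
from collections import defaultdict

# Hypothetical events: (sequence id, item, day number of the purchase).
events = [
    ("A", "Sofa", 0), ("A", "Armchair", 25),
    ("B", "Sofa", 3), ("B", "Armchair", 46),
    ("C", "Table", 1), ("C", "Stool", 10),
    ("D", "Armchair", 2), ("D", "Sofa", 30),
    ("E", "Table", 5),
]


def pattern_stats(events, first, second):
    """Support and interval statistics for the template first -> second."""
    by_sequence = defaultdict(list)
    for seq_id, item, day in events:
        by_sequence[seq_id].append((day, item))

    intervals = []
    for purchases in by_sequence.values():
        first_days = [d for d, item in purchases if item == first]
        if not first_days:
            continue
        start = min(first_days)
        later = [d for d, item in purchases if item == second and d > start]
        if later:
            intervals.append(min(later) - start)

    return {
        "cases": len(intervals),
        "support": len(intervals) / len(by_sequence),
        "min_interval": min(intervals) if intervals else None,
        "max_interval": max(intervals) if intervals else None,
        "avg_interval": sum(intervals) / len(intervals) if intervals else None,
    }


stats = pattern_stats(events, "Sofa", "Armchair")
print(stats)
```

In this data the template is observed in sequences A and B only; sequence D contains both items but in the opposite order, so it does not count.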

Let us review how this type of analysis is executed using the following data selection:

Contractor     | First purchase                                       | Second purchase   | Third purchase               | Interval
V.I. Bondarev  | Folding dining table; Round stool                    | COMFORT sofa      | COMFORT armchair             | 25 days, 31 days
I.P. Ivanov    | Jeans sofa; Jeans armchair                           | -                 | -                            | -
B.S. Petrov    | Kitchen table 0.9x1.7; "Summer" chair; COMFORT sofa  | COMFORT armchair  | -                            | 43 days
G.O. Sidorov   | Folding dining table; Square stool                   | -                 | -                            | -
V.K. Stepanov  | Folding dining table; Square stool; Dining table     | -                 | -                            | -
D.E. Fedorov   | Kitchen table 0.9x1.7; Comfort sofa                  | Wardrobe closet   | "Summer" chair; Round stool  | 58 days, 29 days

Data from the Contractor column determines which chain of events a row belongs to, i.e. it defines the sequence for the analysis. The Nomenclature column supplies the elements of the sequence.

To perform this analysis, use a code fragment similar to the one below:

&AtClient
Procedure SequentialPatterns(Command)
    Result = AnalysisSequentialPatterns();
EndProcedure

&AtServerNoContext
Function AnalysisSequentialPatterns()
    Analysis = New DataAnalysis;
    Analysis.AnalysisType = Type("DataAnalysisSequentialPatterns");

    Query = New Query;
    Query.Text = "
    |SELECT
    |    Sales.Contractor,
    |    Sales.Nomenclature,
    |    Sales.Period
    |FROM
    |    AccumulationRegister.Sales AS Sales";

    Analysis.DataSource = Query.Execute();

    // The Period column provides the time of each event.
    Analysis.ColumnsSetting.Period.ColumnType = DataAnalysisColumnTypeSequentialPatterns.Time;
    AnalysisResult = Analysis.Execute();

    // The report builder outputs the result to a spreadsheet document.
    Builder = New DataAnalysisReportBuilder();
    Builder.Template = Undefined;
    Builder.AnalysisType = Type("DataAnalysisSequentialPatterns");

    Spreadsheet = New SpreadsheetDocument;
    Builder.Put(AnalysisResult, Spreadsheet);
    Return Spreadsheet;
EndFunction

The Period column is assigned the Time type explicitly in the code (the type is not determined automatically).

Analysis parameters set by default:

Fig. 288. Analysis parameters

The following data has been obtained in the analysis:

Fig. 289. General analysis data

The number of elements is 12. This is also the number of nomenclature positions that occur in the data selection.

Two sequences have been found:

Fig. 290. Sequences found

The first sequence occurs in two cases out of five. Therefore, support is 40%. Since the sequence depth is 2, each of these intervals contains one value.

14.5. THE DECISION TREE ANALYSIS TYPE

This type of analysis can be used to obtain a cause-and-effect hierarchy of conditions that facilitates decision making. For instance, you can obtain a condition tree that helps (with a certain degree of probability) to understand the reasons behind the termination of agreements with customers and to define the conditions under which an agreement is likely to be signed. Company managers may be profile-oriented so as to serve different groups of customers, etc.

A schema of the Decision tree analysis procedure is shown in fig. 291.

From the point of view of this type of analysis, source columns may be divided into the following:

„ NotUsed

„ Input

„ Predictable

Analysis parameters used:

„ MinCaseCount – minimum number of items in a node

„ MaxDepth – maximum depth of the tree

„ SimplificationType – simplification type of the decision tree

The result of the analysis is as follows:

„ decision tree

„ classification errors

Fig. 291. Decision tree analysis execution diagram

Let us review how this type of analysis is executed with the following data selection:

Contractor       | Number of retail spots | Number of cars | Company age              | Agreement date           | Type of agreement  | Status of relations
ZAO Igor         | 1                      | 0              | Less than a year         | Less than a year         | Dealer             | Infringement of contract
ZAO TorgMebel    | 15                     | 4              | From three to ten years  | Less than a year         | Distributor        | Terminated by contractor
ZAO TorgMebel    | 1                      | 10             | From three to ten years  | From one to three years  | Distributor        | Terminated by contractor
ICP Dubrava      | 1                      | 1              | From one to three years  | Less than a year         | Dealer             | Terminated by contractor
Store 15         | 1                      | 1              | Over 10 years            | From three to ten years  | Long-term partner  | Not terminated
OOO Gross        | 3                      | 2              | Less than a year         | Less than a year         | Long-term partner  | Not terminated
OOO Intaris      | 7                      | 3              | From three to ten years  | From one to three years  | Long-term partner  | Terminated by contractor
OOO TorgTrest    | 2                      | 2              | Over 10 years            | From three to ten years  | Long-term partner  | Not terminated
PBOUL Kurochkin  | 0                      | 1              | Less than a year         | Less than a year         | Dealer             | Not terminated

To perform this analysis, use a code fragment similar to the one below:

&AtClient
Procedure DecisionTree(Command)
    Result = AnalysisDecisionTree();
EndProcedure

&AtServerNoContext
Function AnalysisDecisionTree()
    Analysis = New DataAnalysis;
    Analysis.AnalysisType = Type("DataAnalysisDecisionTree");

    // Only contractors from the "Legal entities" group are analyzed.
    Group = Catalogs.Contractors.FindByDescription("Legal entities");
    Query = New Query;
    Query.Text = "
    |SELECT
    |    Contractors.Ref,
    |    Contractors.RetailCount,
    |    Contractors.VehicleCount,
    |    Contractors.OrganizationPeriodOfWork,
    |    Contractors.AgreementDate,
    |    Contractors.ContractKind,
    |    Contractors.TerminationOfRelations
    |FROM
    |    Catalog.Contractors AS Contractors
    |WHERE
    |    (NOT Contractors.IsFolder AND Contractors.Parent = &Parent)";

    Query.SetParameter("Parent", Group);

    Analysis.DataSource = Query.Execute();

    // Build the full tree without simplification.
    Analysis.Parameters.SimplificationType.Value = DecisionTreeSimplificationType.DontSimplify;
    AnalysisResult = Analysis.Execute();

    // The report builder outputs the result to a spreadsheet document.
    Builder = New DataAnalysisReportBuilder();
    Builder.Template = Undefined;
    Builder.AnalysisType = Type("DataAnalysisDecisionTree");

    Spreadsheet = New SpreadsheetDocument;
    Builder.Put(AnalysisResult, Spreadsheet);
    Return Spreadsheet;
EndFunction

The following decision tree is the result of the analysis:

Fig. 292. Decision tree

This tree can be presented as the following schema:

Fig. 293. Schema view of the decision tree

Classification errors occur when the derived rules do not match the actual source data:

Fig. 294. Classification errors

For the data specified, the resulting classification contains no errors, i.e. every record in the actual selection matches its classification.

The example above uses the DontSimplify value of the SimplificationType analysis parameter, which is set programmatically. If the parameter is set to Simplify, the decision tree looks as follows:

Fig. 295. Decision tree

Tree simplification means that tree nodes are turned into leaves (redundant branching is pruned) using specific rules (or formulas, see below).

The following should be taken into consideration in deciding whether a node should be turned into a leaf:

• Errors – number of errors in a node

• ChildErrors – number of errors in child nodes

• Leaves – number of leaves in a node

• Cases – number of cases

The following condition must be met to turn a node into a leaf:

In this example, the condition is satisfied for the Company age nodes (0.5 < 1).

Fig. 296. Classification errors

For instance, in one case the real sample contained the value Terminated by contractor, while according to classification it should have been Not terminated, etc.
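Counting such mismatches can be sketched in Python as follows. This is only an illustration of the idea, not the way the platform builds its error table: the function name and the label lists are hypothetical and are not taken from the analysis result above.

```python
from collections import Counter

def classification_errors(actual, predicted):
    """Count (actual, predicted) pairs in which the classified value differs from the sample."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter((a, p) for a, p in zip(actual, predicted) if a != p)

# Hypothetical labels, used only to show the shape of the result.
actual = ["Terminated by contractor", "Not terminated", "Not terminated"]
predicted = ["Not terminated", "Not terminated", "Not terminated"]

errors = classification_errors(actual, predicted)
for (a, p), count in errors.items():
    print(f"actual={a!r}, classified as {p!r}: {count}")
```

An empty result corresponds to the case where the classification matches the source selection completely.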

14.6. THE CLUSTERIZATION ANALYSIS TYPE

Cluster analysis is a multidimensional mathematical analysis that uses multiple indicators characterizing a set of objects to group the objects into clusters, so that objects within one cluster are more similar to each other than to objects of other clusters.

This analysis is based on calculating the distance between objects. Objects are grouped into clusters according to these distances. The distance can be measured in several ways (different metrics can be used). The following metrics are supported:

• Euclidean

• Squared Euclidean

• City block

• Maximum

Once the distances between objects have been measured, one of several algorithms can be used to distribute the objects among clusters. The following clusterization methods are supported:

• Nearest neighbor

• Furthest neighbor

• K-means

• Centroid

The cluster analysis mechanism can be schematically presented as follows:

Fig. 297. Cluster analysis execution diagram

A data source is passed as input to the DataAnalysis object. A query result, a spreadsheet document, a cell area, or a value table can serve as the data source. Each source column is marked as either input or not used. Both values belong to the DataAnalysisColumnTypeClusterization system enumeration. The enumeration contains other values as well, but those are used for building forecasts.

Analysis is performed in accordance with the analysis parameters set.

Let us use the following code fragment as an example to illustrate how cluster analysis can be performed:

&OnClient
Procedure ClusterAnalysis(Command)
    // The server call returns a spreadsheet document with the analysis report.
    Result = AnalysisClusterization();
EndProcedure

&OnServerWithoutContext
Function AnalysisClusterization()
    Analysis = New DataAnalysis;
    Analysis.AnalysisType = Type("DataAnalysisClusterization");

    // Data source: detailed (non-folder) items of the "Legal entities" group.
    Group = Catalogs.Contractors.FindByDescription("Legal entities");
    Query = New Query;
    Query.Text = "
    |SELECT
    |    Contractors.Reference,
    |    Contractors.RetailCount,
    |    Contractors.VehicleCount,
    |    Contractors.OrganizationPeriodOfWork,
    |    Contractors.AgreementDate,
    |    Contractors.ContractKind,
    |    Contractors.TerminationOfRelations
    |FROM
    |    Catalog.Contractors AS Contractors
    |WHERE
    |    (NOT Contractors.IsFolder AND Contractors.Parent = &Parent)";

    Query.SetParameter("Parent", Group);

    Analysis.DataSource = Query.Execute();

    // Metric selection.
    Analysis.Parameters.DistanceMetric.Value =
        DataAnalysisDistanceMetricType.SquaredEuclidean;

    // Clusterization method selection.
    Analysis.Parameters.ClusterizationMethod.Value = ClusterizationMethod.KMean;

    AnalysisResult = Analysis.Execute();

    // Output the result to a spreadsheet document using the default template.
    Builder = New DataAnalysisReportBuilder();
    Builder.Template = Undefined;
    Builder.AnalysisType = Type("DataAnalysisClusterization");

    Spreadsheet = New SpreadsheetDocument;
    Builder.Output(AnalysisResult, Spreadsheet);

    Return Spreadsheet;
EndFunction

The query runs against the Contractors catalog. Its condition selects only detailed (non-folder) catalog items from the Legal entities group.

Executing this code sets the following initial analysis settings (some are set explicitly, others take default values).

Fig. 298. Analysis parameters

The set of columns is derived from the query selection fields. By default, all columns have equal weights. Columns of the Number and Date types get the Continuous data type, and columns of all other types get the Discrete data type. Column parameters can be changed as follows:

Analysis.ColumnsSetting.VehicleCount.AdditionalParameters.Weight = 2;

This line increases the weight of the VehicleCount column.

The data selection which will be analyzed is as follows:

| Contractor | Number of retail spots | Number of cars | Company age | Agreement date | Type of agreement | Status of relations |
|---|---|---|---|---|---|---|
| ZAO Igor | 1 | 0 | Less than a year | Less than a year | Dealer | Infringement of contract |
| ZAO TorgMebel | 15 | 4 | From three to ten years | Less than a year | Distributor | Terminated by contractor |
| ZAO TorgMebel | 1 | 10 | From three to ten years | From one to three years | Distributor | Terminated by contractor |
| ICP Dubrava | 1 | 1 | From one to three years | Less than a year | Dealer | Terminated by contractor |
| Store 15 | 1 | 1 | Over 10 years | From three to ten years | Long-term partner | Not terminated |
| OOO Gross | 3 | 2 | Less than a year | Less than a year | Long-term partner | Not terminated |
| OOO Intaris | 7 | 3 | From three to ten years | From one to three years | Long-term partner | Terminated by contractor |
| OOO TorgTrest | 2 | 2 | Over 10 years | From three to ten years | Long-term partner | Not terminated |
| PBOUL Kurochkin | 0 | 1 | Less than a year | Less than a year | Dealer | Not terminated |

The analysis result will be as follows:

Fig. 299. Cluster analysis result

Note that the analysis outputs data about the clusters found (their number, their centers, and the distances between them). It does not report which objects (in our case, contractors) belong to which cluster.

This is the behavior when the analysis parameters (the TableFillType parameter, in particular) are not set explicitly.

To see how the objects are distributed among the clusters, add the following line before running the analysis (but after setting its type):

Analysis.Parameters.TableFillType.Value = DataAnalysisResultTableFillType.UsedFields;

14.6.1. Metrics used

First of all, note that although the input columns in the example above were of the continuous type (for which the notion of distance is obvious), columns of the discrete type (references to catalog items, enumeration values, and so on) can also be used in the analysis.

We now review the metrics that can be used in a cluster analysis.

14.6.1.1. Euclidean

This metric calculates the distance between two objects using the following formula:

Where:

• Xi, Yi – attribute values of two objects (the distance between which we need to determine)

• Wi – weighting factor of an attribute (set in the analysis column)

• i – attribute number, from 1 to n

• n – number of attributes

For instance, objects are characterized by a property that is 9 for one object and 5 for another one. The weighting factor of this attribute is 1. The distance between the objects is as follows:

14.6.1.2. Squared Euclidean

This metric calculates the distance between two objects using the following formula:

Where:

• Xi, Yi – attribute values of two objects (the distance between which we need to determine)

• Wi – weighting factor of an attribute (set in the analysis column)

• i – attribute number, from 1 to n

• n – number of attributes

For instance, objects are characterized by a property that is 5 for one object and 3 for another one. The weighting factor of this attribute is 2. The distance between the objects is as follows:
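The Euclidean and squared Euclidean metrics can be sketched in Python as follows. This is an illustration rather than the platform's implementation: it assumes that each weighting factor multiplies the squared difference of its attribute (one common convention), and the function names are hypothetical.

```python
import math

def squared_euclidean(x, y, w):
    # Assumption: the weight Wi multiplies the squared difference (Xi - Yi)^2.
    return sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w))

def euclidean(x, y, w):
    # Square root of the weighted sum of squared differences.
    return math.sqrt(squared_euclidean(x, y, w))

# One attribute with values 9 and 5, weighting factor 1.
print(euclidean([9], [5], [1]))          # 4.0

# One attribute with values 5 and 3, weighting factor 2.
print(squared_euclidean([5], [3], [2]))  # 8 under the assumption above
```

With a weighting factor of 1, the first example gives 4 whether the weight is applied before or after squaring. The value 8 for the second example holds only under the stated assumption.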

14.6.1.3. City block

This metric calculates the distance between two objects using the following formula:

Where:

• Xi, Yi – attribute values of two objects (the distance between which we need to determine)

• Wi – weighting factor of an attribute (set in the analysis column)

• i – attribute number, from 1 to n

• n – number of attributes

For instance, objects are characterized by two attributes that have values of 3 and 5, 7 and 3. The weighting factor of the first one is 2, of the second – 1:

Fig. 300. Object characteristics

14.6.1.4. Maximum

This metric calculates the distance between two objects using the following formula:

Where:

• Xi, Yi – attribute values of two objects (the distance between which we need to determine)

• Wi – weighting factor of an attribute (set in the analysis column)

• i – attribute number, from 1 to n

• n – number of attributes

For instance, objects are characterized by two attributes that have values of 3 and 5, 7 and 3. The weighting factor of the first one is 2, of the second – 1 (see fig. 300):
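The city block and maximum metrics can be sketched in Python in the same spirit. The sketch rests on two assumptions that the text does not confirm: each weighting factor multiplies the absolute difference of its attribute, and the example is read as "the first attribute has values 3 and 5, the second has values 7 and 3". The function and variable names are hypothetical.

```python
def city_block(x, y, w):
    # Weighted sum of absolute differences over all attributes.
    return sum(wi * abs(xi - yi) for xi, yi, wi in zip(x, y, w))

def maximum(x, y, w):
    # Largest weighted absolute difference among the attributes.
    return max(wi * abs(xi - yi) for xi, yi, wi in zip(x, y, w))

# Assumed reading of the example: attribute 1 is 3 vs 5, attribute 2 is 7 vs 3.
first_object = [3, 7]
second_object = [5, 3]
weights = [2, 1]

print(city_block(first_object, second_object, weights))  # 2*2 + 1*4 = 8
print(maximum(first_object, second_object, weights))     # max(4, 4) = 4
```

Under a different reading of the example values, the same functions give different numbers.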

14.6.2. Clusterization methods

The clusterization method determines the principle by which an object is assigned to a group and the algorithm used to form clusters.

Any clusterization algorithm aims at the following:

• Minimizing variability within clusters

• Maximizing variability between clusters

The differences between the methods are examined below using the objects shown in fig. 301.

Let us imagine that objects are distributed into two groups. The first one includes objects 1, 2, and 3. The second group consists of objects 4, 5, and 6.

Fig. 301. Groups of objects

14.6.2.1. Nearest neighbor

This method adds an object to the group whose nearest member is closest to the object.

In this example, object 7 joins the group that includes object 4. The members of the two groups closest to object 7 are objects 3 and 4, and the distance to object 4 is the smaller of the two.

14.6.2.2. Furthest neighbor

This method adds an object to the group whose furthest member is closest to the object.

In this example, object 7 joins the group that includes object 5. The members of the two groups furthest from object 7 are objects 1 and 5, and the distance to object 5 is the smaller of the two.

14.6.2.3. Centroid

This method adds an object to the group whose centroid is closest to the object.

Fig. 302. Groups of objects

In the example shown in the figure, object 7 is added to the group that contains objects 4, 5, and 6, because the distance to its centroid (an imaginary object with average attribute values) is the smallest.
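The three assignment rules described above (nearest neighbor, furthest neighbor, centroid) can be compared with a short Python sketch. It is not the platform's implementation: the coordinates are hypothetical (they are not those of fig. 301 or fig. 302), plain unweighted Euclidean distance is assumed, and all names are made up for the illustration.

```python
import math

def distance(a, b):
    """Plain Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(obj, groups):
    # Choose the group whose closest member is nearest to the object.
    return min(groups, key=lambda g: min(distance(obj, p) for p in groups[g]))

def furthest_neighbor(obj, groups):
    # Choose the group whose most distant member is nearest to the object.
    return min(groups, key=lambda g: max(distance(obj, p) for p in groups[g]))

def centroid(points):
    # Imaginary object with average attribute values.
    return [sum(c) / len(points) for c in zip(*points)]

def centroid_method(obj, groups):
    # Choose the group whose centroid is nearest to the object.
    return min(groups, key=lambda g: distance(obj, centroid(groups[g])))

# Hypothetical data: a wide first group and a compact second group.
groups = {"first": [(0, 0), (4, 0)], "second": [(6, 0), (7, 0)]}
new_object = (4.5, 0)

print(nearest_neighbor(new_object, groups))   # first
print(furthest_neighbor(new_object, groups))  # second
print(centroid_method(new_object, groups))    # second
```

The data is chosen so that the rules disagree: the new object lies next to one member of the wide group, but far from that group's other member and from its centroid.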

14.6.2.4. K-means

This method takes the objects that come first in the selection and treats them as the initial cluster centers. Then the next object is selected and assigned to one of the clusters based on its distance to the cluster centers. The center of the cluster to which the object has been added is recalculated.

The procedure continues until all objects have been processed. After that, the objects are selected again (starting from the first one). The procedure is repeated until the cluster centers stop changing.

Fig. 303. Sample objects layout

For instance, objects 1 and 2 have been arbitrarily selected as cluster centers. Object 3 is added to the cluster with a center in object 1. The center of the first cluster is recalculated (it is between object 1 and 3). Object 4 is added to the second cluster (its center is also recalculated).

After all objects under analysis have been processed, objects 1 and 3 belong to cluster 1, while the second cluster includes all other objects (with its center presumably inside the triangle formed by objects 4, 6, and 7).

Then the objects are selected and distributed between the clusters again (cluster centers are recalculated each time).

In the third pass over the objects, object 2, which was initially the center of the second cluster, will probably move to the first cluster.

At the end of algorithm execution, cluster 1 will include objects 1, 2, and 3.

The second cluster will include objects 4, 5, 6, and 7.
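The procedure described in this section can be sketched in Python as follows. It is a minimal illustration under stated assumptions, not the platform's implementation: unweighted Euclidean distance is used, the cluster that loses an object is recalculated as well, the pass limit is arbitrary, and the coordinates are hypothetical (they are not taken from fig. 303).

```python
import math

def distance(a, b):
    """Plain Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(objects, k, max_passes=100):
    # The first k objects of the selection serve as the initial cluster centers.
    centers = [list(obj) for obj in objects[:k]]
    assignment = [None] * len(objects)

    def recalculate(cluster):
        members = [objects[j] for j, a in enumerate(assignment) if a == cluster]
        if members:  # a cluster left without members keeps its last center
            centers[cluster] = [sum(v) / len(members) for v in zip(*members)]

    for _ in range(max_passes):
        before = [c[:] for c in centers]
        for i, obj in enumerate(objects):
            old = assignment[i]
            new = min(range(k), key=lambda c: distance(obj, centers[c]))
            assignment[i] = new
            recalculate(new)           # center of the cluster that received the object
            if old is not None and old != new:
                recalculate(old)       # assumption: also update the cluster it left
        if centers == before:          # stop once a full pass changes no center
            break
    return assignment, centers

# Hypothetical coordinates for objects 1..7.
objects = [(0, 0), (2, 1), (1, 0), (8, 8), (9, 8), (8, 9), (9, 9)]
assignment, centers = k_means(objects, k=2)
print(assignment)  # [0, 0, 0, 1, 1, 1, 1]
print(centers)
```

With this data, object 2 starts as the center of the second cluster and moves to the first cluster in the second pass, so the final clusters are objects 1 to 3 and objects 4 to 7.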

14.6.2.5. Data output to a dendrogram

If a clusterization method other than K-means is used, the results of the cluster analysis are output as a dendrogram (the analysis algorithm should support outputting the cluster distribution of the objects under analysis):

Fig. 304. Dendrogram
