Parameter Description
Parameters
:param correlation : {'PearsonR(+)','PearsonR(-)','MIC','R2'}, default 'PearsonR(+)'.
Methods:
* PearsonR: (+)(-). for linear relationships.
* MIC for non-linear relationships.
* R2 for non-linear relationships.
:param tolerance_list: constraints imposed on features, default is null
a two-dimensional list, viz., [['feature_name1',tol_1],['feature_name2',tol_2],...]
'feature_name1', 'feature_name2' (string) are names of input features;
tol_1, tol_2 (float, between 0 and 1) are the tolerance ratios of those features;
the variation of a feature's values on each leaf must stay within its tolerance;
if tol_1 = 0, the value of feature 'feature_name1' must be constant on each leaf;
if tol_1 = 1, there is no constraint on the value of feature 'feature_name1';
example: tolerance_list = [['feature_name1',0.2],['feature_name2',0.1]].
:param minsize : an integer (default=3), the minimum number of unique values of the linear features required on each leaf.
:param threshold : a float (default=0.9), less than or equal to 1, default 0.95 for PearsonR.
The smallest correlation index allowed in your research while the dataset is being divided.
To avoid overfitting, threshold = 0.5 is suggested for MIC.
:param mininc : Minimum expected gain of objective function (default=0.01)
:param split_tol : a float (default=0.8), the values of constrained features should be narrowed by a minimum ratio of split_tol along the split path
:param gplearn : Whether to call the embedded gplearn package of TCLR to regress formula (default=False).
:param gpl_dummyfea: dummy features in gplearn regression, default is null
a one-dimensional list, viz., ['feature_name1','feature_name2',...]
dummy features : 'feature_name1','feature_name2',... are not used in the gplearn regression
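As a quick reference, the TCLR-specific defaults described above can be collected in a plain dictionary. This is only a sketch: the names `tclr_params` and `check_tclr_params` are hypothetical and not part of TCLR, `None` stands in for the documented "null" default, 0.9 is used for threshold, and how the values are handed to TCLR is not shown here.

```python
# Hypothetical container for the documented defaults (not part of TCLR).
tclr_params = {
    "correlation": "PearsonR(+)",
    "tolerance_list": None,   # "null" in the description above
    "minsize": 3,
    "threshold": 0.9,
    "mininc": 0.01,
    "split_tol": 0.8,
    "gplearn": False,
    "gpl_dummyfea": None,     # "null" in the description above
}

ALLOWED_CORRELATIONS = {"PearsonR(+)", "PearsonR(-)", "MIC", "R2"}


def check_tclr_params(params):
    """Illustrative sanity checks derived from the parameter descriptions."""
    problems = []
    if params["correlation"] not in ALLOWED_CORRELATIONS:
        problems.append("unknown correlation")
    if params["threshold"] > 1:
        problems.append("threshold must be <= 1")
    if not isinstance(params["minsize"], int):
        problems.append("minsize must be an integer")
    # tolerance_list is a list of [feature_name, tolerance] pairs, or None.
    for name, tol in params["tolerance_list"] or []:
        if not 0 <= tol <= 1:
            problems.append(f"tolerance for {name!r} must be in [0, 1]")
    return problems
```

An empty list returned by `check_tclr_params` means none of these simple checks failed; it does not guarantee that TCLR itself accepts the values.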
________________________________________________________________
params defined in package gplearn
________________________________________________________________
:param population_size : integer, optional (default=500), the number of programs in each generation.
:param generations : integer, optional (default=100), the number of generations to evolve.
:param verbose : int, optional (default=0). Controls the verbosity of the evolution building process.
:param metric : str, optional (default='mean absolute error')
The name of the raw fitness metric. Available options include:
- 'mean absolute error'.
- 'mse' for mean squared error.
- 'rmse' for root mean squared error.
- 'pearson', for Pearson's product-moment correlation coefficient.
- 'spearman' for Spearman's rank-order correlation coefficient.
:param function_set : iterable, optional (default=['add', 'sub', 'mul', 'div', 'log', 'sqrt',
'abs', 'neg','inv','sin','cos','tan', 'max', 'min'])
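The gplearn options above can likewise be gathered into one dictionary. This is a sketch using the default values stated in this section; the names `gplearn_params` and `VALID_METRICS` are hypothetical, and forwarding the dictionary directly to gplearn's `SymbolicRegressor` is an assumption about usage, not a description of what TCLR does internally.

```python
# Keyword arguments mirroring the gplearn options listed above.
gplearn_params = {
    "population_size": 500,
    "generations": 100,
    "verbose": 0,
    "metric": "mean absolute error",
    "function_set": ["add", "sub", "mul", "div", "log", "sqrt",
                     "abs", "neg", "inv", "sin", "cos", "tan", "max", "min"],
}

# Metric names listed in this section.
VALID_METRICS = {"mean absolute error", "mse", "rmse", "pearson", "spearman"}
metric_is_valid = gplearn_params["metric"] in VALID_METRICS

# If gplearn is installed, the dictionary could be unpacked like this
# (left commented out so the sketch runs without gplearn):
# from gplearn.genetic import SymbolicRegressor
# est = SymbolicRegressor(**gplearn_params)
```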
Notes
correlation :
Defines the relationship between variables. In this example, ln(W) vs. ln(t) --> n is linear, and n>0, so set correlation = 'PearsonR(+)' to capture the positive linear relationship.
correlation options defined in TCLR: {'PearsonR(+)','PearsonR(-)','MIC','R2'} :
- 'PearsonR(+)' : captures a positive linear relationship
- 'PearsonR(-)' : captures a negative linear relationship
- 'MIC' : captures a non-linear relationship
- 'R2' : captures a non-linear relationship
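To make the choice between 'PearsonR(+)' and 'PearsonR(-)' concrete, the sketch below fits the slope n of ln(W) against ln(t) on synthetic data and picks the option from the sign of n. The data (W = 2 * t**0.5) and the helper `fit_slope` are made up for illustration and are not taken from TCLR.

```python
import math


def fit_slope(x, y):
    """Least-squares slope of y against x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Synthetic data following W = k * t**n with k = 2 and n = 0.5 (illustrative only).
t = [1.0, 2.0, 4.0, 8.0, 16.0]
W = [2.0 * ti ** 0.5 for ti in t]

# On a log-log scale the power law becomes a straight line with slope n.
n = fit_slope([math.log(ti) for ti in t], [math.log(wi) for wi in W])

# A positive slope calls for the positive linear option.
correlation = "PearsonR(+)" if n > 0 else "PearsonR(-)"
```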
tolerance_list :
Defines the fluctuation range of each constrained variable. Pass it in the form [['feature_name1',tol_1],['feature_name2',tol_2],...].
e.g., tolerance_list = [['Fe',0.2],['Cr',0.1]]
'feature_name1' and 'feature_name2' must match the variable names in the input file.
tol_1 = 0, i.e., on each leaf of TCLR, the variable 'feature_name1' has the same value, viz., 'feature_name1' is a constant.
tol_1 = 1, i.e., on each leaf of TCLR, the variable 'feature_name1' can have different values, viz., there is no restriction on the value of 'feature_name1'.
tolerance_list is set in the hope of making the results obtained by TCLR more interpretable for specific problems.
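One possible reading of the tolerance ratio is sketched below. ASSUMPTION: the ratio is interpreted as the spread of a feature on a leaf divided by its spread over the whole dataset; the text does not define this precisely, and TCLR's internal rule may differ. The helper `within_tolerance` and the 'Fe' values are hypothetical.

```python
def within_tolerance(leaf_values, all_values, tol):
    """Return True if the leaf's spread is at most tol times the full spread.

    ASSUMPTION: tolerance ratio = (max - min on the leaf) /
    (max - min over the whole dataset). TCLR may define it differently.
    """
    full_range = max(all_values) - min(all_values)
    if full_range == 0:
        return True  # the feature is constant everywhere
    leaf_range = max(leaf_values) - min(leaf_values)
    return leaf_range / full_range <= tol


fe_all = [0.0, 1.0, 2.0, 5.0, 10.0]   # made-up 'Fe' values for the full dataset
fe_leaf = [1.0, 2.0]                  # made-up 'Fe' values on one leaf

ok_at_02 = within_tolerance(fe_leaf, fe_all, 0.2)   # leaf spread is 10% of full spread
ok_at_0 = within_tolerance(fe_leaf, fe_all, 0)      # tol = 0 demands a constant value
```

Under this reading, tol = 0 accepts only leaves where the feature is constant, and tol = 1 accepts every leaf, which matches the two limiting cases described above.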
minsize :
The minsize parameter sets the minimum amount of data on each leaf, i.e., the minimum number of data points required on a straight line in Figure 2 (Introduction), which plots the natural logarithm of the weight, ln(W), against the natural logarithm of the time, ln(t). Because TCLR obtains the slope n from ln(W) vs. ln(t), minsize limits how few data points a line may contain so that n can be estimated accurately. The default is minsize = 3, meaning that at least three different values of ln(t) are required on a straight line.
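The minsize rule can be illustrated with a small check that counts distinct ln(t) values on a leaf. The helper `enough_points` is hypothetical and only mirrors the description; it is not TCLR's implementation.

```python
def enough_points(ln_t_values, minsize=3):
    """True if the leaf holds at least `minsize` distinct ln(t) values."""
    # Repeated measurements at the same time count only once.
    return len(set(ln_t_values)) >= minsize


leaf_a = [0.0, 0.0, 0.7, 1.4]   # three distinct ln(t) values
leaf_b = [0.0, 0.0, 0.7]        # only two distinct ln(t) values
```

With the default minsize = 3, `leaf_a` would qualify for a slope estimate while `leaf_b` would not, even though both contain at least three rows.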
threshold :
TCLR uses the threshold parameter to ensure that the data on each leaf are consistently described by the same time exponent n.
When threshold = 1, all the data on a leaf must fall strictly on a straight line; even a slight deviation from the line counts as inconsistent.
In practice the data are noisy and can rarely be fitted perfectly by a straight line, so the default is threshold = 0.9. This allows some deviation from the straight line while still keeping the time exponent consistent.
Setting threshold too high, such as 0.99 or 0.98, can be problematic: a large share of the data may fail such a strict condition, so TCLR cannot separate the data reasonably, which can lead to poor performance and inaccurate results.
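The effect of threshold can be sketched for the 'PearsonR(+)' case by comparing the Pearson coefficient of a leaf with the threshold. ASSUMPTION: the leaf passes when r >= threshold; the helpers `pearson_r` and `leaf_is_consistent` and the noisy data are made up for illustration and are not TCLR's code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def leaf_is_consistent(ln_t, ln_W, threshold=0.9):
    """ASSUMED rule for 'PearsonR(+)': r on the leaf must reach the threshold."""
    return pearson_r(ln_t, ln_W) >= threshold


ln_t = [0.0, 1.0, 2.0, 3.0, 4.0]
ln_W_noisy = [0.1, 0.4, 1.2, 1.4, 2.1]   # made-up noisy points near a line

passes_default = leaf_is_consistent(ln_t, ln_W_noisy)                 # threshold = 0.9
passes_strict = leaf_is_consistent(ln_t, ln_W_noisy, threshold=0.99)  # much stricter
```

The same noisy leaf is accepted at the default threshold but rejected at 0.99, which is the situation the paragraph above warns about.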
mininc :
During the branching process, TCLR only allows a split whose gain reaches a minimum value, so that branching happens only when the linear gain improves significantly. TCLR grows its branches according to the principle of linear gain: see TCLR.
In each split, the linear gain must increase by at least mininc; the default is mininc = 0.01.
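The role of mininc can be sketched as a simple acceptance test on the gain before and after a candidate split. The linear gain itself is defined in the TCLR principle and is not reproduced here; the helper `accept_split` and the numbers are hypothetical.

```python
def accept_split(gain_before, gain_after, mininc=0.01):
    """Accept a split only if the gain increases by at least mininc."""
    return gain_after - gain_before >= mininc


# Made-up gain values for two candidate splits of the same node.
clear_improvement = accept_split(0.90, 0.92)    # gain rises by about 0.02
marginal_change = accept_split(0.90, 0.905)     # gain rises by about 0.005
```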
split_tol :
split_tol controls the rate at which feature values are brought together during branching, and so constrains how TCLR grows. It sets a minimum convergence rate.
As TCLR grows, the data are divided into smaller subsets of the feature space, and within each smaller subset the data are better described by a single time exponent n. The similarity between the data in a subset is expected to increase; in other words, samples with similar compositions and test conditions should be divided into the same subset.
The default is split_tol = 0.8, which means that after each split the range of the feature values is reduced to 80% of the previous range. Controlling this rate helps keep the resulting subsets meaningful and useful for the intended application.
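A minimal sketch of the split_tol idea follows. ASSUMPTION: a split is read as narrow enough when the feature range on the child is no larger than split_tol times the range on the parent; the helper `narrows_enough` and the values are hypothetical, and TCLR's exact criterion may differ.

```python
def narrows_enough(parent_values, child_values, split_tol=0.8):
    """True if the child's range is at most split_tol times the parent's range."""
    parent_range = max(parent_values) - min(parent_values)
    child_range = max(child_values) - min(child_values)
    return child_range <= split_tol * parent_range


parent = [0.0, 2.0, 5.0, 7.0, 10.0]   # made-up feature values on the parent node
child_ok = [0.0, 2.0, 5.0, 7.0]       # range 7, i.e. 70% of the parent range
child_wide = [0.0, 5.0, 9.0]          # range 9, i.e. 90% of the parent range
```

With split_tol = 0.8, `child_ok` satisfies the check and `child_wide` does not.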
gplearn :
The final goal of TCLR is to obtain the hidden functional relationship between the target feature and the response, viz., formula 3 (Introduction).
If gplearn=True, TCLR calls the symbolic regression package gplearn and returns the specific form of formula 3 based on the TCLR results. Default: gplearn=False.
gpl_dummyfea :
Pass it in the form ['feature_name1','feature_name2',...].
e.g., gpl_dummyfea = ['Fe','Cr']
Default gpl_dummyfea=null. If passed in, the features 'feature_name1','feature_name2',... will not be used in the gplearn regression.
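The effect of gpl_dummyfea can be pictured as filtering the feature list before the regression step. The helper `regression_features` and the feature names are hypothetical; how TCLR drops the features internally is not described here.

```python
def regression_features(all_features, gpl_dummyfea=None):
    """Return the features left after removing the dummy ones."""
    dummy = set(gpl_dummyfea or [])   # None (null) means nothing is excluded
    return [name for name in all_features if name not in dummy]


features = ["Fe", "Cr", "T", "t"]   # made-up feature names
kept = regression_features(features, gpl_dummyfea=["Fe", "Cr"])
```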
Others :
See the gplearn documentation.