DolphinDB Machine Learning Functions: Built-in ML Capabilities (Study Notes)

These are study notes that collect the key points I came across along with my own understanding.

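Abstract

This article walks through DolphinDB's built-in machine learning functions. It covers regression, classification, clustering, time series forecasting, feature engineering, and model evaluation, and illustrates each topic with code examples.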

I. Machine Learning Overview

1.1 DolphinDB ML Capabilities

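DolphinDB ML at a glance:

Regression analysis: linear regression

Classification models: logistic regression

Clustering algorithms: K-Means

Time series: ARIMA

Characteristics: built-in functions, vectorized acceleration, distributed computing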

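1.2 Built-in ML Functions

Category        Function            Description
Regression      ols                 Ordinary least squares regression
Classification  logisticRegression  Logistic regression classification
Clustering      kmeans              K-Means clustering
Forecasting     arima               ARIMA time series forecasting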

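1.3 Use Cases

Scenario                Description
Predictive maintenance  Equipment failure prediction
Quality control         Quality prediction and analysis
Energy forecasting      Energy consumption trend prediction
Anomaly detection       Identifying abnormal data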

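II. Regression Analysis

2.1 Linear Regression

// Create sample data
n = 1000
x1 = rand(10.0, n)
x2 = rand(20.0, n)
// rand(2.0, n) - 1.0 draws uniform noise in [-1, 1); the .. operator only builds integer sequences
y = 2 * x1 + 3 * x2 + rand(2.0, n) - 1.0

t = table(x1, x2, y)

// Linear regression; in its default mode ols returns the coefficient vector
result = ols(y, [x1, x2])

// Inspect the result
result

// Reading the coefficients:
// result[0]: intercept (close to 0)
// result[1]: coefficient of x1 (close to 2)
// result[2]: coefficient of x2 (close to 3)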
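The coefficient vector alone does not say how reliable the fit is. As a minimal sketch, and assuming the fourth argument of ols (mode) behaves as I understand it from the function reference, mode 2 should return a dictionary of report tables instead of the bare vector. The key names Coefficient and RegressionStat are my reading of that reference, so list report.keys() on your own version before relying on them.

```dolphindb
// Self-contained sample data, same shape as in section 2.1
n = 1000
x1 = rand(10.0, n)
x2 = rand(20.0, n)
y = 2 * x1 + 3 * x2 + rand(2.0, n) - 1.0

// Arguments: Y, X, intercept flag, mode
// Assumption: mode 2 returns a dictionary of report tables
report = ols(y, [x1, x2], true, 2)

// List the available keys first, since names may differ between versions
report.keys()

// Assumed keys: per-factor estimates and summary statistics
report.Coefficient
report.RegressionStat
```

If the keys differ on your server, the default mode shown in section 2.1 still gives the coefficients.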

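2.2 Regression Prediction

// Predict with the fitted regression model
// Create new data
newX1 = rand(10.0, 100)
newX2 = rand(20.0, 100)

// Predict: result[0] is the intercept, result[1] and result[2] are the slopes
predictions = result[0] +
              result[1] * newX1 +
              result[2] * newX2

// Or use matrix multiplication; newX is 100 x 2 with one column per feature,
// so the product is a 100 x 1 matrix
newX = matrix(newX1, newX2)
predictions = newX ** result[1:3] + result[0]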

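2.3 Polynomial Regression

// Polynomial regression
x = rand(10.0, 1000)
y = 2 * x + 3 * x * x + rand(2.0, 1000) - 1.0

// Build the polynomial feature
x2 = x * x

// The model is still linear in its coefficients, so ols applies unchanged
result = ols(y, [x, x2])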

2.4 Regression Evaluation

// Regression evaluation metrics
def evaluateRegression(actual, predicted) {
    // R²; pow is used for squaring because ^ is the bitwise XOR operator in DolphinDB
    ssRes = sum(pow(actual - predicted, 2))
    ssTot = sum(pow(actual - avg(actual), 2))
    r2 = 1 - ssRes / ssTot

    // RMSE
    rmse = sqrt(avg(pow(actual - predicted, 2)))

    // MAE
    mae = avg(abs(actual - predicted))

    // dict takes a vector of keys and a vector of values
    return dict(["R2", "RMSE", "MAE"], [r2, rmse, mae])
}

// Usage: evaluate the polynomial fit from section 2.3
predictions = result[0] + result[1] * x + result[2] * x2
evaluateRegression(y, predictions)
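Scoring a model on the same rows it was fitted on tends to flatter it. As a rough sketch of a holdout check (the 800/200 split and the names trainIdx, testIdx, coef, and testRmse are my own choices for illustration), fit on the first part of the data and measure the error on the rest:

```dolphindb
// Self-contained data with the same shape as section 2.3
n = 1000
x = rand(10.0, n)
x2 = x * x
y = 2 * x + 3 * x2 + rand(2.0, n) - 1.0

// The rows are generated in random order, so a positional split is enough here
trainIdx = 0..799
testIdx = 800..999

// Fit on the training rows only; coef[0] is the intercept
coef = ols(y[trainIdx], [x[trainIdx], x2[trainIdx]])

// Predict and score on the held-out rows
testPred = coef[0] + coef[1] * x[testIdx] + coef[2] * x2[testIdx]
testRmse = sqrt(avg(pow(y[testIdx] - testPred, 2)))
testRmse
```

With real data that has a time order or grouping, a positional split like this may leak information, so the split rule should follow the data.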

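III. Classification Models

3.1 Logistic Regression

// Create classification data
n = 1000
x1 = rand(10.0, n)
x2 = rand(10.0, n)
y = iif(x1 + x2 > 10, 1, 0)

t = table(x1, x2, y)

// Logistic regression: the function takes a data source plus the label and feature column names
result = logisticRegression(sqlDS(<select * from t>), `y, `x1`x2)

// Inspect the fitted model (a dictionary)
result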

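3.2 Classification Prediction

// Score the training table with the fitted model
score = predict(result, t)

// Predicted class; the 0.5 cutoff yields the same labels whether predict
// returns probabilities or 0/1 labels, so check the output on your version
predicted = iif(score > 0.5, 1, 0)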

3.3 Classification Evaluation

// Classification evaluation metrics
def evaluateClassification(actual, predicted) {
    // Confusion matrix
    tp = sum((actual == 1) && (predicted == 1))
    tn = sum((actual == 0) && (predicted == 0))
    fp = sum((actual == 0) && (predicted == 1))
    fn = sum((actual == 1) && (predicted == 0))

    // Accuracy; \ is the ratio operator, while / on integers is integer division
    accuracy = (tp + tn) \ (tp + tn + fp + fn)

    // Precision (NULL when nothing is predicted positive)
    precision = tp \ (tp + fp)

    // Recall
    recall = tp \ (tp + fn)

    // F1 score
    f1 = 2 * precision * recall / (precision + recall)

    // dict takes a vector of keys and a vector of values
    return dict(["Accuracy", "Precision", "Recall", "F1"],
                [accuracy, precision, recall, f1])
}

// Usage
evaluateClassification(y, predicted)
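To make the four confusion-matrix counts concrete, here is a tiny hand-checkable example. The two label vectors are made up for illustration. It also shows why the ratio operator \ matters: on integer counts, / performs integer division.

```dolphindb
// Eight hand-made labels: 3 true positives, 3 true negatives,
// 1 false positive (position 4), 1 false negative (position 1)
actual    = 1 1 1 0 0 0 0 1
predicted = 1 0 1 0 1 0 0 1

tp = sum((actual == 1) && (predicted == 1))
tn = sum((actual == 0) && (predicted == 0))
fp = sum((actual == 0) && (predicted == 1))
fn = sum((actual == 1) && (predicted == 0))

// Ratio operator: floating-point result
accuracy = (tp + tn) \ (tp + tn + fp + fn)
precision = tp \ (tp + fp)
recall = tp \ (tp + fn)

// Integer division on the same counts truncates to 0
truncated = (tp + tn) / (tp + tn + fp + fn)
```

Here accuracy, precision, and recall all come out to 0.75, while truncated is 0.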

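IV. Cluster Analysis

4.1 K-Means Clustering

// Create three separated groups of 100 points, centered near 2.5, 12.5, and 22.5
n = 300
x1 = rand(5.0, 100) join (10 + rand(5.0, 100)) join (20 + rand(5.0, 100))
x2 = rand(5.0, 100) join (10 + rand(5.0, 100)) join (20 + rand(5.0, 100))

// K-Means clustering; kmeans takes a table with one row per observation
result = kmeans(table(x1, x2), 3)

// Inspect the result (a dictionary; result.keys() lists the available entries)
result

// Cluster centers
result.centers

// Cluster labels
result.labels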

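4.2 Cluster Visualization

// Attach the cluster labels to the data
t = table(x1, x2, result.labels as cluster)

// Per-cluster statistics; the group by column is included in the output automatically
select count(*) as cnt,
       avg(x1) as avg_x1,
       avg(x2) as avg_x2
from t
group by cluster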

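4.3 Cluster Evaluation

// Cluster evaluation: silhouette coefficient
// data is a numeric matrix with one row per point, e.g. matrix(x1, x2)

// Euclidean distance from point i to every point
def distToPoint(data, i) {
    d = take(0.0, data.rows())
    for (j in 0:data.cols()) {
        col = data[j]
        d = d + pow(col - col[i], 2)
    }
    return sqrt(d)
}

def silhouetteScore(data, labels) {
    n = data.rows()
    scores = take(0.0, n)
    clusters = distinct(labels)

    // 0:n iterates over 0 to n-1
    for (i in 0:n) {
        d = distToPoint(data, i)
        same = labels == labels[i]
        cnt = sum(same)
        // A point alone in its cluster keeps a score of 0
        if (cnt > 1) {
            // a: mean distance to the other points in the same cluster
            a = sum(d[same]) \ (cnt - 1)

            // b: smallest mean distance to the points of any other cluster
            means = array(DOUBLE, 0)
            for (c in clusters[clusters != labels[i]]) {
                means.append!(avg(d[labels == c]))
            }
            b = min(means)

            scores[i] = (b - a) \ max([a, b])
        }
    }

    return avg(scores)
}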

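V. Time Series Forecasting

5.1 ARIMA Model

// Create time series data: linear trend + period-12 seasonality + uniform noise in [-5, 5)
n = 200
t = table(
    1..n as time,
    100 + 0.1 * (1..n) + 10 * sin(2 * pi * (1..n) / 12) + rand(10.0, n) - 5.0 as value
)

// Fit ARIMA(1,1,1). The arima call is kept as written in these notes;
// confirm its name and signature in the function reference for your DolphinDB version
result = arima(t.value, 1, 1, 1)

// Inspect the result
result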
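The middle 1 in ARIMA(1,1,1) is the differencing order: the model works on the series of period-to-period changes rather than on the raw values. A minimal sketch of what first-order differencing does to a trending series, using the built-in deltas function (the series below is a simplified stand-in without the seasonal term, and the variable names are mine):

```dolphindb
// Linear trend with slope 0.1 plus uniform noise in [-5, 5)
n = 200
value = 100 + 0.1 * (1..n) + rand(10.0, n) - 5.0

// First difference: diff1[i] = value[i] - value[i-1]; the first element is NULL
diff1 = deltas(value)

// The differenced series no longer trends; its mean estimates the slope
meanChange = avg(diff1)
meanChange
```

The mean of the differences equals (last value - first value) / (n - 1), so it lands near the slope of 0.1 with some noise around it.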

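5.2 Forecasting Future Values

// Forecast future values (arimaForecast is likewise kept as written; verify it exists in your version)
forecastSteps = 10
forecast = arimaForecast(result, forecastSteps)

// Forecast output
print("Forecast for the next " + string(forecastSteps) + " periods:")
print(forecast)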

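5.3 Time Series Decomposition

// Time series decomposition
// Trend: 12-period moving average (the first 11 values are NULL)
trend = mavg(t.value, 12)

// Seasonality: mean of the detrended series at each position in the 12-period cycle
detrended = t.value - trend
phase = (t.time - 1) % 12
seasonal = contextby(avg, detrended, phase)

// Residual
residual = t.value - trend - seasonal

// Result
table(t.time as time, t.value as value, trend, seasonal, residual)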

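VI. Feature Engineering

6.1 Feature Scaling

// Feature scaling
// Min-max normalization to [0, 1]; \ keeps the result floating-point even for integer input
def normalize(data) {
    return (data - min(data)) \ (max(data) - min(data))
}

// Standardization to zero mean and unit standard deviation
def standardize(data) {
    return (data - avg(data)) \ std(data)
}

// Usage
x = rand(100.0, 1000)
normalize(x)
standardize(x)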
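One detail worth noting for myself: when a model is later applied to new data, the scaling parameters should come from the training data, not be recomputed on the new batch. A small sketch under that assumption (train, newData, mu, and sigma are illustrative names):

```dolphindb
// Training data and a later batch of new data
train = rand(100.0, 1000)
newData = rand(100.0, 10)

// Learn the scaling parameters from the training data only
mu = avg(train)
sigma = std(train)

// Apply the same parameters to both sets
scaledTrain = (train - mu) \ sigma
scaledNew = (newData - mu) \ sigma

// The training set is centered by construction; the new batch only approximately
avg(scaledTrain)
avg(scaledNew)
```

Reusing mu and sigma keeps the new batch on the same scale the model saw during fitting.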

6.2 Feature Encoding

// Categorical encoding (one-hot)
def oneHotEncode(categories) {
    uniqueVals = distinct(categories)
    n = categories.size()
    m = uniqueVals.size()

    // n x m matrix of zeros; column j corresponds to uniqueVals[j]
    result = matrix(INT, n, m, , 0)
    // 0:n iterates over 0 to n-1
    for (i in 0:n) {
        j = find(uniqueVals, categories[i])
        result[i, j] = 1
    }
    return result
}

// Usage
categories = take(`A`B`C, 100)
oneHotEncode(categories)
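For a handful of known categories, the same encoding can be written without a loop, since a comparison against one category value already yields an indicator vector. A small sketch (the column names isA, isB, and isC are my own):

```dolphindb
// Six labels cycling through A, B, C
categories = take(`A`B`C, 6)

// Each comparison gives a boolean vector; int() turns it into 0/1
isA = int(categories == `A)
isB = int(categories == `B)
isC = int(categories == `C)

// One indicator column per category, next to the original labels
encoded = table(categories, isA, isB, isC)
encoded
```

Every row has exactly one indicator set, so isA + isB + isC is 1 everywhere.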

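6.3 Feature Selection

// Feature selection: correlation filter
def correlationFilter(features, target, threshold = 0.1) {
    // corr{, target} is a partial application that fixes the second argument of corr
    correlations = each(corr{, target}, features)
    return abs(correlations) > threshold
}

// Usage
x1 = rand(10.0, 1000)
x2 = rand(10.0, 1000)
x3 = rand(10.0, 1000)
y = 2 * x1 + rand(2.0, 1000) - 1.0  // only x1 is related to y

correlationFilter([x1, x2, x3], y)
// Expected: [true, false, false]; an unrelated feature can occasionally exceed the threshold by chance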
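Before settling on a threshold, it helps to look at the correlation values themselves rather than only the true/false mask. A short sketch using partial application (the variable name corrs is mine):

```dolphindb
// One related feature and one unrelated feature
x1 = rand(10.0, 1000)
x2 = rand(10.0, 1000)
y = 2 * x1 + rand(2.0, 1000) - 1.0

// corr{, y} fixes the second argument, so each feature is correlated with y in turn
corrs = each(corr{, y}, [x1, x2])
corrs
```

The first value should sit close to 1 because y is almost a linear function of x1, and the second should hover near 0.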

VII. Practical Cases

7.1 Equipment Failure Prediction

// ========== Equipment failure prediction ==========

// Create device data
n = 10000
t = table(
    1..n as device_id,
    rand(1000.0, n) as vibration,      // vibration
    rand(100.0, n) as temperature,     // temperature
    rand(50.0, n) as pressure,         // pressure
    rand(1000.0, n) as runtime,        // running time
    iif(rand(100.0, n) > 90, 1, 0) as failure  // failure label (about 10% positive)
)

// Feature columns
features = `vibration`temperature`pressure`runtime

// Fit logistic regression on a data source built from the table
model = logisticRegression(sqlDS(<select * from t>), `failure, features)

// Predict; the 0.5 cutoff works whether predict returns probabilities or 0/1 labels
score = predict(model, t)
predicted = iif(score > 0.5, 1, 0)

// Evaluate. The labels here are random and independent of the features,
// so expect near-zero recall and accuracy close to the 90% majority-class rate
evaluateClassification(t.failure, predicted)
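With a rare failure label, accuracy alone can mislead: a model that never predicts a failure already scores high. A quick sketch of that baseline, generating the label the same way as above (posRate and baselineAcc are illustrative names):

```dolphindb
// About 10% of devices are labeled as failures
n = 10000
failure = iif(rand(100.0, n) > 90, 1, 0)

// Share of positive labels
posRate = avg(failure)

// Accuracy of a trivial model that always predicts "no failure"
baselineAcc = 1 - posRate
baselineAcc
```

Any trained model should be compared against this baseline, and precision and recall on the failure class say more than accuracy here.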

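7.2 Energy Consumption Forecasting

// ========== Energy consumption forecasting ==========

// Create energy data
n = 365
t = table(
    2024.01.01 + 0..(n-1) as date,
    1000 + rand(1000.0, n) as energy,       // uniform in [1000, 2000)
    10 + rand(25.0, n) as temperature,      // uniform in [10, 35)
    rand(0..1, n) as is_workday
)

// Features: temperature and the workday flag
features = [t.temperature, double(t.is_workday)]

// Linear regression; model[0] is the intercept
model = ols(t.energy, features)

// Predict
predictions = model[0] +
              model[1] * t.temperature +
              model[2] * double(t.is_workday)

// Evaluate. Energy is generated independently of the features, so R² will be close to 0
evaluateRegression(t.energy, predictions)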

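VIII. Summary

This article covered DolphinDB's machine learning functions:

1. Regression analysis: linear regression, polynomial regression, evaluation metrics
2. Classification models: logistic regression, classification prediction, evaluation metrics
3. Cluster analysis: K-Means clustering, cluster evaluation
4. Time series: ARIMA model, time series forecasting
5. Feature engineering: feature scaling, feature encoding, feature selection
6. Practical applications: failure prediction, energy consumption forecasting

Questions for reflection:

1. How do you choose a suitable machine learning model?
2. How do you evaluate model performance?
3. Why does feature engineering matter?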

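That's all for this post. The more I dig into this technology, the more interesting it gets, and I'll keep updating these notes as I learn more.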

小丸子博客