Reading the Glow Source Code

Learning C++ through a real project is an excellent approach. Work required me to study the FB/Glow codebase in some depth; many of the C++ techniques and idioms in it are readable at a glance, yet not something I could immediately apply fluently, so they are worth digging into continuously.

Log

C++ in Glow

This part illustrates some C++ techniques with code examples taken from Glow, to reinforce them.

Virtual functions

The base class declares a virtual function.

// Backend.h
virtual bool transformPostLowering(Function *F, CompilationMode mode) const {
  return false;
}

The derived class then overrides the method. The override keyword in the declaration states that the method is intended to override a base-class virtual.

class CPUBackend : public Backend {
  ...
  bool transformPostLowering(Function *F, CompilationMode mode) const override;
};

bool CPUBackend::transformPostLowering(Function *F, CompilationMode mode) const {
  ...
}

The following comes from a clearly written article:
1) Public inheritance
Pure virtual function => inherits: interface
Plain virtual function => inherits: interface + default implementation
Non-virtual member function => inherits: interface + mandatory implementation

A non-virtual member function is best left unredefined in derived classes.
Overriding a plain virtual function must satisfy strict requirements (matching return type, const qualification, reference qualifiers, etc.). The override keyword makes the compiler verify that the function really does override a base-class virtual.
The three kinds of functions are declared in a base class like this:

class Shape {
public:
  virtual void Draw() const = 0;         // 1) pure virtual function
  virtual void Error(const string& msg); // 2) plain virtual function
  int ObjectID() const;                  // 3) non-virtual function
};
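A minimal standalone sketch of a derived class under these rules (Circle, the function bodies, and id_ are hypothetical additions for illustration, not from the article or Glow):

#include <iostream>
#include <string>
using namespace std;

class Shape {
public:
  virtual void Draw() const = 0;          // 1) pure virtual: interface only
  virtual void Error(const string &msg) { // 2) plain virtual: has a default
    cout << "error: " << msg << endl;
  }
  int ObjectID() const { return id_; }    // 3) non-virtual: mandatory impl
  virtual ~Shape() = default;

private:
  int id_ = 0;
};

class Circle : public Shape {
public:
  // The pure virtual must be implemented; override makes the compiler verify
  // that the signature (return type, const-ness, etc.) actually matches.
  void Draw() const override { cout << "circle" << endl; }
  // Error() keeps the inherited default implementation. ObjectID() is
  // deliberately not redefined: redefining a non-virtual would hide, not
  // override, the base version.
};

int main() {
  Circle c;
  Shape &s = c;
  s.Draw();                     // dynamic dispatch: prints "circle"
  cout << s.ObjectID() << endl; // prints 0
}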

std::unique_ptr, std::move, etc.

Code first.

class OpenCLFunction final : public CompiledFunction {
  ...
  std::unique_ptr<IRFunction> F_;

public:
  /// Ctor.
  explicit OpenCLFunction(std::unique_ptr<IRFunction> F);
};

OpenCLFunction::OpenCLFunction(std::unique_ptr<IRFunction> F)
    : F_(std::move(F)) {
  ...
}

std::unique_ptr<CompiledFunction>
OCLBackend::compile(std::unique_ptr<IRFunction> IR) const {
  return llvm::make_unique<OpenCLFunction>(std::move(IR));
}

unique_ptr has exclusive ownership of the object it points to: at any moment only one unique_ptr may point to a given object (enforced by deleting copy semantics and providing only move semantics). It is defined in the <memory> header (not memory.h), in namespace std.

Here is a very thorough article: "C++ 11 创建和使用 unique_ptr" (Creating and Using unique_ptr in C++11).
The key points:

  1. The main ways to construct a unique_ptr (std::make_unique is C++14; pre-C++14 code such as Glow uses llvm::make_unique instead):

    std::unique_ptr<ClassName> p = std::make_unique<ClassName>();
    std::unique_ptr<ClassName> p(new ClassName());
  2. A unique_ptr can hand a dynamically allocated resource out of the function that created it:

    unique_ptr<int> Func(int p)
    {
      unique_ptr<int> pInt(new int(p)); // construction
      return pInt;                      // return the unique_ptr; it is moved out
    }

    int main() {
      int p = 5;
      unique_ptr<int> ret = Func(p);
      cout << *ret << endl;
      // the resource is released automatically when ret goes out of scope
    }
  3. Resources are released automatically. This avoids the leak that manual new/delete suffers when an exception (or early return) prevents the delete from running.

  4. Move construction and move assignment are available, while copy construction and copy assignment are not: unique_ptr<T> p2(std::move(p1)); or unique_ptr<T> p2 = std::move(p1); (see the sketch after this list).
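A minimal standalone sketch of this move-only behavior (hypothetical, not from Glow):

#include <iostream>
#include <memory>
#include <utility>

int main() {
  auto p1 = std::make_unique<int>(42); // C++14; llvm::make_unique in Glow
  // auto p2 = p1;                     // error: the copy constructor is deleted
  auto p2 = std::move(p1);             // OK: ownership transfers to p2
  // After the move, p1 is guaranteed to be null.
  std::cout << (p1 ? "p1 owns" : "p1 empty") << ", *p2 = " << *p2 << std::endl;
}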

std::move
Since C++14 its declaration is:
template< class T >
constexpr typename std::remove_reference<T>::type&& move( T&& t ) noexcept;
std::move takes a forwarding reference and returns an rvalue reference; it is an unconditional cast and moves nothing by itself. An example in Glow:

void Context::insert(Placeholder *P, Tensor &&T) {
  assert(!map_.count(P) && "Placeholder already registered");
  // Take ownership over the tensor.
  map_[P] = new Tensor(std::move(T));
  nameMap_[P->getName()] = P;
}

// somewhere
Tensor ptrT = orig->getUnowned(orig->dims());
insert(placeholders[i], std::move(ptrT));
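To stress that std::move itself only casts, a small standalone sketch (hypothetical, not from Glow):

#include <iostream>
#include <string>
#include <utility>

int main() {
  std::string a = "payload";
  std::string b = std::move(a); // the cast selects the move constructor
  std::cout << b << std::endl;  // prints "payload"
  // a is left in a valid but unspecified state; don't rely on its contents.
}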

lambda function

Callback functions

/// Callback type used by HostManager and DeviceManager, used to pass results
/// of an inference request back to the caller.
using ResultCBTy = std::function<void(
    runtime::RunIdentifierTy, runtime::ResultCode, std::unique_ptr<Context>)>;

void CPUDeviceManager::runFunctionImpl(RunIdentifierTy id,
                                       std::string function,
                                       std::unique_ptr<Context> ctx,
                                       ResultCBTy resultCB) {
  ...
  resultCB(id, ResultCode::Executed, std::move(ctx));
}

Invocation:

/// Starts a run of resnet50 on the given image. The image must be already
/// loaded into the input placeholder in \p ctx.
/// If, at the end of the run, the number of \p returned results is equal to
/// maxImages, the \p finished promise is set.
void dispatchClassify(unsigned int id, DeviceManager *device, std::string path,
                      Placeholder *output, std::unique_ptr<Context> ctx,
                      std::atomic<size_t> &returned,
                      std::promise<void> &finished) {
  device->runFunction(
      "resnet50", std::move(ctx),
      [id, path, output, &returned, &finished](RunIdentifierTy, ResultCode r,
                                               std::unique_ptr<Context> ctx) {
        size_t maxIdx = ctx->get(output)->getHandle<>().minMaxArg().second;

        llvm::outs() << "(" << id << ") " << path << ": " << maxIdx << "\n";
        if (++returned == maxImages) {
          finished.set_value();
        }
      });
}

lambda expressions in C++

  1. capture clause (Also known as the lambda-introducer in the C++ specification.)
    In the code above, besides its parameters the lambda body uses id, path and other variables from the surrounding scope, i.e. variables external to the lambda.
    The capture-defaults are [&] (capture by reference) and [=] (capture by value); they apply to every variable not listed explicitly, and individual variables can deviate from them. Assume the outer variable total is to be captured by reference and factor by value; then the following are equivalent:
    [&total, factor]
    [&, factor]
    [=, &total]
    ... (variants that merely reorder the list are omitted)
  2. parameter list. Optional. (Also known as the lambda declarator.)

  3. mutable specification. Optional.

  4. exception-specification. Optional.

  5. trailing-return-type. Optional.

  6. lambda body.
    Another example:

    // captures_lambda_expression.cpp
    // compile with: /W4 /EHsc
    #include <iostream>
    using namespace std;

    int main()
    {
      int m = 0;
      int n = 0;
      [&, n] (int a) mutable { m = ++n + a; }(4);
      cout << m << endl << n << endl;
    }
    /*
     * 5
     * 0
     *
     * m is captured by reference via the & capture-default, so the write is
     * visible in main: m = ++n + a = 1 + 4 = 5. n is captured by value, and
     * mutable only lets the lambda modify its own copy, so n in main stays 0.
     */

Glow Architecture

ExecutionEngine compilation flow

v1

void ExecutionEngine::compile(CompilationMode mode, Function *F) {
  function_ = backend_->compile(generateIR(mode, F));
}

generateIR() does a lot of work:

  • graph (class Function) optimization
  • graph lowering (backend-specific)
  • pre-/post-lowering graph processing
  • IR (IRFunction) generation
  • IR optimization

function_ has the type class CompiledFunction, which is really an interface class.

  • The EE generates the IR and hands it to backend_ for further compilation;
  • When the EE runs the graph it calls function_->execute(), a pure virtual function of class CompiledFunction;
  • The CPU backend's CPUFunction, the OpenCL backend's OpenCLFunction, and so on all inherit from CompiledFunction.
    In the code, the declared return type is Base * while the object actually returned is a Derived *:
    class OCLBackend final : public Backend {
    public:
      ...
      std::unique_ptr<CompiledFunction>
      compile(std::unique_ptr<IRFunction> IR) const override;
    };

    std::unique_ptr<CompiledFunction>
    OCLBackend::compile(std::unique_ptr<IRFunction> IR) const {
      return llvm::make_unique<OpenCLFunction>(std::move(IR));
    }

v2

PostOrderVisitor

Glow traverses the graph with the visitor pattern.
Every ClassName##Node defines a visit() method;
a visitor picks pre-order or post-order traversal by implementing pre() and/or post().

A node's visit() method, abridged:

void XXvisit(Node *parent, NodeWalker *visitor)
{
  visitor->pre(parent, this);

  for (const auto &op : nodeInputs_) {
    op.getNode()->visit(this, visitor);
  }

  for (const auto &op : members_) {
    if (op.first == MemberType::VectorNodeValue) {
      for (auto &I : op.second) {
        I.getNode()->visit(this, visitor);
      }
    }
  }

  visitor->post(parent, this);
}

Because I had the parent/child convention and the root node backwards, I long believed that post order yields the reverse of the normal computation order.
Take the mnist computation graph as an example (the graph is large; see the link).
The computation order is
conv-pool2-relu311-conv1-…-fc_1X-fc_dot-fc_add_bias-sm-return
This graph is really just a stack of layers, so its structure is simple.
Following construction order it is tempting to treat conv as pool2's parent; but under the usual definition of a computation graph, the final return node is the parent, i.e. computation runs bottom-up. A post-order traversal is therefore exactly the normal computation order.
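A minimal usage sketch (assuming a Function *F; GraphPostOrderVisitor and getPostOrder() also appear in the scheduler code below):

// Visit nodes in post order: every node's inputs come before the node
// itself, which is a valid computation order.
GraphPostOrderVisitor visitor(*F);
for (auto *N : visitor.getPostOrder()) {
  llvm::outs() << N->getName() << "\n";
}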

Now for a closer look at Glow's ChildMemSizeBasedScheduler.
By the definition above, a node's inputs are its children in the computation graph. Take an Add node as an example: C = A + B.
Should ANode or BNode be computed first? The criterion is Metric[child] = maxMemSize_[child] - resultMemSize_[child], and the child with the larger value goes first.

// pseudo code
void orderChildNodesAndSchedule(Node *N) {
  1. if N is scheduled, or N is a Variable (a weight input for an op node,
     not an activation), return;

  2. for each input of N:
       children.push_back(input);

  3. order children by Metric, descending;

  4. for each child in children:
       orderChildNodesAndSchedule(child);

  5. scheduled_.push_back(N);
}

A node N's maxMemSize is then defined as:
max(sum(resultMemSize of children), max(maxMemSize of children))
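A hypothetical worked example of why the larger metric goes first: for C = A + B, suppose maxMemSize[A] = 100 with resultMemSize[A] = 10, and maxMemSize[B] = 40 with resultMemSize[B] = 30, so Metric[A] = 90 > Metric[B] = 10. Computing A first gives a peak of max(100, 10 + 40) = 100 bytes, since A's large transient peak occurs while nothing else is live and only its small 10-byte result is held during B's evaluation; computing B first gives max(40, 30 + 100) = 130 bytes.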
The concrete data structures and algorithm:

/// Required number of bytes to hold the results of a given node.
std::unordered_map<const Node *, int64_t> resultMemSize_;
/// Max number of bytes required during the computation of a given node.
std::unordered_map<const Node *, int64_t> maxMemSize_;

void ChildMemSizeBasedScheduler::computeNodeComputationMaxMemorySize() {
  // Traverse nodes in such a way that dependencies are processed
  // before the nodes using them.
  GraphPostOrderVisitor visitor(G_);
  for (auto *N : visitor.getPostOrder()) {
    int64_t maxSize = (N->getNumInputs() > 0)
                          ? std::max(resultMemSize_[N->getNthInput(0)],
                                     maxMemSize_[N->getNthInput(0)])
                          : 0;
    for (size_t idx = 1, e = N->getNumInputs(); idx < e; ++idx) {
      const auto &input = N->getNthInput(idx);
      // Skip operands that do not require memory allocations for storing
      // their results.
      if (isa<Storage>(input))
        continue;
      assert(resultMemSize_.count(input) > 0);
      assert(maxMemSize_.count(input) > 0);
      maxSize += resultMemSize_[input];
      if (maxSize < maxMemSize_[input])
        maxSize = maxMemSize_[input];
    }
    maxMemSize_[N] = maxSize;
  }
}

DeviceManager design

When we consider distributed execution, the devices have to be managed behind a uniform interface, and their interaction with the computation graph has to be worked out, from the angles of both execution and code generation.

Splitting Variable into Constant and Placeholder

In earlier versions of Glow,
Variable was the container for weights: the input, a Conv's filter and bias, and so on. These weights fall into two kinds: MutableWeights that change (inputs, training) and ConstantWeights (e.g. zero tensors and the weights during inference). Glow's implementation marks the WeightVar of every Variable as Mutable during IRGen, and the IROptimizer later flips WeightVars that are never written to Constant.

// IRGen
case glow::Kinded::Kind::VariableKind: {
  auto *V = cast<Variable>(N);
  auto *W = builder_.createWeightVar(V->getType(), V->getName(),
                                     WeightVar::MutabilityKind::Mutable,
                                     V->getVisibilityKind());
  W->setName(N->getName());
  registerIR(N, W);
  break;
}

// IROptimizer
// For each instruction that uses the weight:
for (const auto &U : ValueUses(W)) {
  auto kind = U.getOperand().second;
  // Check if all of the users are read-only.
  if (kind != OperandKind::In) {
    readOnly = false;
    break;
  }
}

// Mark the variable as read only.
if (readOnly)
  W->setMutability(WeightVar::MutabilityKind::Constant);

Note: when a Variable's Visibility attribute is Public (meaning someone may hold a reference to it), it must not be optimized, and its corresponding WeightVar must remain Mutable.

Variable's member variables:

class Variable : public Node {
  /// Specifies if the variable is trainable.
  bool isTrainable_;
  /// Specifies the visibility of the variable.
  VisibilityKind visibility_;
  /// The tensor payload that the variable holds.
  Tensor payload_;
  ...
public:
  Tensor &getPayload() { return payload_; }
};

Placeholder and Constant
Compare the two class definitions:

class Storage : public Node {
public:
  ... // some functions
};

class Constant : public Storage {
  /// The tensor payload that the constant holds.
  Tensor payload_;
public:
  ...
};

/// Placeholder nodes are unbound-storage. The content tensors are attached to
/// this node at runtime. Placeholders are used as inputs and output nodes to
/// the network.
class Placeholder : public Storage {
  /// Specifies if the placeholder is trainable.
  bool isTrainable_;
public:
  /// Create a new placeholder.
  Placeholder(llvm::StringRef name, TypeRef Ty, bool isTrainable)
      : Storage(Kinded::Kind::PlaceholderKind, name),
        isTrainable_(isTrainable) {
    addResult(Ty);
  }
  ...
};

On the application side, take the lenet mnist network as an example: when building the network, the input is a Placeholder, and the Conv's filter and bias are likewise declared as Placeholders. (Note the addResult(Ty) call in the Placeholder constructor.)

// mnist.cpp
Placeholder *A = mod.createPlaceholder(
    ElemKind::FloatTy, {minibatchSize, 28, 28, 1}, "input", false);
...
Tensor *inputTensor = ctx.allocate(A);

// Context.cpp
Tensor *Context::allocate(Placeholder *P) {
  assert(!map_.count(P) && "Placeholder already registered");
  Tensor *T = new Tensor(P->getType());
  map_[P] = T;
  nameMap_[P->getName()] = P;
  return T;
}

Context

What information about the neural network does Context maintain?

  • a mapping between some graph nodes and concrete tensors
  • traceEvents
  • ownership of the tensors: 1) the tensors are allocated in the member function allocate(); 2) insert(Placeholder *P, Tensor &&T) takes an rvalue reference and uses std::move()
/// This class provides a mapping between some graph nodes, which are a symbolic
/// representation of some computation, and concrete tensors that represent the
/// inputs and outputs to the graph. The context owns the tensors and the graph
/// uses these values as runtime. This is useful for the multi-threaded
/// execution of code, where each thread has a different execution context. The
/// difference between this class and a regular map is that the Context owns the
/// Tensors (not only the pointers) and manages their lifetime.
class Context final {
public:
  /// Maps placeholders to the tensors that back them.
  using PlaceholderMap = std::unordered_map<Placeholder *, Tensor *>;

  /// Maps Placeholder names to Placeholders.
  using PlaceholderNameMap = std::unordered_map<std::string, Placeholder *>;

private:
  /// Maps Placeholders to Tensors.
  PlaceholderMap map_;

  /// Maps Placeholder names to Placeholders.
  PlaceholderNameMap nameMap_;

  /// Trace Events recorded during this run.
  std::vector<TraceEvent> traceEvents_;

An example use of Context::insert():

// ExecutorTest.cpp
auto refCtx = llvm::make_unique<Context>();
auto *tensor = testCtx->allocate(placeholder.get());
tensor->init(Tensor::InitKind::Xavier, 1.0, rng);
refCtx->insert(placeholder.get(), tensor->clone());

Backends

class Backend
An interface class; its key function:
compile():
virtual std::unique_ptr<CompiledFunction> compile(Function *F) = 0;

class BackendUsingGlowIR : public Backend
compileIR(): adds an interface that compiles an IRFunction into a CompiledFunction (implemented in LLVMBackend):
virtual std::unique_ptr<CompiledFunction> compileIR(std::unique_ptr<IRFunction> IR) = 0;

class LLVMBackend : public BackendUsingGlowIR
Implements compile().
Implements compileIR().

createIRGen(): the interface for constructing the LLVM IR generator (each concrete backend supplies a platform-specific implementation):
virtual std::unique_ptr<LLVMIRGen> createIRGen(const IRFunction *IR, AllocationsInfo &allocationsInfo) const = 0;

createCompiledFunction(): the interface for constructing a CompiledFunction (implemented in CPUBackend):
virtual std::unique_ptr<CompiledFunction> createCompiledFunction(std::unique_ptr<llvm::orc::GlowJIT> JIT, const runtime::RuntimeBundle &runtimeBundle) const = 0;

class CPUBackend : public LLVMBackend
Implements createIRGen() and createCompiledFunction(), i.e. the interfaces left open by LLVMBackend. A call to compile(Function *) therefore flows as:
compile(Function *)
  generateAndOptimizeIR(Function *)
  compileIR(IRFunction *)
    createIRGen()
    glow::generateRuntimeBundle()
    createCompiledFunction()
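A condensed sketch of this hierarchy, using the signatures listed above (the stub declarations stand in for the real Glow/LLVM types so the sketch compiles on its own; const qualifiers on compile()/compileIR() vary across Glow versions and are omitted here):

#include <memory>

// Stubs for the real Glow/LLVM types, assumed declared elsewhere in Glow.
class Function;
class IRFunction;
class CompiledFunction;
class LLVMIRGen;
struct AllocationsInfo;
namespace llvm { namespace orc { class GlowJIT; } }
namespace runtime { struct RuntimeBundle; }

class Backend {
public:
  virtual std::unique_ptr<CompiledFunction> compile(Function *F) = 0;
  virtual ~Backend() = default;
};

class BackendUsingGlowIR : public Backend {
public:
  virtual std::unique_ptr<CompiledFunction>
  compileIR(std::unique_ptr<IRFunction> IR) = 0;
};

class LLVMBackend : public BackendUsingGlowIR {
public:
  std::unique_ptr<CompiledFunction> compile(Function *F) override;
  std::unique_ptr<CompiledFunction>
  compileIR(std::unique_ptr<IRFunction> IR) override;
  // Left for concrete backends:
  virtual std::unique_ptr<LLVMIRGen>
  createIRGen(const IRFunction *IR,
              AllocationsInfo &allocationsInfo) const = 0;
  virtual std::unique_ptr<CompiledFunction>
  createCompiledFunction(std::unique_ptr<llvm::orc::GlowJIT> JIT,
                         const runtime::RuntimeBundle &runtimeBundle) const = 0;
};

class CPUBackend : public LLVMBackend {
public:
  std::unique_ptr<LLVMIRGen>
  createIRGen(const IRFunction *IR,
              AllocationsInfo &allocationsInfo) const override;
  std::unique_ptr<CompiledFunction>
  createCompiledFunction(std::unique_ptr<llvm::orc::GlowJIT> JIT,
                         const runtime::RuntimeBundle &runtimeBundle)
      const override;
};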


class CompiledFunction

// execute() is the most important interface. CompiledFunction itself holds no
// compiled low-level IR; derived classes supply it.
// RuntimeBundle stores the SymbolTable and the memory-size info (constants,
// placeholders, activations; each is allocated by its own Allocator, so the
// offsets are independent).
// If collectConstants is used, it also allocates the space into which the
// constants are aggregated.
class CompiledFunction {
public:
  virtual void execute(Context *ctx) = 0;
  ...
protected:
  runtime::RuntimeBundle runtimeBundle_;
  ...
};

Memory management

CPUBackend

void LLVMCompiledFunction::execute(Context *ctx) {
  uint8_t *baseActivationsAddress{nullptr};

  /// Base address for Mutable weights memory block, Inputs and Outputs.
  uint8_t *baseMutableWeightVarsAddress{nullptr};

  if (runtimeBundle_.getActivationsSize() != 0) {
    baseActivationsAddress = (uint8_t *)alignedAlloc(
        runtimeBundle_.getActivationsSize(), TensorAlignment);
  }

  if (runtimeBundle_.getMutableWeightSize() != 0) {
    baseMutableWeightVarsAddress = (uint8_t *)alignedAlloc(
        runtimeBundle_.getMutableWeightSize(), TensorAlignment);
  }

  loadPlaceholders(ctx, baseMutableWeightVarsAddress);

  auto sym = JIT_->findSymbol("jitmain");
  assert(sym && "Unable to JIT the code!");
  using JitFuncType =
      void (*)(uint8_t * constantWeightVars, uint8_t * mutableWeightVars,
               uint8_t * activations);
  auto address = sym.getAddress();
  if (address) {
    JitFuncType funcPtr = reinterpret_cast<JitFuncType>(address.get());
    funcPtr(runtimeBundle_.getConstants(), baseMutableWeightVarsAddress,
            baseActivationsAddress);
  } else {
    GLOW_UNREACHABLE("Error getting address");
  }

  updatePlaceholders(ctx, baseMutableWeightVarsAddress);

  alignedFree(baseMutableWeightVarsAddress);
  alignedFree(baseActivationsAddress);

  translateTraceEvents(ctx);
}

Deriving intermediate activation Tensor dimensions in Glow

Take Convolution as an example. In ConvolutionNode *Function::createConv(...), the output dimensions are computed and passed as a constructor argument to ConvolutionNode.

auto outSz = calculateConvPoolOutputDims(idim.h, idim.w, kernels, strides, pads);
...
auto OT = getParent()->uniqueType(ElemKind::FloatTy, outDims);
return addNode(new ConvolutionNode(name, OT, input, filter, bias, kernels,
                                   strides, pads, group));

uniqueType's return type is TypeRef, where using TypeRef = const Type *;

TypeRef Module::uniqueType(ElemKind elemTy, llvm::ArrayRef<size_t> dims);
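For intuition, a minimal sketch of the usual convolution/pooling output-size arithmetic (an assumption about what calculateConvPoolOutputDims computes; convPoolOutputDims below is hypothetical and ignores dilation and asymmetric padding):

#include <cstddef>
#include <utility>

// Standard formula: out = (in + 2 * pad - kernel) / stride + 1.
std::pair<size_t, size_t> convPoolOutputDims(size_t inH, size_t inW,
                                             size_t kernel, size_t stride,
                                             size_t pad) {
  size_t outH = (inH + 2 * pad - kernel) / stride + 1;
  size_t outW = (inW + 2 * pad - kernel) / stride + 1;
  return {outH, outW};
}

// E.g. a 28x28 mnist input with kernel 5, stride 1, pad 2 stays 28x28:
// (28 + 4 - 5) / 1 + 1 = 28.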