Reading the Glow Source Code

Learning C++ through a real project is an excellent approach. Work required me to study the FB/Glow codebase in some depth; many of the C++ techniques and idioms in it are readable at a glance, yet not something I could immediately apply fluently, so they are worth digging into continuously.

Log

C++ in Glow

This part illustrates some C++ techniques with code examples taken from Glow, to reinforce them.

Virtual functions

The base class declares a virtual function.

// Backend.h
virtual bool transformPostLowering(Function *F, CompilationMode mode) const {
  return false;
}

The derived class then overrides the method. The override keyword in the declaration states that the method is intended to override a base-class virtual.

class CPUBackend : public Backend {
  ...
  bool transformPostLowering(Function *F, CompilationMode mode) const override;
};

bool CPUBackend::transformPostLowering(Function *F, CompilationMode mode) const {
  ...
}

The following comes from a clearly written article:
1) Public inheritance
Pure virtual function => inherits: interface
Plain virtual function => inherits: interface + default implementation
Non-virtual member function => inherits: interface + mandatory implementation

A non-virtual member function is best left unredefined in derived classes.
Overriding a plain virtual function must satisfy strict requirements (matching return type, const qualification, reference qualifiers, etc.). The override keyword makes the compiler verify that the function really does override a base-class virtual.
The three kinds of functions are declared in a base class like this:

class Shape {
public:
  virtual void Draw() const = 0;         // 1) pure virtual function
  virtual void Error(const string& msg); // 2) plain virtual function
  int ObjectID() const;                  // 3) non-virtual function
};
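A minimal standalone sketch of a derived class under these rules (Circle, the function bodies, and id_ are hypothetical additions for illustration, not from the article or Glow):

#include <iostream>
#include <string>
using namespace std;

class Shape {
public:
  virtual void Draw() const = 0;          // 1) pure virtual: interface only
  virtual void Error(const string &msg) { // 2) plain virtual: has a default
    cout << "error: " << msg << endl;
  }
  int ObjectID() const { return id_; }    // 3) non-virtual: mandatory impl
  virtual ~Shape() = default;

private:
  int id_ = 0;
};

class Circle : public Shape {
public:
  // The pure virtual must be implemented; override makes the compiler verify
  // that the signature (return type, const-ness, etc.) actually matches.
  void Draw() const override { cout << "circle" << endl; }
  // Error() keeps the inherited default implementation. ObjectID() is
  // deliberately not redefined: redefining a non-virtual would hide, not
  // override, the base version.
};

int main() {
  Circle c;
  Shape &s = c;
  s.Draw();                     // dynamic dispatch: prints "circle"
  cout << s.ObjectID() << endl; // prints 0
}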

std::unique_ptr, std::move, etc.

Code first.

class OpenCLFunction final : public CompiledFunction {
  ...
  std::unique_ptr<IRFunction> F_;

public:
  /// Ctor.
  explicit OpenCLFunction(std::unique_ptr<IRFunction> F);
};

OpenCLFunction::OpenCLFunction(std::unique_ptr<IRFunction> F)
    : F_(std::move(F)) {
  ...
}

std::unique_ptr<CompiledFunction>
OCLBackend::compile(std::unique_ptr<IRFunction> IR) const {
  return llvm::make_unique<OpenCLFunction>(std::move(IR));
}

unique_ptr has exclusive ownership of the object it points to: at any moment only one unique_ptr may point to a given object (enforced by deleting copy semantics and providing only move semantics). It is defined in the <memory> header (not memory.h), in namespace std.

Here is a very thorough article: "C++ 11 创建和使用 unique_ptr" (Creating and Using unique_ptr in C++11).
The key points:

  1. The main ways to construct a unique_ptr (std::make_unique is C++14; pre-C++14 code such as Glow uses llvm::make_unique instead):

    std::unique_ptr<ClassName> p = std::make_unique<ClassName>();
    std::unique_ptr<ClassName> p(new ClassName());
  2. A unique_ptr can hand a dynamically allocated resource out of the function that created it:

    unique_ptr<int> Func(int p)
    {
      unique_ptr<int> pInt(new int(p)); // construction
      return pInt;                      // return the unique_ptr; it is moved out
    }

    int main() {
      int p = 5;
      unique_ptr<int> ret = Func(p);
      cout << *ret << endl;
      // the resource is released automatically when ret goes out of scope
    }
  3. Resources are released automatically. This avoids the leak that manual new/delete suffers when an exception (or early return) prevents the delete from running.

  4. Move construction and move assignment are available, while copy construction and copy assignment are not: unique_ptr<T> p2(std::move(p1)); or unique_ptr<T> p2 = std::move(p1); (see the sketch after this list).
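A minimal standalone sketch of this move-only behavior (hypothetical, not from Glow):

#include <iostream>
#include <memory>
#include <utility>

int main() {
  auto p1 = std::make_unique<int>(42); // C++14; llvm::make_unique in Glow
  // auto p2 = p1;                     // error: the copy constructor is deleted
  auto p2 = std::move(p1);             // OK: ownership transfers to p2
  // After the move, p1 is guaranteed to be null.
  std::cout << (p1 ? "p1 owns" : "p1 empty") << ", *p2 = " << *p2 << std::endl;
}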

std::move
Since C++14 its declaration is:
template< class T >
constexpr typename std::remove_reference<T>::type&& move( T&& t ) noexcept;
std::move takes a forwarding reference and returns an rvalue reference; it is an unconditional cast and moves nothing by itself. An example in Glow:

void Context::insert(Placeholder *P, Tensor &&T) {
  assert(!map_.count(P) && "Placeholder already registered");
  // Take ownership over the tensor.
  map_[P] = new Tensor(std::move(T));
  nameMap_[P->getName()] = P;
}

// somewhere
Tensor ptrT = orig->getUnowned(orig->dims());
insert(placeholders[i], std::move(ptrT));
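To stress that std::move itself only casts, a small standalone sketch (hypothetical, not from Glow):

#include <iostream>
#include <string>
#include <utility>

int main() {
  std::string a = "payload";
  std::string b = std::move(a); // the cast selects the move constructor
  std::cout << b << std::endl;  // prints "payload"
  // a is left in a valid but unspecified state; don't rely on its contents.
}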

lambda function

Callback functions

/// Callback type used by HostManager and DeviceManager, used to pass results
/// of an inference request back to the caller.
using ResultCBTy = std::function<void(
    runtime::RunIdentifierTy, runtime::ResultCode, std::unique_ptr<Context>)>;

void CPUDeviceManager::runFunctionImpl(RunIdentifierTy id,
                                       std::string function,
                                       std::unique_ptr<Context> ctx,
                                       ResultCBTy resultCB) {
  ...
  resultCB(id, ResultCode::Executed, std::move(ctx));
}

Invocation:

/// Starts a run of resnet50 on the given image. The image must be already
/// loaded into the input placeholder in \p ctx.
/// If, at the end of the run, the number of \p returned results is equal to
/// maxImages, the \p finished promise is set.
void dispatchClassify(unsigned int id, DeviceManager *device, std::string path,
                      Placeholder *output, std::unique_ptr<Context> ctx,
                      std::atomic<size_t> &returned,
                      std::promise<void> &finished) {
  device->runFunction(
      "resnet50", std::move(ctx),
      [id, path, output, &returned, &finished](RunIdentifierTy, ResultCode r,
                                               std::unique_ptr<Context> ctx) {
        size_t maxIdx = ctx->get(output)->getHandle<>().minMaxArg().second;

        llvm::outs() << "(" << id << ") " << path << ": " << maxIdx << "\n";
        if (++returned == maxImages) {
          finished.set_value();
        }
      });
}

lambda expressions in C++

  1. capture clause (Also known as the lambda-introducer in the C++ specification.)
    In the code above, besides its parameters the lambda body uses id, path and other variables from the surrounding scope, i.e. variables external to the lambda.
    The capture-defaults are [&] (capture by reference) and [=] (capture by value); they apply to every variable not listed explicitly, and individual variables can deviate from them. Assume the outer variable total is to be captured by reference and factor by value; then the following are equivalent:
    [&total, factor]
    [&, factor]
    [=, &total]
    ... (variants that merely reorder the list are omitted)
  2. parameter list. Optional. (Also known as the lambda declarator.)

  3. mutable specification. Optional.

  4. exception-specification. Optional.

  5. trailing-return-type. Optional.

  6. lambda body.
    Another example:

    // captures_lambda_expression.cpp
    // compile with: /W4 /EHsc
    #include <iostream>
    using namespace std;

    int main()
    {
      int m = 0;
      int n = 0;
      [&, n] (int a) mutable { m = ++n + a; }(4);
      cout << m << endl << n << endl;
    }
    /*
     * 5
     * 0
     *
     * m is captured by reference via the & capture-default, so the write is
     * visible in main: m = ++n + a = 1 + 4 = 5. n is captured by value, and
     * mutable only lets the lambda modify its own copy, so n in main stays 0.
     */

Glow Architecture

ExecutionEngine compilation flow

v1

void ExecutionEngine::compile(CompilationMode mode, Function *F) {
  function_ = backend_->compile(generateIR(mode, F));
}

generateIR() does a lot of work:

  • graph (class Function) optimization
  • graph lowering (backend-specific)
  • pre-/post-lowering graph processing
  • IR (IRFunction) generation
  • IR optimization

function_ has the type class CompiledFunction, which is really an interface class.

  • The EE generates the IR and hands it to backend_ for further compilation;
  • When the EE runs the graph it calls function_->execute(), a pure virtual function of class CompiledFunction;
  • The CPU backend's CPUFunction, the OpenCL backend's OpenCLFunction, and so on all inherit from CompiledFunction.
    In the code, the declared return type is Base * while the object actually returned is a Derived *:
    class OCLBackend final : public Backend {
    public:
      ...
      std::unique_ptr<CompiledFunction>
      compile(std::unique_ptr<IRFunction> IR) const override;
    };

    std::unique_ptr<CompiledFunction>
    OCLBackend::compile(std::unique_ptr<IRFunction> IR) const {
      return llvm::make_unique<OpenCLFunction>(std::move(IR));
    }

v2

PostOrderVisitor

Glow traverses the graph with the visitor pattern.
Every ClassName##Node defines a visit() method;
a visitor picks pre-order or post-order traversal by implementing pre() and/or post().

A node's visit() method, abridged:

void XXvisit(Node *parent, NodeWalker *visitor)
{
  visitor->pre(parent, this);

  for (const auto &op : nodeInputs_) {
    op.getNode()->visit(this, visitor);
  }

  for (const auto &op : members_) {
    if (op.first == MemberType::VectorNodeValue) {
      for (auto &I : op.second) {
        I.getNode()->visit(this, visitor);
      }
    }
  }

  visitor->post(parent, this);
}

Because I had the parent/child convention and the root node backwards, I long believed that post order yields the reverse of the normal computation order.
Take the mnist computation graph as an example (the graph is large; see the link).
The computation order is
conv-pool2-relu311-conv1-…-fc_1X-fc_dot-fc_add_bias-sm-return
This graph is really just a stack of layers, so its structure is simple.
Following construction order it is tempting to treat conv as pool2's parent; but under the usual definition of a computation graph, the final return node is the parent, i.e. computation runs bottom-up. A post-order traversal is therefore exactly the normal computation order.
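A minimal usage sketch (assuming a Function *F; GraphPostOrderVisitor and getPostOrder() also appear in the scheduler code below):

// Visit nodes in post order: every node's inputs come before the node
// itself, which is a valid computation order.
GraphPostOrderVisitor visitor(*F);
for (auto *N : visitor.getPostOrder()) {
  llvm::outs() << N->getName() << "\n";
}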

Now for a closer look at Glow's ChildMemSizeBasedScheduler.
By the definition above, a node's inputs are its children in the computation graph. Take an Add node as an example: C = A + B.
Should ANode or BNode be computed first? The criterion is Metric[child] = maxMemSize_[child] - resultMemSize_[child], and the child with the larger value goes first.

// pseudo code
void orderChildNodesAndSchedule(Node *N) {
  1. if N is scheduled, or N is a Variable (a weight input for an op node,
     not an activation), return;

  2. for each input of N:
       children.push_back(input);

  3. order children by Metric, descending;

  4. for each child in children:
       orderChildNodesAndSchedule(child);

  5. scheduled_.push_back(N);
}

A node N's maxMemSize is then defined as:
max(sum(resultMemSize of children), max(maxMemSize of children))
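A hypothetical worked example of why the larger metric goes first: for C = A + B, suppose maxMemSize[A] = 100 with resultMemSize[A] = 10, and maxMemSize[B] = 40 with resultMemSize[B] = 30, so Metric[A] = 90 > Metric[B] = 10. Computing A first gives a peak of max(100, 10 + 40) = 100 bytes, since A's large transient peak occurs while nothing else is live and only its small 10-byte result is held during B's evaluation; computing B first gives max(40, 30 + 100) = 130 bytes.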
The concrete data structures and algorithm:

/// Required number of bytes to hold the results of a given node.
std::unordered_map<const Node *, int64_t> resultMemSize_;
/// Max number of bytes required during the computation of a given node.
std::unordered_map<const Node *, int64_t> maxMemSize_;

void ChildMemSizeBasedScheduler::computeNodeComputationMaxMemorySize() {
  // Traverse nodes in such a way that dependencies are processed
  // before the nodes using them.
  GraphPostOrderVisitor visitor(G_);
  for (auto *N : visitor.getPostOrder()) {
    int64_t maxSize = (N->getNumInputs() > 0)
                          ? std::max(resultMemSize_[N->getNthInput(0)],
                                     maxMemSize_[N->getNthInput(0)])
                          : 0;
    for (size_t idx = 1, e = N->getNumInputs(); idx < e; ++idx) {
      const auto &input = N->getNthInput(idx);
      // Skip operands that do not require memory allocations for storing
      // their results.
      if (isa<Storage>(input))
        continue;
      assert(resultMemSize_.count(input) > 0);
      assert(maxMemSize_.count(input) > 0);
      maxSize += resultMemSize_[input];
      if (maxSize < maxMemSize_[input])
        maxSize = maxMemSize_[input];
    }
    maxMemSize_[N] = maxSize;
  }
}

DeviceManager design

When we consider distributed execution, the devices have to be managed behind a uniform interface, and their interaction with the computation graph has to be worked out, from the angles of both execution and code generation.

Splitting Variable into Constant and Placeholder

In earlier versions of Glow,
Variable was the container for weights: the input, a Conv's filter and bias, and so on. These weights fall into two kinds: MutableWeights that change (inputs, training) and ConstantWeights (e.g. zero tensors and the weights during inference). Glow's implementation marks the WeightVar of every Variable as Mutable during IRGen, and the IROptimizer later flips WeightVars that are never written to Constant.

// IRGen
case glow::Kinded::Kind::VariableKind: {
  auto *V = cast<Variable>(N);
  auto *W = builder_.createWeightVar(V->getType(), V->getName(),
                                     WeightVar::MutabilityKind::Mutable,
                                     V->getVisibilityKind());
  W->setName(N->getName());
  registerIR(N, W);
  break;
}

// IROptimizer
// For each instruction that uses the weight:
for (const auto &U : ValueUses(W)) {
  auto kind = U.getOperand().second;
  // Check if all of the users are read-only.
  if (kind != OperandKind::In) {
    readOnly = false;
    break;
  }
}

// Mark the variable as read only.
if (readOnly)
  W->setMutability(WeightVar::MutabilityKind::Constant);

Note: when a Variable's Visibility attribute is Public (meaning someone may hold a reference to it), it must not be optimized, and its corresponding WeightVar must remain Mutable.

Variable's member variables:

class Variable : public Node {
  /// Specifies if the variable is trainable.
  bool isTrainable_;
  /// Specifies the visibility of the variable.
  VisibilityKind visibility_;
  /// The tensor payload that the variable holds.
  Tensor payload_;
  ...
public:
  Tensor &getPayload() { return payload_; }
};

Placeholder and Constant
Compare the two class definitions:

class Storage : public Node {
public:
  ... // some functions
};

class Constant : public Storage {
  /// The tensor payload that the constant holds.
  Tensor payload_;
public:
  ...
};

/// Placeholder nodes are unbound-storage. The content tensors are attached to
/// this node at runtime. Placeholders are used as inputs and output nodes to
/// the network.
class Placeholder : public Storage {
  /// Specifies if the placeholder is trainable.
  bool isTrainable_;
public:
  /// Create a new placeholder.
  Placeholder(llvm::StringRef name, TypeRef Ty, bool isTrainable)
      : Storage(Kinded::Kind::PlaceholderKind, name),
        isTrainable_(isTrainable) {
    addResult(Ty);
  }
  ...
};

On the application side, take the lenet mnist network as an example: when building the network, the input is a Placeholder, and the Conv's filter and bias are likewise declared as Placeholders. (Note the addResult(Ty) call in the Placeholder constructor.)

// mnist.cpp
Placeholder *A = mod.createPlaceholder(
    ElemKind::FloatTy, {minibatchSize, 28, 28, 1}, "input", false);
...
Tensor *inputTensor = ctx.allocate(A);

// Context.cpp
Tensor *Context::allocate(Placeholder *P) {
  assert(!map_.count(P) && "Placeholder already registered");
  Tensor *T = new Tensor(P->getType());
  map_[P] = T;
  nameMap_[P->getName()] = P;
  return T;
}

Context

What information about the neural network does Context maintain?

  • a mapping between some graph nodes and concrete tensors
  • traceEvents
  • ownership of the tensors: 1) the tensors are allocated in the member function allocate(); 2) insert(Placeholder *P, Tensor &&T) takes an rvalue reference and uses std::move()
/// This class provides a mapping between some graph nodes, which are a symbolic
/// representation of some computation, and concrete tensors that represent the
/// inputs and outputs to the graph. The context owns the tensors and the graph
/// uses these values as runtime. This is useful for the multi-threaded
/// execution of code, where each thread has a different execution context. The
/// difference between this class and a regular map is that the Context owns the
/// Tensors (not only the pointers) and manages their lifetime.
class Context final {
public:
  /// Maps placeholders to the tensors that back them.
  using PlaceholderMap = std::unordered_map<Placeholder *, Tensor *>;

  /// Maps Placeholder names to Placeholders.
  using PlaceholderNameMap = std::unordered_map<std::string, Placeholder *>;

private:
  /// Maps Placeholders to Tensors.
  PlaceholderMap map_;

  /// Maps Placeholder names to Placeholders.
  PlaceholderNameMap nameMap_;

  /// Trace Events recorded during this run.
  std::vector<TraceEvent> traceEvents_;

An example use of Context::insert():

// ExecutorTest.cpp
auto refCtx = llvm::make_unique<Context>();
auto *tensor = testCtx->allocate(placeholder.get());
tensor->init(Tensor::InitKind::Xavier, 1.0, rng);
refCtx->insert(placeholder.get(), tensor->clone());

Backends

class Backend
An interface class; its key function:
compile():
virtual std::unique_ptr<CompiledFunction> compile(Function *F) = 0;

class BackendUsingGlowIR : public Backend
compileIR(): adds an interface that compiles an IRFunction into a CompiledFunction (implemented in LLVMBackend):
virtual std::unique_ptr<CompiledFunction> compileIR(std::unique_ptr<IRFunction> IR) = 0;

class LLVMBackend : public BackendUsingGlowIR
Implements compile().
Implements compileIR().

createIRGen(): the interface for constructing the LLVM IR generator (each concrete backend supplies a platform-specific implementation):
virtual std::unique_ptr<LLVMIRGen> createIRGen(const IRFunction *IR, AllocationsInfo &allocationsInfo) const = 0;

createCompiledFunction(): the interface for constructing a CompiledFunction (implemented in CPUBackend):
virtual std::unique_ptr<CompiledFunction> createCompiledFunction(std::unique_ptr<llvm::orc::GlowJIT> JIT, const runtime::RuntimeBundle &runtimeBundle) const = 0;

class CPUBackend : public LLVMBackend
Implements createIRGen() and createCompiledFunction(), i.e. the interfaces left open by LLVMBackend. A call to compile(Function *) therefore flows as:
compile(Function *)
  generateAndOptimizeIR(Function *)
  compileIR(IRFunction *)
    createIRGen()
    glow::generateRuntimeBundle()
    createCompiledFunction()
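A condensed sketch of this hierarchy, using the signatures listed above (the stub declarations stand in for the real Glow/LLVM types so the sketch compiles on its own; const qualifiers on compile()/compileIR() vary across Glow versions and are omitted here):

#include <memory>

// Stubs for the real Glow/LLVM types, assumed declared elsewhere in Glow.
class Function;
class IRFunction;
class CompiledFunction;
class LLVMIRGen;
struct AllocationsInfo;
namespace llvm { namespace orc { class GlowJIT; } }
namespace runtime { struct RuntimeBundle; }

class Backend {
public:
  virtual std::unique_ptr<CompiledFunction> compile(Function *F) = 0;
  virtual ~Backend() = default;
};

class BackendUsingGlowIR : public Backend {
public:
  virtual std::unique_ptr<CompiledFunction>
  compileIR(std::unique_ptr<IRFunction> IR) = 0;
};

class LLVMBackend : public BackendUsingGlowIR {
public:
  std::unique_ptr<CompiledFunction> compile(Function *F) override;
  std::unique_ptr<CompiledFunction>
  compileIR(std::unique_ptr<IRFunction> IR) override;
  // Left for concrete backends:
  virtual std::unique_ptr<LLVMIRGen>
  createIRGen(const IRFunction *IR,
              AllocationsInfo &allocationsInfo) const = 0;
  virtual std::unique_ptr<CompiledFunction>
  createCompiledFunction(std::unique_ptr<llvm::orc::GlowJIT> JIT,
                         const runtime::RuntimeBundle &runtimeBundle) const = 0;
};

class CPUBackend : public LLVMBackend {
public:
  std::unique_ptr<LLVMIRGen>
  createIRGen(const IRFunction *IR,
              AllocationsInfo &allocationsInfo) const override;
  std::unique_ptr<CompiledFunction>
  createCompiledFunction(std::unique_ptr<llvm::orc::GlowJIT> JIT,
                         const runtime::RuntimeBundle &runtimeBundle)
      const override;
};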


class CompiledFunction

// execute() is the most important interface. CompiledFunction itself holds no
// compiled low-level IR; derived classes supply it.
// RuntimeBundle stores the SymbolTable and the memory-size info (constants,
// placeholders, activations; each is allocated by its own Allocator, so the
// offsets are independent).
// If collectConstants is used, it also allocates the space into which the
// constants are aggregated.
class CompiledFunction {
public:
  virtual void execute(Context *ctx) = 0;
  ...
protected:
  runtime::RuntimeBundle runtimeBundle_;
  ...
};

Memory management

CPUBackend

void LLVMCompiledFunction::execute(Context *ctx) {
  uint8_t *baseActivationsAddress{nullptr};

  /// Base address for Mutable weights memory block, Inputs and Outputs.
  uint8_t *baseMutableWeightVarsAddress{nullptr};

  if (runtimeBundle_.getActivationsSize() != 0) {
    baseActivationsAddress = (uint8_t *)alignedAlloc(
        runtimeBundle_.getActivationsSize(), TensorAlignment);
  }

  if (runtimeBundle_.getMutableWeightSize() != 0) {
    baseMutableWeightVarsAddress = (uint8_t *)alignedAlloc(
        runtimeBundle_.getMutableWeightSize(), TensorAlignment);
  }

  loadPlaceholders(ctx, baseMutableWeightVarsAddress);

  auto sym = JIT_->findSymbol("jitmain");
  assert(sym && "Unable to JIT the code!");
  using JitFuncType =
      void (*)(uint8_t * constantWeightVars, uint8_t * mutableWeightVars,
               uint8_t * activations);
  auto address = sym.getAddress();
  if (address) {
    JitFuncType funcPtr = reinterpret_cast<JitFuncType>(address.get());
    funcPtr(runtimeBundle_.getConstants(), baseMutableWeightVarsAddress,
            baseActivationsAddress);
  } else {
    GLOW_UNREACHABLE("Error getting address");
  }

  updatePlaceholders(ctx, baseMutableWeightVarsAddress);

  alignedFree(baseMutableWeightVarsAddress);
  alignedFree(baseActivationsAddress);

  translateTraceEvents(ctx);
}

Deriving intermediate activation Tensor dimensions in Glow

Take Convolution as an example. In ConvolutionNode *Function::createConv(...), the output dimensions are computed and passed as a constructor argument to ConvolutionNode.

auto outSz = calculateConvPoolOutputDims(idim.h, idim.w, kernels, strides, pads);
...
auto OT = getParent()->uniqueType(ElemKind::FloatTy, outDims);
return addNode(new ConvolutionNode(name, OT, input, filter, bias, kernels,
                                   strides, pads, group));

uniqueType's return type is TypeRef, where using TypeRef = const Type *;

TypeRef Module::uniqueType(ElemKind elemTy, llvm::ArrayRef<size_t> dims);
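For intuition, a minimal sketch of the usual convolution/pooling output-size arithmetic (an assumption about what calculateConvPoolOutputDims computes; convPoolOutputDims below is hypothetical and ignores dilation and asymmetric padding):

#include <cstddef>
#include <utility>

// Standard formula: out = (in + 2 * pad - kernel) / stride + 1.
std::pair<size_t, size_t> convPoolOutputDims(size_t inH, size_t inW,
                                             size_t kernel, size_t stride,
                                             size_t pad) {
  size_t outH = (inH + 2 * pad - kernel) / stride + 1;
  size_t outW = (inW + 2 * pad - kernel) / stride + 1;
  return {outH, outW};
}

// E.g. a 28x28 mnist input with kernel 5, stride 1, pad 2 stays 28x28:
// (28 + 4 - 5) / 1 + 1 = 28.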