2022Tencent Rhino-bird Open-source Training Program—Angel-Zihan Li-Week5&6

---
Angel project: progress for weeks 5 and 6
---
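**Current progress**:

1. Overall:
    - The local job now largely runs end to end and produces an embedding for every vertex
    - With the requirements of the distributed job in mind, some modules have already been refactored and integrated
    - PCA (principal component analysis) was applied to reduce the dimensionality of the vertex embeddings, giving a first visualization

2. Details:
    - Implemented the construction of the context graph and the bias-random-walk sampling that generates the context paths
    - Integrated each algorithm module into the Spark programming framework and tested the interfaces between them
    - Fed in Zachary’s Karate network to obtain an embedding for every vertex, compressed each embedding to a two-dimensional vector with PCA, and visualized the resulting 2-D vectors with Origin2021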
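
As a rough illustration of the sampling step, the sketch below draws one context path on a single layer, choosing each next vertex with probability proportional to an edge weight. It is not the project code: `WalkSketch`, `pickWeighted`, `samplePath` and the adjacency-map layout are names assumed for this example, the walk length is taken to mean the number of vertices in the path, and cross-layer moves are left out.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Hypothetical helper names; a single-layer sketch, not the project's implementation.
object WalkSketch {
  // Weighted adjacency: vertex -> (neighbor, non-negative weight) pairs.
  type Adjacency = Map[Int, Seq[(Int, Double)]]

  // Pick one neighbor with probability proportional to its weight.
  // Returns None when there is no neighbor or the weights are unusable (zero or NaN sum).
  def pickWeighted(neighbors: Seq[(Int, Double)], rng: Random): Option[Int] = {
    val total = neighbors.map(_._2).sum
    if (neighbors.isEmpty || total.isNaN || total <= 0.0) None
    else {
      val target = rng.nextDouble() * total
      // Running sums of the weights; the first sum above `target` selects the neighbor.
      val cumulative = neighbors.scanLeft(0.0)(_ + _._2).tail
      val idx = cumulative.indexWhere(_ > target)
      Some(neighbors(if (idx >= 0) idx else neighbors.length - 1)._1)
    }
  }

  // Walk until `walkLength` vertices are collected or the current vertex has no usable neighbor.
  def samplePath(graph: Adjacency, start: Int, walkLength: Int, rng: Random): Seq[Int] = {
    val path = ArrayBuffer(start)
    var current = start
    var alive = true
    while (alive && path.length < walkLength) {
      pickWeighted(graph.getOrElse(current, Seq.empty), rng) match {
        case Some(next) =>
          path += next
          current = next
        case None =>
          alive = false
      }
    }
    path.toSeq
  }
}
```

A call such as `WalkSketch.samplePath(graph, start = 1, walkLength = 15, new Random(42))` returns at most 15 vertex ids; a fixed seed keeps the sampled path reproducible while debugging.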


**Tests of the current code and their results**:

1. Input:
    - Zachary’s Karate network (34 nodes, 78 undirected edges)
    - walk length: 15
    - epochNum (number of training epochs): 5
    - vector size (word2vec): 10
    - window size (word2vec): 3
    - stay (probability of staying in the current layer): 0.5

2. Actual output:
- Sampled paths produced by the run
![res1](https://user-images.githubusercontent.com/90893013/189535524-b7ec7f37-0a5e-4a3f-bd02-9031bac50146.png)
- Vertex embeddings generated by word2vec from the context paths
![res3](https://user-images.githubusercontent.com/90893013/189535642-30b2d415-1149-489f-b27a-24cbd522b0d5.png)
![res5](https://user-images.githubusercontent.com/90893013/189536055-518aa0fe-c16c-40af-948a-1cd4e13d16f8.png)
- Dimensionality reduction of the embeddings with PCA
![res4](https://user-images.githubusercontent.com/90893013/189535984-70ad532d-168e-4b6d-8a2a-6da58a4f95c7.png)
- The run completed successfully
![SUCCESS](https://user-images.githubusercontent.com/90893013/189536157-43063d8f-4c88-447a-9dde-00b5a1f6006f.png)
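
The report does not say which PCA implementation was used. As one possible way to do the reduction step inside Spark, the sketch below assumes Spark ML's `PCA` transformer; the object name `PcaSketch`, the column names `embedding` and `embedding2d`, and the helper `reduce` are made up for this example.

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.sql.DataFrame

// Assumption: embeddings are stored as Spark ML vectors in a column named "embedding".
object PcaSketch {
  // Fit PCA on the embedding column and append a k-dimensional projection column.
  def reduce(embeddings: DataFrame, k: Int): DataFrame = {
    val model = new PCA()
      .setInputCol("embedding")
      .setOutputCol("embedding2d")
      .setK(k)
      .fit(embeddings)
    model.transform(embeddings)
  }
}
```

Calling `PcaSketch.reduce(df, 2)` on the 10-dimensional vectors would yield one two-dimensional point per vertex, which can then be exported for plotting.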





3. Visualization:
- Data entered into Origin
![origin table](https://user-images.githubusercontent.com/90893013/189536240-50875c36-9450-4def-ae5b-177ca16c3d24.png)
- Rendered as a scatter plot
![visualization](https://user-images.githubusercontent.com/90893013/189536253-1a62b7b5-ccb1-4e09-a472-eebc6caaf671.png)


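**Problems encountered**:
- Q1: When printing the sampled paths, either the same vertex was sampled several times in a row, or the job failed with a runtime error (array index out of bounds)
- S1: Debugging revealed two problems:
    1. In Scala, Double.NaN is not equal to any number, including itself, so the check must use `.isNaN` instead of an equality comparison
    2. When building the cross-layer jumps, I overlooked that a vertex's counterpart in the next-higher layer may have an empty neighbor set. The logic was changed as follows:
        - When building the multi-layer network, elements with an empty neighbor set are filtered out directly in the RDD
        - When deciding on a cross-layer move, first check whether the vertex has a counterpart in the upper layer; if it does not, the walk can only move down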
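
The two fixes can be sketched in plain Scala as follows. All names (`LayerStepSketch`, `usableWeight`, `nextMove`) are hypothetical, and the even split between moving up and down is an assumption made only to keep the example short; the project may weight the two directions differently.

```scala
import scala.util.Random

// Hypothetical names; a sketch of the two fixes, not the project code.
object LayerStepSketch {
  sealed trait Move
  case object Stay extends Move
  case object Up extends Move
  case object Down extends Move

  // `w == Double.NaN` is always false, so NaN has to be detected with isNaN.
  def usableWeight(w: Double): Boolean = !w.isNaN && w > 0.0

  // Stay in the current layer with probability `stay`; otherwise change layer,
  // moving up only when the vertex has a counterpart in the upper layer.
  def nextMove(stay: Double, hasUpper: Boolean, hasLower: Boolean, rng: Random): Move =
    if (rng.nextDouble() < stay) Stay
    else if (hasUpper && hasLower) { if (rng.nextBoolean()) Up else Down }
    else if (hasUpper) Up
    else if (hasLower) Down
    else Stay // no other layer is reachable (assumed fallback)
}
```

With `hasUpper = false` the function never returns `Up`, which mirrors the rule that a vertex without an upper-layer counterpart may only cross downward.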

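- Q2: When calling algorithms from Spark MLlib, the column types of the DataFrame did not match what the interface requires (for example, word2vec requires the elements of its input column to be of type Array[String])
- S2: Change the corresponding types in the overridden `transformSchema`, for example: `override def transformSchema(schema: StructType): StructType = {
    StructType(Seq(StructField("src", IntegerType, nullable = false), StructField("epochNum", IntegerType, nullable = false), StructField("path", ArrayType(StringType), nullable = false)))
  }`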
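
To make the type requirement concrete, here is a minimal sketch that turns integer vertex paths into an array-of-strings column and passes it to Spark ML's `Word2Vec` with the vector size and window size listed in the test input. The object name `Word2VecSketch`, the helpers `toTokens` and `embed`, and the `minCount` of 0 are assumptions for this example, not the project's actual settings.

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper; shows only the column-type conversion and the Word2Vec call.
object Word2VecSketch {
  // Vertex ids become strings because Word2Vec expects an array<string> input column.
  def toTokens(path: Seq[Int]): Seq[String] = path.map(_.toString)

  // Train on the given paths and return one row per vertex (columns: word, vector).
  def embed(spark: SparkSession, paths: Seq[Seq[Int]]): DataFrame = {
    import spark.implicits._
    val df = paths.map(toTokens).toDF("path") // column type: array<string>
    val model = new Word2Vec()
      .setInputCol("path")
      .setOutputCol("embedding")
      .setVectorSize(10)
      .setWindowSize(3)
      .setMinCount(0) // assumed: keep every vertex, even rarely sampled ones
      .fit(df)
    model.getVectors
  }
}
```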

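- Q3: How to split up and encapsulate each round of the algorithm so that it can eventually run in a distributed fashion
- S3: Study “DeepWalkPartition” and "DeepWalkPSModel" in the source code in more detail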



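**Future work**:
- Item1: Continue implementing the distributed version of the algorithm, in particular “Struc2VecPartition” and “Struc2VecPSModel”
- Item2: Tune the algorithm's hyperparameters to improve the results
- Item3: Once the overall implementation is complete, refine the implementation details against the paper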




