Performance degradation after ggml sync  #1273

@bobqianic

Not only did we have issues with the benchmark, but I also observed a notable drop in ggml CPU performance after the sync.

i7-12700H, ggml-model-whisper-base.bin, OpenBLAS=1, encode time (ms/run)

2f52783    Master
726.22     825.36
745.15     788.72
763.46     789.90
771.82     787.58
757.62     845.77
797.99     830.51
702.00     825.29
722.43     808.82
760.68     803.65
793.12     824.25
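
To put a single number on the encode-time regression, here is a minimal sketch that averages the ten runs per commit and reports the relative slowdown. The variable names are my own, and the values are copied from the table above; nothing here is part of whisper.cpp itself.

```python
from statistics import mean

# Encode times in ms/run, copied from the table above (i7-12700H, base model).
old_ms = [726.22, 745.15, 763.46, 771.82, 757.62,
          797.99, 702.00, 722.43, 760.68, 793.12]   # 2f52783
new_ms = [825.36, 788.72, 789.90, 787.58, 845.77,
          830.51, 825.29, 808.82, 803.65, 824.25]   # master

old_mean = mean(old_ms)
new_mean = mean(new_ms)

# Relative slowdown of master versus 2f52783, in percent.
slowdown_pct = (new_mean / old_mean - 1.0) * 100.0

print(f"2f52783 mean: {old_mean:.2f} ms/run")
print(f"master  mean: {new_mean:.2f} ms/run")
print(f"slowdown:     {slowdown_pct:.1f}%")
```

On these samples master comes out roughly 7.8% slower on average. Ten runs per side is a small sample, so treat the figure as indicative rather than exact.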

Master i7-12700H OpenBLAS=1 -t 4

  64 x   64: Q4_0     2.7 GFLOPS (128 runs) | Q4_1     3.6 GFLOPS (128 runs)
  64 x   64: Q5_0     3.6 GFLOPS (128 runs) | Q5_1     3.5 GFLOPS (128 runs) | Q8_0     3.5 GFLOPS (128 runs)
  64 x   64: F16      3.5 GFLOPS (128 runs) | F32      3.5 GFLOPS (128 runs)
 128 x  128: Q4_0     7.6 GFLOPS (128 runs) | Q4_1    13.5 GFLOPS (128 runs)
 128 x  128: Q5_0    13.0 GFLOPS (128 runs) | Q5_1    12.3 GFLOPS (128 runs) | Q8_0    12.9 GFLOPS (128 runs)
 128 x  128: F16     19.6 GFLOPS (128 runs) | F32     18.4 GFLOPS (128 runs)
 256 x  256: Q4_0    48.3 GFLOPS (128 runs) | Q4_1    40.6 GFLOPS (128 runs)
 256 x  256: Q5_0    49.4 GFLOPS (128 runs) | Q5_1    39.5 GFLOPS (128 runs) | Q8_0    16.0 GFLOPS (128 runs)
 256 x  256: F16     57.4 GFLOPS (128 runs) | F32     44.6 GFLOPS (128 runs)
 512 x  512: Q4_0    97.1 GFLOPS (128 runs) | Q4_1   114.9 GFLOPS (128 runs)
 512 x  512: Q5_0   104.6 GFLOPS (128 runs) | Q5_1   113.3 GFLOPS (128 runs) | Q8_0    72.7 GFLOPS (128 runs)
 512 x  512: F16    129.7 GFLOPS (128 runs) | F32    105.8 GFLOPS (128 runs)
1024 x 1024: Q4_0   152.5 GFLOPS ( 72 runs) | Q4_1   161.2 GFLOPS ( 76 runs)
1024 x 1024: Q5_0   150.1 GFLOPS ( 70 runs) | Q5_1   157.9 GFLOPS ( 74 runs) | Q8_0   144.5 GFLOPS ( 68 runs)
1024 x 1024: F16    168.0 GFLOPS ( 79 runs) | F32    190.4 GFLOPS ( 89 runs)
2048 x 2048: Q4_0   211.2 GFLOPS ( 13 runs) | Q4_1   232.3 GFLOPS ( 14 runs)
2048 x 2048: Q5_0   210.7 GFLOPS ( 13 runs) | Q5_1   230.4 GFLOPS ( 14 runs) | Q8_0   224.5 GFLOPS ( 14 runs)
2048 x 2048: F16    231.2 GFLOPS ( 14 runs) | F32    238.1 GFLOPS ( 15 runs)
4096 x 4096: Q4_0   328.0 GFLOPS (  3 runs) | Q4_1   305.7 GFLOPS (  3 runs)
4096 x 4096: Q5_0   295.3 GFLOPS (  3 runs) | Q5_1   305.8 GFLOPS (  3 runs) | Q8_0   292.8 GFLOPS (  3 runs)
4096 x 4096: F16    308.7 GFLOPS (  3 runs) | F32    299.2 GFLOPS (  3 runs)

2f52783 i7-12700H OpenBLAS=1 -t 4

  64 x   64: Q5_0     3.9 GFLOPS (128 runs) | Q5_1     3.7 GFLOPS (128 runs) | Q8_0     3.7 GFLOPS (128 runs)
  64 x   64: F16      3.5 GFLOPS (128 runs) | F32      2.8 GFLOPS (128 runs)
 128 x  128: Q4_0    19.8 GFLOPS (128 runs) | Q4_1    20.3 GFLOPS (128 runs)
 128 x  128: Q5_0    19.8 GFLOPS (128 runs) | Q5_1    19.2 GFLOPS (128 runs) | Q8_0    19.4 GFLOPS (128 runs)
 128 x  128: F16     22.0 GFLOPS (128 runs) | F32     21.5 GFLOPS (128 runs)
 256 x  256: Q4_0   106.8 GFLOPS (128 runs) | Q4_1   103.8 GFLOPS (128 runs)
 256 x  256: Q5_0   102.0 GFLOPS (128 runs) | Q5_1   100.6 GFLOPS (128 runs) | Q8_0   107.6 GFLOPS (128 runs)
 256 x  256: F16    115.7 GFLOPS (128 runs) | F32     85.3 GFLOPS (128 runs)
 512 x  512: Q4_0   137.3 GFLOPS (128 runs) | Q4_1   143.4 GFLOPS (128 runs)
 512 x  512: Q5_0   133.7 GFLOPS (128 runs) | Q5_1   132.4 GFLOPS (128 runs) | Q8_0   109.1 GFLOPS (128 runs)
 512 x  512: F16    138.5 GFLOPS (128 runs) | F32    101.9 GFLOPS (128 runs)
1024 x 1024: Q4_0   201.7 GFLOPS ( 94 runs) | Q4_1   194.3 GFLOPS ( 91 runs)
1024 x 1024: Q5_0   172.8 GFLOPS ( 81 runs) | Q5_1   176.0 GFLOPS ( 83 runs) | Q8_0   167.9 GFLOPS ( 79 runs)
1024 x 1024: F16    189.0 GFLOPS ( 89 runs) | F32    142.1 GFLOPS ( 67 runs)
2048 x 2048: Q4_0   316.3 GFLOPS ( 19 runs) | Q4_1   320.2 GFLOPS ( 19 runs)
2048 x 2048: Q5_0   303.9 GFLOPS ( 18 runs) | Q5_1   299.2 GFLOPS ( 18 runs) | Q8_0   303.3 GFLOPS ( 18 runs)
2048 x 2048: F16    297.9 GFLOPS ( 18 runs) | F32    240.5 GFLOPS ( 14 runs)
4096 x 4096: Q4_0   368.8 GFLOPS (  3 runs) | Q4_1   364.6 GFLOPS (  3 runs)
4096 x 4096: Q5_0   391.0 GFLOPS (  3 runs) | Q5_1   341.6 GFLOPS (  3 runs) | Q8_0   372.5 GFLOPS (  3 runs)
4096 x 4096: F16    344.3 GFLOPS (  3 runs) | F32    345.3 GFLOPS (  3 runs)
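
For anyone who wants to compare the two bench dumps line by line, the following is a rough sketch of a parser for the output format shown above. `parse_bench` and the regular expression are hypothetical helpers written for this comment, not part of the bench tool; only the 256 x 256 rows are pasted in to keep the example short.

```python
import re

# Matches one "TYPE  <value> GFLOPS" entry on a bench output line.
ENTRY = re.compile(r"(\w+)\s+([\d.]+) GFLOPS")

def parse_bench(text):
    """Return {(size, type): gflops} parsed from pasted bench output."""
    results = {}
    for line in text.splitlines():
        size, sep, rest = line.partition(":")
        if not sep:
            continue  # skip blank lines and headers
        size = " ".join(size.split())  # normalise " 256 x  256" -> "256 x 256"
        for dtype, gflops in ENTRY.findall(rest):
            results[(size, dtype)] = float(gflops)
    return results

# Rows copied from the two dumps above.
master = parse_bench("""
 256 x  256: Q5_0    49.4 GFLOPS (128 runs) | Q5_1    39.5 GFLOPS (128 runs) | Q8_0    16.0 GFLOPS (128 runs)
 256 x  256: F16     57.4 GFLOPS (128 runs) | F32     44.6 GFLOPS (128 runs)
""")
old = parse_bench("""
 256 x  256: Q5_0   102.0 GFLOPS (128 runs) | Q5_1   100.6 GFLOPS (128 runs) | Q8_0   107.6 GFLOPS (128 runs)
 256 x  256: F16    115.7 GFLOPS (128 runs) | F32     85.3 GFLOPS (128 runs)
""")

# Ratio below 1.0 means master is slower than 2f52783 for that size and type.
for key in sorted(master.keys() & old.keys()):
    ratio = master[key] / old[key]
    print(f"{key[0]:>11} {key[1]:<5} master/2f52783 = {ratio:.2f}")
```

Pasting the full dumps into the two strings gives the ratio for every size and type present in both. On the rows shown here, Q8_0 at 256 x 256 stands out, with master at about 0.15 of the 2f52783 throughput.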

Labels: bug, performance
