Skip to content

Conversation

@JonPsson1
Copy link
Contributor

A significant speedup can in cases be achieved by vectorizing the epilogue loop, which has not been done until now on SystemZ. This was inspired by a recent GCC 15% speedup of a benchmark. GCC vectorizes the epilogue loop as well with VLL/VSTL instructions (Load / Store with Length), which is not implemented by this patch. However, by simple means this enables the vectorization with a VF=4 of the epilogue loop, and it seems this is just as effective (~1% total difference remaining only).

I tried this patch as it is first, enabling minimum interleaving for VF16 as this is required currently to achieve the epilogue vectorization (see isEpilogueVectorizationProfitable()). This gave good "preliminary" results with x264 improving 14%.

In addition, I tried some variations of this which all seemed to be slightly less preferable performance wise:

  • Increase the max interleave for VF16 to 4 instead of 2.
  • Instead of increasing the max interleave, comment away the TTI.getMaxInterleaveFactor() check in isEpilogueVectorizationProfitable(), so that the side-effect of actual interleaving does not happen. Tried this with epilogue-vectorization-minimum-VF set to 16 and also with 8.

Conclusion: achieving the speedup of x264 seems to be easily achieved with this patch that simply sets the getMaxInterleaveFactor() to 2 for VF16. LV interleaving is not done currently on SystemZ - it was a while since it was tried and found non-profitable (z14?), but this small enablement seems to be for the better.

Discussion: https://discourse.llvm.org/t/loopvectorizer-epilogue-vectorization-with-length-systemz/89149

One question remains as to if it would be a good idea to try to inform the LoopUnroller when a vectorized epilogue loop has been created. In the x264 case I saw the vectorized epilogue loop with a VF=4, and then the LoopUnroller also created an unrolled loop with a factor of 4. I wonder if MetaData could be emitted, so that the UnrollingPreferences could be adjusted in these cases, maybe? OTOH, this is relatively rare - ~150 files changed on SPEC. @fhahn

@llvmbot
Copy link
Member

llvmbot commented Dec 18, 2025

@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-vectorizers
@llvm/pr-subscribers-backend-risc-v
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-systemz

Author: Jonas Paulsson (JonPsson1)

Changes

A significant speedup can in cases be achieved by vectorizing the epilogue loop, which has not been done until now on SystemZ. This was inspired by a recent GCC 15% speedup of a benchmark. GCC vectorizes the epilogue loop as well with VLL/VSTL instructions (Load / Store with Length), which is not implemented by this patch. However, by simple means this enables the vectorization with a VF=4 of the epilogue loop, and it seems this is just as effective (~1% total difference remaining only).

I tried this patch as it is first, enabling minimum interleaving for VF16 as this is required currently to achieve the epilogue vectorization (see isEpilogueVectorizationProfitable()). This gave good "preliminary" results with x264 improving 14%.

In addition, I tried some variations of this which all seemed to be slightly less preferable performance wise:

  • Increase the max interleave for VF16 to 4 instead of 2.
  • Instead of increasing the max interleave, comment away the TTI.getMaxInterleaveFactor() check in isEpilogueVectorizationProfitable(), so that the side-effect of actual interleaving does not happen. Tried this with epilogue-vectorization-minimum-VF set to 16 and also with 8.

Conclusion: achieving the speedup of x264 seems to be easily achieved with this patch that simply sets the getMaxInterleaveFactor() to 2 for VF16. LV interleaving is not done currently on SystemZ - it was a while since it was tried and found non-profitable (z14?), but this small enablement seems to be for the better.

Discussion: https://discourse.llvm.org/t/loopvectorizer-epilogue-vectorization-with-length-systemz/89149

One question remains as to if it would be a good idea to try to inform the LoopUnroller when a vectorized epilogue loop has been created. In the x264 case I saw the vectorized epilogue loop with a VF=4, and then the LoopUnroller also created an unrolled loop with a factor of 4. I wonder if MetaData could be emitted, so that the UnrollingPreferences could be adjusted in these cases, maybe? OTOH, this is relatively rare - ~150 files changed on SPEC. @fhahn


Full diff: https://github.com/llvm/llvm-project/pull/172925.diff

3 Files Affected:

  • (modified) llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp (+10)
  • (modified) llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h (+2)
  • (added) llvm/test/Transforms/LoopVectorize/SystemZ/vectorized-epilogue-loop.ll (+28)
diff --git a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
index 2611c291abaa6..3b6f8ce08fd73 100644
--- a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
@@ -478,6 +478,16 @@ unsigned SystemZTTIImpl::getMinPrefetchStride(unsigned NumMemAccesses,
   return ST->hasMiscellaneousExtensions3() ? 8192 : 2048;
 }
 
+unsigned SystemZTTIImpl::getMaxInterleaveFactor(ElementCount VF) const {
+  // Enable vectorization of (LoopVectorizer) epilogue loop after VF 16 main
+  // loop. This can be highly beneficial when the original loop handles bytes
+  // (i8) and most of the time is not spent in the main vectorized loop
+  // (x264/mc_chroma).
+  if (VF == ElementCount::getFixed(16))
+    return 2;
+  return 1;
+}
+
 bool SystemZTTIImpl::hasDivRemOp(Type *DataType, bool IsSigned) const {
   EVT VT = TLI->getValueType(DL, DataType);
   return (VT.isScalarInteger() && TLI->isTypeLegal(VT));
diff --git a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
index fc681dec1859a..929405e946c55 100644
--- a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
+++ b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
@@ -82,6 +82,8 @@ class SystemZTTIImpl final : public BasicTTIImplBase<SystemZTTIImpl> {
                                 bool HasCall) const override;
   bool enableWritePrefetching() const override { return true; }
 
+  unsigned getMaxInterleaveFactor(ElementCount VF) const override;
+
   bool hasDivRemOp(Type *DataType, bool IsSigned) const override;
   bool prefersVectorizedAddressing() const override { return false; }
   bool LSRWithInstrQueries() const override { return true; }
diff --git a/llvm/test/Transforms/LoopVectorize/SystemZ/vectorized-epilogue-loop.ll b/llvm/test/Transforms/LoopVectorize/SystemZ/vectorized-epilogue-loop.ll
new file mode 100644
index 0000000000000..654d636045f76
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/SystemZ/vectorized-epilogue-loop.ll
@@ -0,0 +1,28 @@
+; RUN: opt -S -mtriple=s390x-unknown-linux -mcpu=z16 -passes=loop-vectorize < %s \
+; RUN:   | FileCheck %s
+;
+; Test that loop vectorizer generates a vectorized epilogue loop after a VF16
+; vectorization.
+
+define void @fun(ptr %Src, ptr %Dst, i64 %wide.trip.count) {
+; CHECK-LABEL: @fun(
+; CHECK-LABEL: vector.body:
+; CHECK: %wide.load = load <16 x i8>
+; CHECK-LABEL: vec.epilog.vector.body:
+; CHECK: %wide.load8 = load <4 x i8>
+entry:
+  br label %for.body
+
+for.body:
+  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+  %indvars.iv.next = add i64 %indvars.iv, 1
+  %arrayidx0 = getelementptr i8, ptr %Src, i64 %indvars.iv
+  %0 = load i8, ptr %arrayidx0
+  %arrayidx1 = getelementptr i8, ptr %Dst, i64 %indvars.iv
+  store i8 %0, ptr %arrayidx1
+  %exitcond.not = icmp eq i64 %indvars.iv, %wide.trip.count
+  br i1 %exitcond.not, label %exit, label %for.body
+
+exit:
+  ret void
+}

@uweigand
Copy link
Member

I don't think we should make the choice of what to return from getMaxInterleaveFactor just so we this particular side effect in one specific test case. Rather, we should always return the "correct" result from getMaxInterleaveFactor and then try to fix any fallout where this has undesired effects.

If I understand the semantics correctly, the intent is for getMaxInterleaveFactor to return the maximum number of parallel executions of vector operations (or scalar operations if VF == 1), i.e. the number of vector units the processor has. This would simply always be 2 on all current IBM Z hardware.

We currently do not provide that routine at all, which is equivalent to always returning 1 - that seems wrong. On the other hand, returning 2 only for VF == 16 doesn't make much sense to me; the number of vector units is always 2, no matter whether we operate on vectors of bytes, halfwords, words, doublewords ...

What happens if we just always return 2 here? ("always" meaning !VF.isScalar() or something.)

OTOH it is also not immediately obvious to me what the point of this check is in the first place:

  // We also consider epilogue vectorization unprofitable for targets that don't
  // consider interleaving beneficial (eg. MVE).
  if (TTI.getMaxInterleaveFactor(VF) <= 1)
    return false;

Why should the epilogue vectorization question be tied to the interleave factor at all? In particular given that if a target wants to opt out of epilogue vectorization, it can use the preferEpilogueVectorization hook anyway ...

…by default.

Return true for SystemZ.
Leave SystemZ getMaxInterleaveFactor() to return 1 always for now.
@llvmbot llvmbot added backend:RISC-V vectorizers llvm:analysis Includes value tracking, cost tables and constant folding labels Dec 19, 2025
@JonPsson1
Copy link
Contributor Author

I don't think we should make the choice of what to return from getMaxInterleaveFactor just so we this particular side effect in one specific test case. Rather, we should always return the "correct" result from getMaxInterleaveFactor and then try to fix any fallout where this has undesired effects.

I agree - this was my impression as well. There were some things that made my first try still reasonable I thought. Note for instance that getEpilogueVectorizationMinVF() is also 16 by default. And it was only changing a few files, and for the better if anything it seemed. Nevertheless, I have now refactored that a bit so that we can treat epilogue vectorization and interleaving separately. My guess is that there was some concern for code size and that if interleaving was disabled, then probably an extra vectorized epilogue should be avoided.

If I understand the semantics correctly, the intent is for getMaxInterleaveFactor to return the maximum number of parallel executions of vector operations (or scalar operations if VF == 1), i.e. the number of vector units the processor has. This would simply always be 2 on all current IBM Z hardware.

I am not quite sure about that comment - IIRC the interleaving is originally about being able to vectorize loops with strided memory accesses containing gaps - if the loop is "unrolled" it can be vectorized after some shuffling of inter-iteration memory operands. I guess it also relates a bit to the number of vector units, although a vectorized loop can contain many operations to be done already without interleaving/unrolling.

We currently do not provide that routine at all, which is equivalent to always returning 1 - that seems wrong. On the other hand, returning 2 only for VF == 16 doesn't make much sense to me; the number of vector units is always 2, no matter whether we operate on vectors of bytes, halfwords, words, doublewords ...

What happens if we just always return 2 here? ("always" meaning !VF.isScalar() or something.)

I will run it and see. I am guessing this enables interleaving generally with some mixed results.

OTOH it is also not immediately obvious to me what the point of this check is in the first place:

  // We also consider epilogue vectorization unprofitable for targets that don't
  // consider interleaving beneficial (eg. MVE).
  if (TTI.getMaxInterleaveFactor(VF) <= 1)
    return false;

Why should the epilogue vectorization question be tied to the interleave factor at all? In particular given that if a target wants to opt out of epilogue vectorization, it can use the preferEpilogueVectorization hook anyway ...

Yes - refactored per above.

If the interleaving is about being able to vectorize loops after unrolling them, I thought (right or wrong) that the most benefit could come with VF=16, because that would create many more elements to be interleaved, compared to interleaving a VF=2 loop. And by also setting the max interleave for VF=16 to 2, that would be the least amount of interleaving in the most beneficial case. So I tried that and compared to not doing interleaving, but still enabling the vec-epilogues, and found that the interleaving was not only ok, but perhaps even beneficial. This was however a small difference on a "preliminary" run, so it makes most sense at this point to evaluate these two things separately and properly. The patch as it is now enables the vectorized epilogues for SystemZ and leaves the tuning of interleaving as a "TODO".

@JonPsson1
Copy link
Contributor Author

@fhahn Does the refactoring look ok to you?

@JonPsson1
Copy link
Contributor Author

The ("preliminary") results of trying "2 if not scalar" as interleave factor: First try against main showed good results although 2 slight regressions as well. I also tried this on top of the upcoming mischeduler strategy. The imagick improvement seems to be achieved by either of these two patches, or with both of them - no change this time. It seems the xalancbmk improvement might be real. The regressions were slightly less this time.

main <> + getMaxInterleaveFactor() "2 if not scalar."

Improvements
0.837: f538.imagick_r
0.870: i525.x264_r
0.970: i523.xalancbmk_r
0.989: i520.omnetpp_r
0.989: i557.xz_r
0.991: f510.parest_r

Regressions 2017_C_MaxInterleaveFac_2_AllVFs:
1.025: f511.povray_r
1.010: i500.perlbench_r
misched patch <> + getMaxInterleaveFactor() "2 if not scalar."

Improvements 2017_E_misched_final_MaxInterleaveFac_2_AllVFs:
0.860: i525.x264_r
0.968: i523.xalancbmk_r
0.986: i505.mcf_r
0.991: f510.parest_r

Regressions 2017_E_misched_final_MaxInterleaveFac_2_AllVFs:
1.017: f511.povray_r

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend:RISC-V backend:SystemZ llvm:analysis Includes value tracking, cost tables and constant folding llvm:transforms vectorizers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants