-
Notifications
You must be signed in to change notification settings - Fork 15.6k
[SystemZ, LoopVectorizer] Enable vectorization of epilogue loops after VF16. #172925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-backend-systemz Author: Jonas Paulsson (JonPsson1) ChangesA significant speedup can in cases be achieved by vectorizing the epilogue loop, which has not been done until now on SystemZ. This was inspired by a recent GCC 15% speedup of a benchmark. GCC vectorizes the epilogue loop as well with VLL/VSTL instructions (Load / Store with Length), which is not implemented by this patch. However, by simple means this enables the vectorization with a VF=4 of the epilogue loop, and it seems this is just as effective (~1% total difference remaining only). I tried this patch as it is first, enabling minimum interleaving for VF16 as this is required currently to achieve the epilogue vectorization (see isEpilogueVectorizationProfitable()). This gave good "preliminary" results with x264 improving 14%. In addition, I tried some variations of this which all seemed to be slightly less preferable performance wise:
Conclusion: achieving the speedup of x264 seems to be easily achieved with this patch that simply sets the getMaxInterleaveFactor() to 2 for VF16. LV interleaving is not done currently on SystemZ - it was a while since it was tried and found non-profitable (z14?), but this small enablement seems to be for the better. Discussion: https://discourse.llvm.org/t/loopvectorizer-epilogue-vectorization-with-length-systemz/89149 One question remains as to if it would be a good idea to try to inform the LoopUnroller when a vectorized epilogue loop has been created. In the x264 case I saw the vectorized epilogue loop with a VF=4, and then the LoopUnroller also created an unrolled loop with a factor of 4. I wonder if MetaData could be emitted, so that the UnrollingPreferences could be adjusted in these cases, maybe? OTOH, this is relatively rare - ~150 files changed on SPEC. @fhahn Full diff: https://github.com/llvm/llvm-project/pull/172925.diff 3 Files Affected:
diff --git a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
index 2611c291abaa6..3b6f8ce08fd73 100644
--- a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
@@ -478,6 +478,16 @@ unsigned SystemZTTIImpl::getMinPrefetchStride(unsigned NumMemAccesses,
return ST->hasMiscellaneousExtensions3() ? 8192 : 2048;
}
+unsigned SystemZTTIImpl::getMaxInterleaveFactor(ElementCount VF) const {
+ // Enable vectorization of (LoopVectorizer) epilogue loop after VF 16 main
+ // loop. This can be highly beneficial when the original loop handles bytes
+ // (i8) and most of the time is not spent in the main vectorized loop
+ // (x264/mc_chroma).
+ if (VF == ElementCount::getFixed(16))
+ return 2;
+ return 1;
+}
+
bool SystemZTTIImpl::hasDivRemOp(Type *DataType, bool IsSigned) const {
EVT VT = TLI->getValueType(DL, DataType);
return (VT.isScalarInteger() && TLI->isTypeLegal(VT));
diff --git a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
index fc681dec1859a..929405e946c55 100644
--- a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
+++ b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
@@ -82,6 +82,8 @@ class SystemZTTIImpl final : public BasicTTIImplBase<SystemZTTIImpl> {
bool HasCall) const override;
bool enableWritePrefetching() const override { return true; }
+ unsigned getMaxInterleaveFactor(ElementCount VF) const override;
+
bool hasDivRemOp(Type *DataType, bool IsSigned) const override;
bool prefersVectorizedAddressing() const override { return false; }
bool LSRWithInstrQueries() const override { return true; }
diff --git a/llvm/test/Transforms/LoopVectorize/SystemZ/vectorized-epilogue-loop.ll b/llvm/test/Transforms/LoopVectorize/SystemZ/vectorized-epilogue-loop.ll
new file mode 100644
index 0000000000000..654d636045f76
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/SystemZ/vectorized-epilogue-loop.ll
@@ -0,0 +1,28 @@
+; RUN: opt -S -mtriple=s390x-unknown-linux -mcpu=z16 -passes=loop-vectorize < %s \
+; RUN: | FileCheck %s
+;
+; Test that loop vectorizer generates a vectorized epilogue loop after a VF16
+; vectorization.
+
+define void @fun(ptr %Src, ptr %Dst, i64 %wide.trip.count) {
+; CHECK-LABEL: @fun(
+; CHECK-LABEL: vector.body:
+; CHECK: %wide.load = load <16 x i8>
+; CHECK-LABEL: vec.epilog.vector.body:
+; CHECK: %wide.load8 = load <4 x i8>
+entry:
+ br label %for.body
+
+for.body:
+ %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+ %indvars.iv.next = add i64 %indvars.iv, 1
+ %arrayidx0 = getelementptr i8, ptr %Src, i64 %indvars.iv
+ %0 = load i8, ptr %arrayidx0
+ %arrayidx1 = getelementptr i8, ptr %Dst, i64 %indvars.iv
+ store i8 %0, ptr %arrayidx1
+ %exitcond.not = icmp eq i64 %indvars.iv, %wide.trip.count
+ br i1 %exitcond.not, label %exit, label %for.body
+
+exit:
+ ret void
+}
|
|
I don't think we should make the choice of what to return from If I understand the semantics correctly, the intent is for We currently do not provide that routine at all, which is equivalent to always returning 1 - that seems wrong. On the other hand, returning 2 only for VF == 16 doesn't make much sense to me; the number of vector units is always 2, no matter whether we operate on vectors of bytes, halfwords, words, doublewords ... What happens if we just always return 2 here? ("always" meaning OTOH it is also not immediately obvious to me what the point of this check is in the first place: Why should the epilogue vectorization question be tied to the interleave factor at all? In particular given that if a target wants to opt out of epilogue vectorization, it can use the |
…by default. Return true for SystemZ. Leave SystemZ getMaxInterleaveFactor() to return 1 always for now.
I agree - this was my impression as well. There were some things that made my first try still reasonable I thought. Note for instance that getEpilogueVectorizationMinVF() is also 16 by default. And it was only changing a few files, and for the better if anything it seemed. Nevertheless, I have now refactored that a bit so that we can treat epilogue vectorization and interleaving separately. My guess is that there was some concern for code size and that if interleaving was disabled, then probably an extra vectorized epilogue should be avoided.
I am not quite sure about that comment - IIRC the interleaving is originally about being able to vectorize loops with strided memory accesses containing gaps - if the loop is "unrolled" it can be vectorized after some shuffling of inter-iteration memory operands. I guess it also relates a bit to the number of vector units, although a vectorized loop can contain many operations to be done already without interleaving/unrolling.
I will run it and see. I am guessing this enables interleaving generally with some mixed results.
Yes - refactored per above. If the interleaving is about being able to vectorize loops after unrolling them, I thought (right or wrong) that the most benefit could come with VF=16, because that would create many more elements to be interleaved, compared to interleaving a VF=2 loop. And by also setting the max interleave for VF=16 to 2, that would be the least amount of interleaving in the most beneficial case. So I tried that and compared to not doing interleaving, but still enabling the vec-epilogues, and found that the interleaving was not only ok, but perhaps even beneficial. This was however a small difference on a "preliminary" run, so it makes most sense at this point to evaluate these two things separately and properly. The patch as it is now enables the vectorized epilogues for SystemZ and leaves the tuning of interleaving as a "TODO". |
|
@fhahn Does the refactoring look ok to you? |
|
The ("preliminary") results of trying "2 if not scalar" as interleave factor: First try against main showed good results although 2 slight regressions as well. I also tried this on top of the upcoming mischeduler strategy. The imagick improvement seems to be achieved by either of these two patches, or with both of them - no change this time. It seems the xalancbmk improvement might be real. The regressions were slightly less this time. |
A significant speedup can in cases be achieved by vectorizing the epilogue loop, which has not been done until now on SystemZ. This was inspired by a recent GCC 15% speedup of a benchmark. GCC vectorizes the epilogue loop as well with VLL/VSTL instructions (Load / Store with Length), which is not implemented by this patch. However, by simple means this enables the vectorization with a VF=4 of the epilogue loop, and it seems this is just as effective (~1% total difference remaining only).
I tried this patch as it is first, enabling minimum interleaving for VF16 as this is required currently to achieve the epilogue vectorization (see isEpilogueVectorizationProfitable()). This gave good "preliminary" results with x264 improving 14%.
In addition, I tried some variations of this which all seemed to be slightly less preferable performance wise:
Conclusion: achieving the speedup of x264 seems to be easily achieved with this patch that simply sets the getMaxInterleaveFactor() to 2 for VF16. LV interleaving is not done currently on SystemZ - it was a while since it was tried and found non-profitable (z14?), but this small enablement seems to be for the better.
Discussion: https://discourse.llvm.org/t/loopvectorizer-epilogue-vectorization-with-length-systemz/89149
One question remains as to if it would be a good idea to try to inform the LoopUnroller when a vectorized epilogue loop has been created. In the x264 case I saw the vectorized epilogue loop with a VF=4, and then the LoopUnroller also created an unrolled loop with a factor of 4. I wonder if MetaData could be emitted, so that the UnrollingPreferences could be adjusted in these cases, maybe? OTOH, this is relatively rare - ~150 files changed on SPEC. @fhahn