only for those who are interested, I believe feroxx wanted to read a G5 test
(no brummelchen, this time too: no inferiority complexes on my part, even though I don't have a NASA )



An Evaluation of PowerMac G5 Systems for Computational Fluid Dynamics Applications
Part I: Preliminary Testing with Jet3D

Craig A. Hunter
NASA Langley Research Center
Configuration Aerodynamics Branch
Hampton, Virginia

July 2003

Evaluations and comparisons within this paper are for technical government purposes only
and do not constitute an endorsement of any system by NASA or the U.S. Government.



Introduction

This paper describes testing conducted by NASA Langley Research Center during an evaluation of a PowerMac G5 system for computational fluid dynamics (CFD) simulation. Part I of this two-part document covers preliminary evaluations. In this phase of testing, an existing version of the NASA Jet3D [1] code (developed and compiled on G4 systems) was run on a G5 system without recompilation or additional optimization; since G5-specific FORTRAN compiler tools have not yet been released, this approach is both appropriate and unavoidable. In Part II of this document, G5-specific testing will be conducted as revised compiler tools become available. In addition, testing will be expanded to include other NASA CFD codes.

The primary purpose of this test is to determine how G5 performance compares to G4 performance in CFD applications. Earlier work [2] showed that G4 systems performed well in vector computations but fared quite poorly in general scalar floating point computations. Since CFD simulations depend heavily on basic floating point performance, this is a critical area of evaluation. As a secondary part of this test, G4 and G5 benchmark results are compared to similar results obtained on Pentium 4 systems. Jet3D was compiled and optimized for the best performance on each platform, using the compiler tools available on each. Cross-platform results shown herein are therefore not directly comparable in a strict sense; these are not academic or marketing benchmarks, but real-world tests that reflect the reality of working in a multi-platform engineering environment.


Background: Jet3D

Jet3D is a jet noise prediction tool based on Lighthill's Acoustic Analogy. Jet3D takes an existing Reynolds-averaged Navier-Stokes (RANS) CFD simulation of a jet flow as input data and post-processes it to compute jet noise. Two versions of Jet3D are tested here: the original double precision scalar code ("scalar"), and a mixed single and double precision vector code with approximate spectral integration ("vector"). In the latter version of the code, key portions of the Jet3D algorithm were rewritten to take advantage of AltiVec for increased performance. The scalar version of the code is written entirely in FORTRAN (a mix of F77 and F90), while the vector version of the code is a mix of FORTRAN and C. Because of its dependence on AltiVec, the vector version of Jet3D runs on G4 and G5 systems only.


Approach

Testing was conducted with the documented "GE215" validation case for Jet3D [3], which predicts noise for a supersonic (Mach 1.4) convergent-divergent nozzle. The computation was run on 4 out of the 64 slices in the computational domain, over 5 observer positions. The input data for this case consists of an 11 block structured grid with 164790 nodes and 80352 cells (requiring about 1MB of memory to run).

The latest version of Jet3D (v060203) was compiled on G4 and P4 platforms. Based on extensive testing, the following compiler flags were used:

Scalar Code
G4 using Absoft F90 v8: f90 -s -O -lU77 -N11
P4 using Portland Group F90 v4.0-3: pgf90 -byteswapio -tp p7 -O1

Vector Code
G4 using Absoft F90 v8 and GCC 3.1:
cc -faltivec -O3 -Wno-precomp -ffixed-v31
f90 -s -O -lU77 -N11 -X -framework -X vecLib

Note that the higher level of optimization (-O2) and SSE/SSE2 options in the Portland compiler degraded Jet3D performance on the P4 system, and were therefore not used.


System Characteristics

Basic characteristics of the G4 and G5 systems were obtained by running a "sysctl hw" command in the shell. The results are:

G5 (dual 2GHz PowerMac)
hw.machine = Power Macintosh
hw.model = PowerMac7,2
hw.ncpu = 2
hw.byteorder = 4321
hw.physmem = 1073741824
hw.usermem = 1001910272
hw.pagesize = 4096
hw.epoch = 1
hw.vectorunit = 1
hw.busfrequency = 1000000000
hw.cpufrequency = 2000000000
hw.cachelinesize = 128
hw.l1icachesize = 65536
hw.l1dcachesize = 32768
hw.l2settings = 2147483648
hw.l2cachesize = 524288
hw.tbfrequency = 33333333
hw.memsize = 1073741824

G4 (dual 1GHz Xserve)
hw.machine = Power Macintosh
hw.model = RackMac1,1
hw.ncpu = 2
hw.byteorder = 4321
hw.physmem = 1610612736
hw.usermem = 1494851584
hw.pagesize = 4096
hw.epoch = 1
hw.vectorunit = 1
hw.busfrequency = 132912930
hw.cpufrequency = 999999997
hw.cachelinesize = 32
hw.l1icachesize = 32768
hw.l1dcachesize = 32768
hw.l2settings = 2148007936
hw.l2cachesize = 262144
hw.l3settings = 2668298240
hw.l3cachesize = 2097152

G4 (dual 1.25GHz PowerMac)
hw.machine = Power Macintosh
hw.model = PowerMac3,6
hw.ncpu = 2
hw.byteorder = 4321
hw.physmem = 2147483648
hw.usermem = 1963442176
hw.pagesize = 4096
hw.epoch = 1
hw.vectorunit = 1
hw.busfrequency = 166627520
hw.cpufrequency = 1249999995
hw.cachelinesize = 32
hw.l1icachesize = 32768
hw.l1dcachesize = 32768
hw.l2settings = 2147483648
hw.l2cachesize = 262144
hw.l3settings = 2676162560
hw.l3cachesize = 2097152

Additional Notes: The G5 system was running Mac OS X 10.2.7 and used 400MHz 128-bit DDR SDRAM, the Xserve G4 system was running Mac OS X Server 10.2.3 and used 266MHz PC2100 DDR SDRAM, and the PowerMac G4 system was running Mac OS X 10.2.3 and used 333MHz PC2700 DDR SDRAM. Note that even though the G4 and G5 systems have dual processors, detailed benchmarks in the present study pertain to a single processor only.

Characteristics of the Pentium 4 systems were obtained by looking at "cpuinfo" and "meminfo" files in /proc. Selected results are:

Pentium 4 (2GHz)
vendor_id: GenuineIntel
cpu family: 15
model: 2
name: Intel(R) Pentium(R) 4 CPU 2.00GHz
stepping: 4
cpu MHz: 2008.951
cache size: 512 KB
fdiv_bug: no
hlt_bug: no
f00f_bug: no
coma_bug: no
fpu: yes
fpu_exception: yes
cpuid level: 2
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips: 4010.80
MemTotal: 2065600 kB

Pentium 4 (2.66GHz)
vendor_id: GenuineIntel
cpu family: 15
model: 2
name: Intel(R) Pentium(R) 4 CPU 2.66GHz
stepping: 7
cpu MHz: 2663.202
cache size: 512 KB
fdiv_bug: no
hlt_bug: no
f00f_bug: no
coma_bug: no
fpu: yes
fpu_exception: yes
cpuid level: 2
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips: 5308.41
MemTotal: 905720 kB

Additional Notes: The 2GHz P4 system was running Red Hat Linux 7.3 and used RAMBUS RAM. The 2.66GHz P4 system was running Red Hat Linux 7.1 and used 333MHz PC2700 DDR SDRAM.
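For reference, the /proc queries above can be reproduced on any Linux system; a minimal sketch (field names as they appear in 2.4-era and later kernels):

```shell
# Per-CPU processor details (one "model name" line per logical CPU)
grep -E 'model name|cpu MHz|cache size' /proc/cpuinfo

# Total installed memory as reported by the kernel
grep MemTotal /proc/meminfo
```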


Results

Benchmarks from the scalar version of Jet3D are shown in Figure 1 (MFLOPS) and Figure 2 (MFLOPS normalized by MHz). In terms of raw MFLOPS, the 2GHz G5 is about 32% faster than the 2GHz P4, 97% faster than the 1.25GHz G4, 142% faster than the 1GHz G4, and within 1 MFLOP of the 2.66GHz P4. A more useful comparison is obtained by looking at normalized benchmarks. Here, the G5 benchmarks at 0.127 MFLOPS/MHz, the two G4 machines benchmark at 0.103-0.105 MFLOPS/MHz, and the two P4 machines come in at 0.096 MFLOPS/MHz.

G4 and P4 normalized performance levels are consistent with earlier testing of Jet3D. It is interesting to note that the G4 and P4 systems have about the same normalized performance; this suggests that the lower clock speeds of G4 systems are the main reason they have lagged P4 systems for raw scalar floating point performance in Jet3D. The 2GHz G5 breaks through this limit with 22-32% higher scalar floating point performance per clock cycle than the G4 or P4. Combined with higher clock speeds, this results in significantly better floating point performance than G4 systems and performance on par with a 2.66GHz P4. Clearly, the G5 would lag faster (up to 3.2GHz) P4 systems in Jet3D scalar floating point performance, but this kind of comparison is best revisited when G5-aware compiler tools become available. An extrapolation of current P4 results to 3.2GHz would add a 20% increase in raw performance, and this could reasonably be matched with better compiler tools on the 2GHz G5.
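As a cross-check, the raw and normalized figures quoted above are mutually consistent. A short Python sketch of the arithmetic (all values transcribed from this section, so small rounding discrepancies remain):

```python
# Reconstruct raw single-CPU scalar MFLOPS from the normalized figures
# quoted above (MFLOPS/MHz times clock speed in MHz).
systems = {
    "G5 2.0GHz":  0.127 * 2000,   # ~254 MFLOPS
    "G4 1.25GHz": 0.103 * 1250,   # low end of the quoted 0.103-0.105 range
    "G4 1.0GHz":  0.105 * 1000,   # high end of the quoted range
    "P4 2.0GHz":  0.096 * 2000,
    "P4 2.66GHz": 0.096 * 2660,
}

g5 = systems["G5 2.0GHz"]
# Speedups relative to the G5, matching the ~32%, ~97%, and ~142% figures
print(f"vs P4 2.0GHz:  {g5 / systems['P4 2.0GHz'] - 1:+.0%}")
print(f"vs G4 1.25GHz: {g5 / systems['G4 1.25GHz'] - 1:+.0%}")
print(f"vs G4 1.0GHz:  {g5 / systems['G4 1.0GHz'] - 1:+.0%}")

# Extrapolating the P4 from 2.66GHz to 3.2GHz at constant MFLOPS/MHz
# reproduces the ~20% raw-performance gap cited for 3.2GHz parts
print(f"3.2GHz P4 extrapolation: {3.2 / 2.66 - 1:+.0%}")
```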

Though dual processor benchmarks are not presented in detail here, it is worth noting that the G5 system benchmarked at 498 MFLOPS and 0.125 MFLOPS/MHz for scalar Jet3D performance when two processors were used.



Figure 1: Single CPU Jet3D Scalar Benchmarks - MFLOPS



Figure 2: Single CPU Jet3D Scalar Benchmarks - MFLOPS/MHz


Jet3D vector benchmark results are presented in Figures 3 and 4. Again, note that the vector benchmark does not include P4 systems because the AltiVec instruction set is only available on G4 and G5 systems. Consistent with earlier Jet3D tests, the vector version of Jet3D runs an order of magnitude faster than the scalar version (speedups of 10X-13X are typical). In this particular test, the raw vector performance of the G5 is impressive at 2755 MFLOPS, but a look at the normalized levels reveals that the G5 is nearly identical to the G4 in terms of vector performance per clock cycle. Thus, the increased raw vector performance of the G5 is largely due to its higher clock speed.

As before, it is worth noting that the G5 system benchmarked at 5177 MFLOPS and 1.29 MFLOPS/MHz for vector Jet3D performance when two processors were used.
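The vector figures admit the same kind of arithmetic cross-check; a short Python sketch (values transcribed from the text):

```python
# Vector Jet3D figures quoted above: 2755 MFLOPS on a single 2GHz G5 CPU,
# 5177 MFLOPS with both CPUs in use.
single_mflops = 2755.0
dual_mflops = 5177.0

# Per-clock rates; the dual-CPU rate is normalized by total clock (2 x 2000MHz)
print(f"single-CPU: {single_mflops / 2000:.2f} MFLOPS/MHz")
print(f"dual-CPU:   {dual_mflops / (2 * 2000):.2f} MFLOPS/MHz")  # ~1.29, as quoted

# Parallel scaling efficiency across the two processors
print(f"dual/single scaling: {dual_mflops / (2 * single_mflops):.0%}")

# Vector speedup over scalar (scalar raw reconstructed as 0.127 MFLOPS/MHz
# x 2000MHz), consistent with the 10X-13X range quoted above
print(f"vector/scalar speedup: {single_mflops / (0.127 * 2000):.1f}x")
```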



Figure 3: Single CPU Jet3D Vector Benchmarks - MFLOPS


Figure 4: Single CPU Jet3D Vector Benchmarks - MFLOPS/MHz



Conclusion

The primary purpose of this test was to determine how G5 scalar floating point performance compares to G4 performance in CFD applications. As a secondary part of this test, G4 and G5 benchmark results were compared to similar results obtained on Pentium 4 systems. Overall, the scalar floating point performance of G5 systems is much improved over G4 systems due to better per clock cycle efficiency combined with higher clock speeds. Based on preliminary testing with an existing version of Jet3D (not recompiled or optimized for the G5), it appears that the G5 has about 22% better scalar floating point performance per clock cycle than the G4 systems tested and 32% better floating point performance per clock cycle than the P4 systems tested. Based on raw scalar floating point performance in Jet3D, a 2GHz G5 system can match a 2.66GHz P4 system, and this is a dramatic improvement from earlier tests where G4 systems lagged behind higher clock speed P4 systems. Based on an extrapolation of current P4 results, the 2GHz G5 would lag newly announced 3.2GHz P4 systems in Jet3D scalar floating point performance by about 20%, but this kind of comparison is best deferred until G5-aware compiler tools become available (since a 20% performance gain is well within the potential of compiler optimization).

Vector performance of the G5 remains excellent, and is in line with current G4 systems on a per clock cycle basis. As a result, raw vector performance of the G5 will be boosted simply by its higher clock speeds relative to current G4 systems.

Finally, it is important to note that the current test does not factor machine cost or intended use into the picture, and that can have a large impact, especially in clustering applications.


References

1. Hunter, C.A. and Thomas, R.H. "Development of a Jet Noise Prediction Method for Installed Jet Configurations". AIAA 2003-3169, May 2003.

2. Hunter, C.A. "An Evaluation of PowerMac G4 Systems for FORTRAN-based Scientific Computing with Application to Computational Fluid Dynamics Simulation". NASA Langley Research Center White Paper, July 2000.

3. Hunter, C.A. "An Approximate Jet Noise Prediction Method based on Reynolds-Averaged Navier-Stokes Computational Fluid Dynamics Simulation". D.Sc. Dissertation, The George Washington University, January 2002.