Parallelization
A first attempt at parallelizing FreeFEM is made here with MPI. An extended MPI interface was added in FreeFEM version 3.5 (see the MPI documentation for the functionality of the language).
MPI
MPI Keywords
The following keywords and concepts are used:

mpiComm
to define a communication world

mpiGroup
to define a group of processors in the communication world

mpiRequest
to define a request to wait for the end of a communication
MPI Constants
mpisize
The total number of processes

mpirank
The id-number of the current process in {0, ..., mpisize-1}

mpiUndefined
The MPI_UNDEFINED constant

mpiAnySource
The MPI_ANY_SOURCE constant

mpiCommWorld
The MPI_COMM_WORLD constant

[ … ] and all the keywords of MPI_Op for the reduce operator: mpiMAX, mpiMIN, mpiSUM, mpiPROD, mpiLAND, mpiLOR, mpiLXOR, mpiBAND, mpiBXOR.
MPI Constructor
// Parameters
int[int] proc1 = [1, 2], proc2 = [0, 3];
int color = 1;
int key = 1;

// MPI ranks
cout << "MPI rank = " << mpirank << endl;

// MPI
mpiComm comm(mpiCommWorld, 0, 0); //set a MPI_Comm to MPI_COMM_WORLD

mpiGroup grp(proc1); //set MPI_Group to proc 1,2 in MPI_COMM_WORLD
mpiGroup grp1(comm, proc1); //set MPI_Group to proc 1,2 in comm

mpiComm ncomm1(mpiCommWorld, grp); //set the MPI_Comm from grp

mpiComm ncomm2(comm, color, key); //MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *ncomm)

mpiRequest rq; //define an MPI_Request
mpiRequest[int] arq(10); //define an array of 10 MPI_Request
MPI Functions
mpiComm Comm(mpiCommWorld, 0, 0);

int MPICommSize = mpiSize(Comm);
int MPIRank = mpiRank(Comm);

if (MPIRank == 0) cout << "MPI Comm size = " << MPICommSize << endl;
cout << "MPI rank in Comm = " << mpiRank(Comm) << endl;

mpiRequest Req;
mpiRequest[int] ReqArray(10);

for (int i = 0; i < MPICommSize; i++){
    //return processor i with no Request in MPI_COMM_WORLD
    processor(i);
    //return processor any source with no Request in MPI_COMM_WORLD
    processor(mpiAnySource);
    //return processor i with no Request in Comm
    processor(i, Comm);
    //return processor i with no Request in Comm
    processor(Comm, i);
    //return processor i with Request rq in Comm
    /* processor(i, Req, Comm);
    //return processor i with Request rq in MPI_COMM_WORLD
    processor(i, Req); */
    //return processor i in MPI_COMM_WORLD in block mode for synchronous communication
    processorblock(i);
    //return processor any source in MPI_COMM_WORLD in block mode for synchronous communication
    processorblock(mpiAnySource);
    //return processor i in Comm in block mode
    processorblock(i, Comm);
}

mpiBarrier(Comm); //do a MPI_Barrier on communicator Comm
mpiWaitAny(ReqArray); //wait for any request of the array to complete
mpiWait(Req); //wait on a request
real t = mpiWtime(); //return MPI_Wtime in seconds
real tick = mpiWtick(); //return MPI_Wtick in seconds
where a processor is just an integer rank, a pointer to an MPI_Comm and a pointer to an MPI_Request, and processorblock is the same with a special MPI_Request.
MPI Communicator operator
int status; //to get the MPI status of send / recv
real a, b;

mpiComm comm(mpiCommWorld, 0, 0);
mpiRequest req;

//send a, b asynchronously to the process 1
processor(1) << a << b;
//receive a, b synchronously from the process 10
processor(10) >> a >> b;

//broadcast from processor of comm to other comm processors
// broadcast(processor(10, comm), a);
//send synchronously to the process 10 the data a
status = Send(processor(10, comm), a);
//receive synchronously from the process 10 the data a
status = Recv(processor(10, comm), a);

//send asynchronously to the process 10 the data a without request
status = Isend(processor(10, comm), a);
//send asynchronously to the process 10 the data a with request
status = Isend(processor(10, comm, req), a);
//receive asynchronously from the process 10 the data a
status = Irecv(processor(10, req), a);
//Error: asynchronous receive without request.
// status = Irecv(processor(10), a);
where the data type of a can be: int, real, complex, int[int], real[int], complex[int], int[int,int], double[int,int], complex[int,int], mesh, mesh3, mesh[int], mesh3[int], matrix, matrix&lt;complex&gt;.
//send asynchronously to the process 10 the data a with request
processor(10, req) << a;
//receive asynchronously from the process 10 the data a with request
processor(10, req) >> a;
If a, b are arrays or full matrices of int, real, or complex, we can use the following MPI functions:
mpiAlltoall(a, b, [comm]);
mpiAllgather(a, b, [comm]);
mpiGather(a, b, processor(..));
mpiScatter(a, b, processor(..));
mpiReduce(a, b, processor(..), mpiMAX);
mpiAllReduce(a, b, comm, mpiMAX);
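For example, a minimal sketch (to be run with ff-mpirun on several processes) that sums one value per process with mpiAllReduce could look like this; the names and sizes are purely illustrative:

mpiComm comm(mpiCommWorld, 0, 0);
real[int] a(1), b(1);
a = mpirank; // each process contributes its rank
b = 0;
mpiAllReduce(a, b, comm, mpiSUM); // after the call, b[0] = 0 + 1 + ... + (mpisize-1) on every process
if (mpirank == 0)
    cout << "sum of ranks = " << b[0] << endl;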
Thanks to Guy-Antoine Atenekeng Kahou for his help in coding this interface.
Schwarz example in parallel
This example is a rewriting of the Schwarz overlapping example.
ff-mpirun -np 2 SchwarzParallel.edp
# OR
mpirun -np 2 FreeFem++-mpi SchwarzParallel.edp
if (mpisize != 2){
    cout << " sorry, number of processors !=2 " << endl;
    exit(1);
}

// Parameters
verbosity = 0;
int interior = 2;
int exterior = 1;
int n = 4;

// Mesh
border a(t=1, 2){x=t; y=0; label=exterior;}
border b(t=0, 1){x=2; y=t; label=exterior;}
border c(t=2, 0){x=t; y=1; label=exterior;}
border d(t=1, 0){x=1-t; y=t; label=interior;}
border e(t=0, pi/2){x=cos(t); y=sin(t); label=interior;}
border e1(t=pi/2, 2*pi){x=cos(t); y=sin(t); label=exterior;}
mesh[int] Th(mpisize);
if (mpirank == 0)
    Th[0] = buildmesh(a(5*n) + b(5*n) + c(10*n) + d(5*n));
else
    Th[1] = buildmesh(e(5*n) + e1(25*n));

broadcast(processor(0), Th[0]);
broadcast(processor(1), Th[1]);

// Fespace
fespace Vh(Th[mpirank], P1);
Vh u = 0, v;

fespace Vhother(Th[1-mpirank], P1);
Vhother U = 0;

// Problem
int i = 0;
problem pb (u, v, init=i, solver=Cholesky)
    = int2d(Th[mpirank])(
          dx(u)*dx(v)
        + dy(u)*dy(v)
    )
    - int2d(Th[mpirank])(
          v
    )
    + on(interior, u=U)
    + on(exterior, u=0)
    ;

// Loop
for (i = 0; i < 20; i++){
    cout << mpirank << " - Loop " << i << endl;

    // Solve
    pb;
    //send u to the other proc, receive in U
    processor(1-mpirank) << u[];
    processor(1-mpirank) >> U[];

    // Error
    real err0, err1;
    err0 = int1d(Th[mpirank], interior)(square(U - u));
    // send err0 to the other proc, receive in err1
    processor(1-mpirank) << err0;
    processor(1-mpirank) >> err1;
    real err = sqrt(err0 + err1);
    cout << " err = " << err << " - err0 = " << err0 << " - err1 = " << err1 << endl;
    if (err < 1e-3) break;
}
if (mpirank == 0)
    plot(u, U);
Todo
script freeze in the loop
True parallel Schwarz example
Thank you to F. Nataf
This is an explanation of the two examples MPI-GMRES 2D and MPI-GMRES 3D, a parallel Schwarz method with a complexity almost independent of the number of processes (with a coarse grid preconditioner).
To solve the following Poisson problem on domain \(\Omega\) with boundary \(\Gamma\) in \(L^2(\Omega)\):

\[-\Delta u = f \mbox{ in } \Omega, \qquad u = g \mbox{ on } \Gamma,\]

where \(f\) and \(g\) are two given functions of \(L^2(\Omega)\) and of \(H^{\frac12}(\Gamma)\).

Let us introduce \((\pi_i)_{i=1,.., N_p}\) a regular partition of unity of \(\Omega\), i.e.:

\[\pi_i \in \mathcal{C}^0(\Omega): \quad \pi_i \ge 0 \mbox{ and } \sum_{i=1}^{N_p} \pi_i = 1.\]
Denote \(\Omega_i\) the sub domain which is the support of the \(\pi_i\) function and also denote \(\Gamma_i\) the boundary of \(\Omega_i\).

The parallel Schwarz method is: let \(\ell=0\) be the iteration index and \(u^0\) an initial guess respecting the boundary condition (i.e. \(u^0_{|\Gamma} = g\)); then iterate

\[\forall i = 1,.., N_p: \quad -\Delta u_i^\ell = f \mbox{ in } \Omega_i, \quad u_i^\ell = u^\ell \mbox{ on } \Gamma_i\setminus\Gamma, \quad u_i^\ell = g \mbox{ on } \Gamma_i\cap\Gamma \qquad (37)\]

\[u^{\ell+1} = \sum_{i=1}^{N_p} \pi_i u_i^\ell \qquad (38)\]
After discretization with the Lagrange finite element method, with a compatible mesh \({\mathcal{T}_h}_i\) of \(\Omega_i\), i.e., there exists a global mesh \({\mathcal{T}_h}\) such that \({\mathcal{T}_h}_i\) is included in \({\mathcal{T}_h}\).
Let us denote:
\({V_h}_i\) the finite element space corresponding to domain \(\Omega_i\),
\({\mathcal{N}_h}_i\) is the set of the degrees of freedom \(\sigma_i^k\),
\({\mathcal{N}^{\Gamma_i}_{hi}}\) is the set of the degrees of freedom of \({V_h}_i\) on the boundary \(\Gamma_i\) of \(\Omega_i\),
\(\sigma_i^k({v_h})\) is the value of the degree of freedom \(k\),
\({V_{0h}}_i= \{ {v_h} \in {V_h}_i :\forall k \in {\mathcal{N}^{\Gamma_i}_{hi}}, \quad \sigma_i^k({v_h})=0 \}\),
The conditional expression \(a\;?\;b:c\) is defined as in the C or C++ language by

\[\begin{split}a?b: c \equiv \left\{ \begin{array}{l} \mbox{if } a \mbox{ is true then return b}\\ \mbox{else return } c\\ \end{array} \right..\end{split}\]
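For instance, a small sketch of the componentwise use of this expression on arrays, as it appears later in this section (the values are only illustrative):

real[int] onG = [1, 0, 1];
real[int] b = [10, 20, 30];
real[int] Bi = [1, 2, 3];
b = onG ? b : Bi; // componentwise: b becomes [10, 2, 30]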
Note
We never use finite element space associated to the full domain \(\Omega\) because it is too expensive.
We have to define two operators to build the previous algorithm:
We denote \({u_h^{\ell}}_{|i}\) the restriction of \(u_h^\ell\) on \({V_h}_i\), so the discrete problem on \(\Omega_i\) of problem (37) is: find \({u_h^{\ell}}_{i}\in {V_h}_i\) such that

\[\forall {v_h} \in {V_{0h}}_i: \int_{\Omega_i} \nabla {v_h} \cdot \nabla {u_h^{\ell}}_{i} = \int_{\Omega_i} f {v_h}, \qquad \forall k \in {\mathcal{N}^{\Gamma_i}_{hi}}: \ \sigma_i^k({u_h^{\ell}}_{i}) = (k\in \Gamma)\;?\; g_i^k : \sigma_i^k({u_h^{\ell}}_{|i}),\]

where \(g_i^k\) is the value of \(g\) associated to the degree of freedom \(k\in {\mathcal{N}^{\Gamma_i}_{hi}}\).
In FreeFEM, this can be written as follows, where U is the vector corresponding to \({u_h^{\ell}}_{|i}\) and U1 is the vector corresponding to \({u_h^{\ell}}_{i}\), the solution of:
real[int] U1(Ui.n);
real[int] b = onG .* U;
b = onG ? b : Bi;
U1 = Ai^-1*b;
where \(\mathtt{onG}[i] =(i \in \Gamma_i\setminus\Gamma) ? 1 : 0\), and \(\mathtt{Bi}\), the right-hand side of the problem, are defined by
// Fespace
fespace Whi(Thi, P2);

// Problem
varf vPb (U, V)
    = int3d(Thi)(
        grad(U)'*grad(V)
    )
    + int3d(Thi)(
        F*V
    )
    + on(1, U=g)
    + on(10, U=G)
    ;

varf vPbon (U, V) = on(10, U=1) + on(1, U=0);

matrix Ai = vPb(Whi, Whi, solver=sparsesolver);
real[int] onG = vPbon(0, Whi);
real[int] Bi = vPb(0, Whi);
where the FreeFEM label of \(\Gamma\) is 1 and the label of \(\Gamma_i\setminus \Gamma\) is \(10\).
To build the transfer/update part corresponding to equation (38) on process \(i\), let us call njpart the number of neighbors of domain \(\Omega_i\) (i.e. the number of subdomains \(j\) such that \(\pi_j\) is not identically zero on \(\Omega_i\)); we store all these neighbors in an array jpart of size njpart.
Let us introduce two arrays of matrices: sMj[j] to define the vector to send from \(i\) to the neighbor process \(j\), and rMj[j] to reduce afterwards with the neighbor domain \(j\).

So the transfer and update part computes \(v_i= \pi_i u_i + \sum_{j\in J_i} \pi_j u_j\) and can be written as the FreeFEM function Update:
func bool Update (real[int] &ui, real[int] &vi){
    int n = jpart.n;
    for (int j = 0; j < njpart; ++j) Usend[j][] = sMj[j]*ui;
    mpiRequest[int] rq(n*2);
    for (int j = 0; j < n; ++j) Irecv(processor(jpart[j], comm, rq[j]), Ri[j][]);
    for (int j = 0; j < n; ++j) Isend(processor(jpart[j], comm, rq[j+n]), Si[j][]);
    for (int j = 0; j < n*2; ++j) int k = mpiWaitAny(rq);
    // apply the local partition of unity
    vi = Pii*ui; //set to pi_i u_i
    for (int j = 0; j < njpart; ++j) vi += rMj[j]*Vrecv[j][]; //add pi_j u_j
    return true;
}
where the buffers are defined by:

InitU(njpart, Whij, Thij, aThij, Usend) //define the send buffer
InitU(njpart, Whij, Thij, aThij, Vrecv) //define the recv buffer
with the following macro definition:
macro InitU(n, Vh, Th, aTh, U) Vh[int] U(n); for (int j = 0; j < n; ++j){Th = aTh[j]; U[j] = 0;}
First GMRES algorithm: you can easily accelerate the fixed point algorithm by using a parallel GMRES algorithm after introducing the following affine operator \(\mathcal{A}_i\) on sub domain \(\Omega_i\).
func real[int] DJ0 (real[int]& U){
    real[int] V(U.n), b = onG .* U;
    b = onG ? b : Bi;
    V = Ai^-1*b;
    Update(V, U);
    V -= U;
    return V;
}
where the parallel MPIGMRES or MPICG algorithm is just a simple way to solve in parallel \(A_i x_i = b_i, i = 1, .., N_p\), by replacing the dot product with a reduction of the local dot products of all processes, with the following MPI code:
template<class R> R ReduceSum1(R s, MPI_Comm *comm){
    R r = 0;
    MPI_Allreduce(&s, &r, 1, MPI_TYPE<R>::TYPE(), MPI_SUM, *comm);
    return r;
}
This is done in the MPIGC dynamic library tool.
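At the FreeFEM script level, the same idea can be sketched with mpiAllReduce; the vector names and sizes below are only illustrative, not part of the library:

mpiComm comm(mpiCommWorld, 0, 0);
real[int] xi(10), yi(10); // local parts of two distributed vectors (illustrative)
xi = 1.; yi = 2.;
real[int] s(1), r(1);
s = xi'*yi;                       // local dot product
mpiAllReduce(s, r, comm, mpiSUM); // r[0] holds the global dot product on every process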
Second GMRES algorithm: use the Schwarz algorithm as a preconditioner of the basic GMRES method to solve the parallel problem.
func real[int] DJ (real[int]& U){ //the original problem
    ++kiter;
    real[int] V(U.n);
    V = Ai*U;
    V = onGi ? 0.: V; //remove boundary term
    return V;
}

func real[int] PDJ (real[int]& U){ //the preconditioner
    real[int] V(U.n);
    real[int] b = onG ? 0. : U;
    V = Ai^-1*b;
    Update(V, U);
    return U;
}
Third GMRES algorithm: add a coarse solver to the previous algorithm. First build a coarse grid on processor 0, and then:
matrix AC, Rci, Pci;
if (mpiRank(comm) == 0)
    AC = vPbC(VhC, VhC, solver=sparsesolver); //the coarse problem

Pci = interpolate(Whi, VhC); //the projection on the coarse grid
Rci = Pci'*Pii; //the restriction on the process i grid with the partition pi_i

func bool CoarseSolve (real[int]& V, real[int]& U, mpiComm& comm){
    // solving the coarse problem
    real[int] Uc(Rci.n), Bc(Uc.n);
    Uc = Rci*U;
    mpiReduce(Uc, Bc, processor(0, comm), mpiSUM);
    if (mpiRank(comm) == 0)
        Uc = AC^-1*Bc;
    broadcast(processor(0, comm), Uc);
    V = Pci*Uc;
}
The new preconditioner:
func real[int] PDJC (real[int]& U){
    // Idea: F. Nataf.
    // 0 ~ (I-C1A)(I-C2A) => I ~ -C1AC2A + C1A + C2A
    // New Prec P = C1 + C2 - C1AC2 = C1(I - A C2) + C2
    // ( C1(I - A C2) + C2 ) Uo
    // V = - C2*Uo
    // ....
    real[int] V(U.n);
    CoarseSolve(V, U, comm);
    V = -V; //-C2*Uo
    U += Ai*V; //U = (I-A C2) Uo
    real[int] b = onG ? 0. : U;
    U = Ai^-1*b; //C1( I -A C2) Uo
    V = U - V;
    Update(V, U);
    return U;
}
The code of the 4 algorithms:
real epss = 1e-6;
int rgmres = 0;
if (gmres == 1){
    rgmres = MPIAffineGMRES(DJ0, u[], veps=epss, nbiter=300,
        comm=comm, dimKrylov=100, verbosity=ipart?0: 50);
    real[int] b = onG .* u[];
    b = onG ? b : Bi;
    v[] = Ai^-1*b;
    Update(v[], u[]);
}
else if (gmres == 2)
    rgmres = MPILinearGMRES(DJ, precon=PDJ, u[], Bi, veps=epss,
        nbiter=300, comm=comm, dimKrylov=100, verbosity=ipart?0: 50);
else if (gmres == 3)
    rgmres = MPILinearGMRES(DJ, precon=PDJC, u[], Bi, veps=epss,
        nbiter=300, comm=comm, dimKrylov=100, verbosity=ipart?0: 50);
else //Schwarz algorithm for demo
    for(int iter = 0; iter < 10; ++iter)
        ...
We have all the ingredients to solve in parallel if we have set the partition of unity. To build this partition we do:
The initial step on process \(1\) is to build a coarse mesh \({\mathcal{T}_h}^*\) of the full domain and to build the partition \(\pi\), a function constant equal to \(i\) on each sub domain \(\mathcal{O}_i, i =1 ,.., N_p\), of the grid, with the METIS graph partitioner [KARYPIS1995]; then on each process \(i\) in \(1,..,N_p\) do:

Broadcast from process \(1\) the mesh \({\mathcal{T}_h}^*\) (called Thii in the FreeFEM script) and the \(\pi\) function; remark that the characteristic function \(\mathrm{1\!\!I}_{\mathcal{O}_i}\) of domain \(\mathcal{O}_i\) is defined by \((\pi=i)?1:0\).
Let us call \(\Pi^2_P\) (resp. \(\Pi^2_V\)) the \(L^2\) projection on \(P_h^*\), the space of piecewise constant finite element functions on \({\mathcal{T}_h}^*\) (resp. on \(V_h^*\), the space of continuous piecewise affine finite element functions on \({\mathcal{T}_h}^*\)), and build in parallel the \(\pi_i\) and \(\Omega_i\) such that \(\mathcal{O}_i \subset \Omega_i\), where \(\mathcal{O}_i= supp ((\Pi^2_V \Pi^2_P)^m \mathrm{1\!\!I}_{O_i})\) and \(m\) is the overlap size on the coarse mesh (generally one). This is done in the function AddLayers(Thii, suppii[], nlayer, phii[]).
We choose the function \(\pi^*_i = (\Pi^2_V \Pi^2_P)^m \mathrm{1\!\!I}_{\mathcal{O}_i}\), so the partition of unity is simply defined by

\[\pi_i = \frac{\pi_i^*}{\sum_{j=1}^{N_p} \pi_j^*}\]

The set \(J_i\) of neighbors of the domain \(\Omega_i\), and the local version on \(V_{hi}\), can be defined in the arrays jpart and njpart with:

Vhi pii = piistar;
Vhi[int] pij(npij); //local partition of 1 = pii + sum_j pij[j]
int[int] jpart(npart);
int njpart = 0;
Vhi sumphi = piistar;
for (int i = 0; i < npart; ++i)
    if (i != ipart){
        if (int3d(Thi)(pijstar,j) > 0){
            pij[njpart] = pijstar;
            sumphi[] += pij[njpart][];
            jpart[njpart++] = i;
        }
    }
pii[] = pii[] ./ sumphi[];
for (int j = 0; j < njpart; ++j)
    pij[j][] = pij[j][] ./ sumphi[];
jpart.resize(njpart);
We call \({\mathcal{T}_h}^*_{ij}\) the sub mesh part of \({\mathcal{T}_h}_i\) where \(\pi_j\) is not zero, and use the function trunc to build this array:

for(int jp = 0; jp < njpart; ++jp)
    aThij[jp] = trunc(Thi, pij[jp] > 1e-10, label=10);
At this step we have everything on the coarse mesh, so we can build the fine final mesh by splitting all meshes Thi, Thij[j] with the FreeFEM trunc mesh function, which does restriction and splitting.

The construction of the send/recv matrices sMj and rMj can be done with this code:

mesh3 Thij = Thi;
fespace Whij(Thij, Pk);
matrix Pii; Whi wpii = pii; Pii = wpii[]; //diagonal matrix corresponding to x pi_i
matrix[int] sMj(njpart), rMj(njpart); //M send/receive case
for (int jp = 0; jp < njpart; ++jp){
    int j = jpart[jp];
    Thij = aThij[jp]; //change the mesh to change Whij, Whij
    matrix I = interpolate(Whij, Whi); //Whij <- Whi
    sMj[jp] = I*Pii; //Whi -> s Whij
    rMj[jp] = interpolate(Whij, Whi, t=1); //Whij -> Whi
}
To build a reasonably robust application, all the variables come from parameter values, with the following code:
1include "getARGV.idp"
2verbosity = getARGV("-vv", 0);
3int vdebug = getARGV("-d", 1);
4int ksplit = getARGV("-k", 10);
5int nloc = getARGV("-n", 25);
6string sff = getARGV("-p, ", "");
7int gmres = getARGV("-gmres", 3);
8bool dplot = getARGV("-dp", 0);
9int nC = getARGV("-N", max(nloc/10, 4));
And a small include to plot in parallel the distributed solution vector \(u\) on mesh \(T_h\) with the following interface:
1include "MPIplot.idp"
2func bool plotMPIall(mesh &Th, real[int] &u, string cm){
3 PLOTMPIALL(mesh, Pk, Th, u, {cmm=cm, nbiso=20, fill=1, dim=3, value=1});
4 return 1;
5}
Note
The cmm=cm, ... in the macro argument is a way to quote the macro argument, so the argument is cmm=cm, ....
Parallel sparse solvers
Parallel sparse solvers use several processors to solve linear systems of equations. As in the sequential case, parallel linear solvers can be direct or iterative. Both are available in FreeFEM.
Using parallel sparse solvers in FreeFEM
We recall that the solver parameters are defined in the following commands: solve, problem, set (setting parameters of a matrix) and in the construction of the matrix corresponding to a bilinear form. In these commands, the parameter solver must be set to sparsesolver for a parallel sparse solver.
We have added specific parameters to these command lines for parallel sparse solvers. These are:

lparams: vector of integer parameters (l is for the C++ type long)
dparams: vector of real parameters
sparams: string parameters
datafilename: name of the file which contains the solver parameters
The following four parameters are only for direct solvers and are vectors. These parameters allow the user to preprocess the matrix (see the section on sparse direct solvers for more information).

permr: row permutation (integer vector)
permc: column permutation or inverse row permutation (integer vector)
scaler: row scaling (real vector)
scalec: column scaling (real vector)
There are two possibilities to control the solver parameters. The first method defines the parameters with lparams, dparams and sparams in the .edp file. The second one reads the solver parameters from a data file; the name of this file is specified by datafilename. If lparams, dparams, sparams or datafilename is not provided by the user, the solver's default values are used.
To use a parallel solver in FreeFEM, we need to load the dynamic library corresponding to this solver. For example, to use MUMPS as parallel solver in FreeFEM, write load "MUMPS_FreeFem" in the .edp file.

If the libraries are not loaded, the default sparse solver will be used (the default sparse solver is UMFPACK). Table 2 gives the new default value for the different libraries.

We also add functions (see Table 3) with no parameter to change the default sparse solver in the .edp file. To use these functions, we need to load the library corresponding to the solver. An example of using different parallel sparse solvers for the same problem is given in the Direct solvers example.
Tip
Test direct solvers
1load "MUMPS_FreeFem"
2//default solver: real-> MUMPS, complex -> MUMPS
3load "real_SuperLU_DIST_FreeFem"
4//default solver: real-> SuperLU_DIST,
5complex -> MUMPS load "real_pastix_FreeFem"
6//default solver: real-> pastix, complex -> MUMPS
7
8// Solving with pastix
9{
10 matrix A =
11 [[1, 2, 2, 1, 1],
12 [ 2, 12, 0, 10, 10],
13 [ 2, 0, 1, 0, 2],
14 [ 1, 10, 0, 22, 0.],
15 [ 1, 10, 2, 0., 22]];
16
17 real[int] xx = [1, 32, 45, 7, 2], x(5), b(5), di(5);
18 b = A*xx;
19 cout << "b =" << b << endl; cout << "xx =" << xx << endl;
20
21 set(A, solver=sparsesolver, datafilename="ffpastix_iparm_dparm.txt");
22 cout << "solve" << endl;
23 x = A^-1*b;
24 cout << "b =" << b << endl;
25 cout << "x =" << endl;
26 cout << x << endl;
27 di = xx - x;
28 if (mpirank == 0){
29 cout << "x-xx =" << endl;
30 cout << "Linf =" << di.linfty << ", L2 =" << di.l2 << endl;
31 }
32}
33
34// Solving with SuperLU_DIST
35realdefaulttoSuperLUdist();
36//default solver: real-> SuperLU_DIST, complex -> MUMPS
37{
38 matrix A =
39 [[1, 2, 2, 1, 1],
40 [ 2, 12, 0, 10, 10],
41 [ 2, 0, 1, 0, 2],
42 [ 1, 10, 0, 22, 0.],
43 [ 1, 10, 2, 0., 22]];
44
45 real[int] xx = [1, 32, 45, 7, 2], x(5), b(5), di(5);
46 b = A*xx;
47 cout << "b =" << b << endl;
48 cout << "xx =" << xx << endl;
49
50 set(A, solver=sparsesolver, datafilename="ffsuperlu_dist_fileparam.txt");
51 cout << "solve" << endl;
52 x = A^-1*b;
53 cout << "b =" << b << endl;
54 cout << "x =" << endl;
55 cout << x << endl;
56 di = xx - x;
57 if (mpirank == 0){
58 cout << "x-xx =" << endl;
59 cout << "Linf =" << di.linfty << ", L2 =" << di.l2 << endl;
60 }
61}
62
63// Solving with MUMPS
64defaulttoMUMPS();
65//default solver: real-> MUMPS, complex -> MUMPS
66{
67 matrix A =
68 [[1, 2, 2, 1, 1],
69 [ 2, 12, 0, 10, 10],
70 [ 2, 0, 1, 0, 2],
71 [ 1, 10, 0, 22, 0.],
72 [ 1, 10, 2, 0., 22]];
73
74 real[int] xx = [1, 32, 45, 7, 2], x(5), b(5), di(5);
75 b = A*xx;
76 cout << "b =" << b << endl;
77 cout << "xx =" << xx << endl;
78
79 set(A, solver=sparsesolver, datafilename="ffmumps_fileparam.txt");
80 cout << "solving solution" << endl;
81 x = A^-1*b;
82 cout << "b =" << b << endl;
83 cout << "x =" << endl;
84 cout << x << endl;
85 di = xx - x;
86 if (mpirank == 0){
87 cout << "x-xx =" << endl;
88 cout << "Linf =" << di.linfty << ", L2" << di.l2 << endl;
89 }
90}
Sparse direct solver
In this section, we present the sparse direct solvers interfaced with FreeFEM.
MUMPS solver
MUltifrontal Massively Parallel Solver (MUMPS) is an open-source library.
This package solves linear systems of the form \(A \: x = b\), where \(A\) is a square sparse matrix, with a direct method. The square matrix considered in MUMPS can be either unsymmetric, symmetric positive definite or general symmetric.
The method implemented in MUMPS is a direct method based on a multifrontal approach. It constructs a direct factorization \(A \:= \: L\:U\) or \(A\: = \: L^t \: D \: L\) depending on the symmetry of the matrix \(A\).
Warning
MUMPS does not solve linear system with a rectangular matrix.
MUMPS parameters:
There are four input parameters in MUMPS: two integers SYM and PAR, a vector of 40 integers ICNTL and a vector of 15 reals CNTL.

The first parameter gives the type of the matrix: 0 for an unsymmetric matrix, 1 for a symmetric positive definite matrix and 2 for a general symmetric matrix.

The second parameter defines whether the host processor works during the factorization and solve steps: PAR=1 host processor working and PAR=0 host processor not working.

The vectors ICNTL and CNTL are the control parameters of MUMPS; they are indexed from 1, like vectors in Fortran.
For more details see the MUMPS user’s guide.
We now describe some elements of the main parameters of ICNTL for MUMPS.
- Input matrix parameter: The input matrix is controlled by the parameters ICNTL(5) and ICNTL(18). The matrix format (resp. matrix pattern and matrix entries) is controlled by ICNTL(5) (resp. ICNTL(18)).

  The different values of ICNTL(5) are 0 for assembled format and 1 for elemental format. In the current release of FreeFEM, we consider that the FE matrix (or matrix) is stored in assembled format. Therefore, ICNTL(5) is treated as the value 0.

  The main options for ICNTL(18) are: ICNTL(18)=0, the matrix is given centrally on the host processor; ICNTL(18)=3, the input matrix pattern and the entries are distributed (the option recommended for distributed matrices by the developers of MUMPS). For other values of ICNTL(18) see the MUMPS user's guide. These values can also be used in FreeFEM.

  The default options implemented in FreeFEM are ICNTL(5)=0 and ICNTL(18)=0.
Preprocessing parameter: The preprocessed matrix \(A_{p}\) that will be effectively factored is defined by

\[A_{p} = P \: D_r \: A \: Q_c \: D_c \: P^t\]

where \(P\) is the permutation matrix, \(Q_c\) is the column permutation, and \(D_r\) and \(D_c\) are diagonal matrices for row and column scaling respectively.

The ordering strategy to obtain \(P\) is controlled by the parameter ICNTL(7). The permutation to a zero-free diagonal, \(Q_c\), is controlled by the parameter ICNTL(6). The row and column scalings are controlled by the parameter ICNTL(8). These options are connected and also strongly related with ICNTL(12) (see the MUMPS user's guide for more details).

The parameters permr, scaler and scalec in FreeFEM allow the user to give the permutation matrix (\(P\)), the row scaling (\(D_r\)) and the column scaling (\(D_c\)) respectively.
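A hedged sketch of how these vectors might be passed (the assembly lines and variable names are illustrative; the identity permutations and unit scalings are pure placeholders, and whether MUMPS actually uses them, and with which index base, depends on the ICNTL settings described above):

load "MUMPS_FreeFem"
mesh Th = square(10, 10);
fespace Vh(Th, P1);
varf vLap (u, v) = int2d(Th)(dx(u)*dx(v) + dy(u)*dy(v)) + on(1, 2, 3, 4, u=0);
matrix A = vLap(Vh, Vh);
int n = A.n;
int[int] pr(n), pc(n);
real[int] sr(n), sc(n);
for (int i = 0; i < n; i++){ pr[i] = i; pc[i] = i; } // identity permutations (placeholder)
sr = 1.; sc = 1.;                                    // unit scalings (placeholder)
set(A, solver=sparsesolver, permr=pr, permc=pc, scaler=sr, scalec=sc);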
Calling MUMPS in FreeFEM
To call MUMPS in FreeFEM, we need to load the dynamic library MUMPS_freefem.dylib (MacOS X), MUMPS_freefem.so (Unix) or MUMPS_freefem.dll (Windows). This is done by typing load "MUMPS_FreeFem" in the .edp file. We now give the two methods to set the options of the MUMPS solver in FreeFEM.
- Solver parameters defined in the .edp file: In this method, we need to give the parameters lparams and dparams. These parameters are defined for MUMPS by:

  lparams[0] = SYM, lparams[1] = PAR,
  \(\forall i\) = 1,…,40, lparams[i+1] = ICNTL(i),
  \(\forall i\) = 1,…,15, dparams[i-1] = CNTL(i).
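For example, a hedged sketch of filling these vectors (the assembly is illustrative; the values mirror the default data file shown below, and only SYM, PAR, ICNTL(14) and CNTL(1) are set explicitly, the remaining entries being left at zero for brevity rather than at the true MUMPS defaults):

load "MUMPS_FreeFem"
mesh Th = square(10, 10);
fespace Vh(Th, P1);
varf vLap (u, v) = int2d(Th)(dx(u)*dx(v) + dy(u)*dy(v)) + on(1, 2, 3, 4, u=0);
matrix A = vLap(Vh, Vh);
real[int] rhs = vLap(0, Vh);
int[int] ip(42);   // ip[0]=SYM, ip[1]=PAR, ip[i+1]=ICNTL(i)
real[int] dp(15);  // dp[i-1]=CNTL(i)
ip = 0; dp = 0.;
ip[0] = 0;    // SYM = 0: unsymmetric matrix
ip[1] = 1;    // PAR = 1: host processor works during factorization and solve
ip[15] = 20;  // ICNTL(14) = 20: percentage increase of the estimated working space
dp[0] = 0.01; // CNTL(1) = 0.01: relative threshold for numerical pivoting
set(A, solver=sparsesolver, lparams=ip, dparams=dp);
Vh u;
u[] = A^-1*rhs;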
Reading solver parameters from a file:

The structure of the data file for MUMPS in FreeFEM is: the first line is the parameter SYM, the second line is the parameter PAR, and the following lines give the values of the vectors ICNTL and CNTL. An example of this parameter file is given in ffmumpsfileparam.txt:

0 /* SYM :: 0 for non symmetric matrix, 1 for symmetric definite positive matrix and 2 general symmetric matrix*/
1 /* PAR :: 0 host not working during factorization and solves steps, 1 host working during factorization and solves steps*/
-1 /* ICNTL(1) :: output stream for error message */
-1 /* ICNTL(2) :: output for diagnostic printing, statics and warning message */
-1 /* ICNTL(3) :: for global information */
0 /* ICNTL(4) :: Level of printing for error, warning and diagnostic message */
0 /* ICNTL(5) :: matrix format : 0 assembled format, 1 elemental format. */
7 /* ICNTL(6) :: control option for permuting and/or scaling the matrix in analysis phase */
3 /* ICNTL(7) :: pivot order strategy : AMD, AMF, metis, pord scotch*/
77 /* ICNTL(8) :: Row and Column scaling strategy */
1 /* ICNTL(9) :: 0 solve Ax = b, 1 solve the transposed system A^t x = b : parameter is not considered in the current release of FreeFEM*/
0 /* ICNTL(10) :: number of steps of iterative refinement */
0 /* ICNTL(11) :: statics related to linear system depending on ICNTL(9) */
1 /* ICNTL(12) :: constrained ordering strategy for general symmetric matrix */
0 /* ICNTL(13) :: method to control splitting of the root frontal matrix */
20 /* ICNTL(14) :: percentage increase in the estimated working space (default 20%)*/
0 /* ICNTL(15) :: not used in this release of MUMPS */
0 /* ICNTL(16) :: not used in this release of MUMPS */
0 /* ICNTL(17) :: not used in this release of MUMPS */
3 /* ICNTL(18) :: method for given : matrix pattern and matrix entries : */
0 /* ICNTL(19) :: method to return the Schur complement matrix */
0 /* ICNTL(20) :: right hand side form ( 0 dense form, 1 sparse form) : parameter will be set to 0 for FreeFEM */
0 /* ICNTL(21) :: 0, 1 kept distributed solution : parameter is not considered in the current release of FreeFEM */
0 /* ICNTL(22) :: controls the in-core/out-of-core (OOC) facility */
0 /* ICNTL(23) :: maximum size of the working memory in Megabyte than MUMPS can allocate per working processor */
0 /* ICNTL(24) :: control the detection of null pivot */
0 /* ICNTL(25) :: control the computation of a null space basis */
0 /* ICNTL(26) :: This parameter is only significant with Schur option (ICNTL(19) not zero). : parameter is not considered in the current release of FreeFEM */
-8 /* ICNTL(27) (Experimental parameter subject to change in next release of MUMPS) :: control the blocking factor for multiple righthand side during the solution phase : parameter is not considered in the current release of FreeFEM */
0 /* ICNTL(28) :: not used in this release of MUMPS*/
0 /* ICNTL(29) :: not used in this release of MUMPS*/
0 /* ICNTL(30) :: not used in this release of MUMPS*/
0 /* ICNTL(31) :: not used in this release of MUMPS*/
0 /* ICNTL(32) :: not used in this release of MUMPS*/
0 /* ICNTL(33) :: not used in this release of MUMPS*/
0 /* ICNTL(34) :: not used in this release of MUMPS*/
0 /* ICNTL(35) :: not used in this release of MUMPS*/
0 /* ICNTL(36) :: not used in this release of MUMPS*/
0 /* ICNTL(37) :: not used in this release of MUMPS*/
0 /* ICNTL(38) :: not used in this release of MUMPS*/
1 /* ICNTL(39) :: not used in this release of MUMPS*/
0 /* ICNTL(40) :: not used in this release of MUMPS*/
0.01 /* CNTL(1) :: relative threshold for numerical pivoting */
1e-8 /* CNTL(2) :: stopping criteria for iterative refinement */
-1 /* CNTL(3) :: threshold for null pivot detection */
-1 /* CNTL(4) :: determine the threshold for partial pivoting */
0.0 /* CNTL(5) :: fixation for null pivots */
0 /* CNTL(6) :: not used in this release of MUMPS */
0 /* CNTL(7) :: not used in this release of MUMPS */
0 /* CNTL(8) :: not used in this release of MUMPS */
0 /* CNTL(9) :: not used in this release of MUMPS */
0 /* CNTL(10) :: not used in this release of MUMPS */
0 /* CNTL(11) :: not used in this release of MUMPS */
0 /* CNTL(12) :: not used in this release of MUMPS */
0 /* CNTL(13) :: not used in this release of MUMPS */
0 /* CNTL(14) :: not used in this release of MUMPS */
0 /* CNTL(15) :: not used in this release of MUMPS */
If no solver parameters are given, the default options of the MUMPS solver are used.
Tip
MUMPS example
A simple example of calling MUMPS in FreeFEM with this two methods is given in the Test solver MUMPS example.
SuperLU distributed solver
The package SuperLU_DIST solves linear systems using LU factorization. It is a free scientific library.
This library provides functions to handle square or rectangular matrices in real and complex arithmetic. The method implemented in SuperLU_DIST is a supernodal method. New releases of this package include a parallel symbolic factorization. This scientific library is written in C and uses MPI for communications.
SuperLU_DIST parameters:
We now describe some parameters of SuperLU_DIST. The SuperLU_DIST library uses a 2D logical process grid, specified by \(nprow\) (process rows) and \(npcol\) (process columns) such that \(N_{p} = nprow \: npcol\), where \(N_{p}\) is the total number of processes allocated for SuperLU_DIST.
The input matrix parameter is controlled by "matrix=" in sparams for internal parameters, or in the third line of the parameters file. The different values are:

matrix=assembled: the global matrix is available on all processes
matrix=distributedglobal: the global matrix is distributed among all the processes
matrix=distributed: the input matrix is distributed (not yet implemented)
The option arguments of SuperLU_DIST are described in the section Users-callable routines of the SuperLU users' guide.

The parameters Fact and TRANS are specified by the FreeFEM interface to SuperLU_DIST during the different steps. For this reason, the values given by the user for these options are not considered.

The LU factorization is computed in SuperLU_DIST on a preprocessed matrix \(A_p\) obtained from \(A\) by row and column permutations (\(P_r\), \(P_c\)) and row and column scalings (\(D_r\), \(D_c\), diagonal matrices).

The option arguments RowPerm (resp. ColPerm) control the row (resp. column) permutation matrix. \(D_r\) and \(D_c\) are controlled by the parameter DiagScale.
The parameters permr, permc, scaler, and scalec in FreeFEM allow the user to give the row permutation, column permutation, row scaling and column scaling respectively.

The other parameters for the LU factorization are ParSymFact and ReplaceTinyPivot. The parallel symbolic factorization works only on a power-of-two number of processes and needs the ParMetis ordering.

The default option arguments of SuperLU_DIST are given in the file ffsuperlu_dist_fileparam.txt.
Calling SuperLU_DIST in FreeFEM
To call SuperLU_DIST in FreeFEM, we need to load the dynamic library corresponding to the interface. This is done by the line load "real_SuperLU_DIST_FreeFem" (resp. load "complex_SuperLU_DIST_FreeFem") for real (resp. complex) arithmetic in the .edp file.
Solver parameters defined in the .edp file:

To call SuperLU_DIST with internal parameters, we use the parameter sparams. The values of the SuperLU_DIST parameters in sparams are defined by:

nprow=1,
npcol=1,
matrix=distributedglobal,
Fact=DOFACT,
Equil=NO,
ParSymbFact=NO,
ColPerm=MMD_AT_PLUS_A,
RowPerm=LargeDiag,
DiagPivotThresh=1.0,
IterRefine=DOUBLE,
Trans=NOTRANS,
ReplaceTinyPivot=NO,
SolveInitialized=NO,
PrintStat=NO,
DiagScale=NOEQUIL

These values correspond to the parameters in the file ffsuperlu_dist_fileparam.txt.
If a parameter is not specified by the user, the default value of SuperLU_DIST is used.
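For example, a hedged sketch that overrides a few keywords through sparams and leaves the rest at the SuperLU_DIST defaults (the mesh, variational form and variable names are illustrative):

load "real_SuperLU_DIST_FreeFem"
mesh Th = square(10, 10);
fespace Vh(Th, P1);
varf vLap (u, v) = int2d(Th)(dx(u)*dx(v) + dy(u)*dy(v)) + on(1, 2, 3, 4, u=0);
matrix A = vLap(Vh, Vh);
real[int] rhs = vLap(0, Vh);
// only the listed keywords are changed; unspecified ones keep the SuperLU_DIST defaults
set(A, solver=sparsesolver,
    sparams="nprow=1, npcol=1, matrix=distributedglobal, Fact=DOFACT, IterRefine=NOREFINE");
Vh u;
u[] = A^-1*rhs;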
Reading solver parameters from a file: The structure of the data file for SuperLU_DIST in FreeFEM is given in the file ffsuperlu_dist_fileparam.txt (the default values of the FreeFEM interface).
1 /* nprow : integer value */
1 /* npcol : integer value */
distributedglobal /* matrix input : assembled, distributedglobal, distributed */
DOFACT /* Fact : DOFACT, SamePattern, SamePattern_SameRowPerm, FACTORED */
NO /* Equil : NO, YES */
NO /* ParSymbFact : NO, YES */
MMD_AT_PLUS_A /* ColPerm : NATURAL, MMD_AT_PLUS_A, MMD_ATA, METIS_AT_PLUS_A, PARMETIS, MY_PERMC */
LargeDiag /* RowPerm : NOROWPERM, LargeDiag, MY_PERMR */
1.0 /* DiagPivotThresh : real value */
DOUBLE /* IterRefine : NOREFINE, SINGLE, DOUBLE, EXTRA */
NOTRANS /* Trans : NOTRANS, TRANS, CONJ*/
NO /* ReplaceTinyPivot : NO, YES*/
NO /* SolveInitialized : NO, YES*/
NO /* RefineInitialized : NO, YES*/
NO /* PrintStat : NO, YES*/
NOEQUIL /* DiagScale : NOEQUIL, ROW, COL, BOTH*/
If no solver parameters are given, the default options of the SuperLU_DIST solver are used.
Tip
A simple example of calling SuperLU_DIST in FreeFEM with this two methods is given in the Solver superLU_DIST example.
PaStiX solver
PaStiX (Parallel Sparse matriX package) is a free scientific library under the CECILL-C license. This package solves sparse linear systems with direct and block ILU(k) iterative methods. This solver can be applied to a real or complex matrix with a symmetric pattern.
PaStiX parameters:
The input matrix parameter of FreeFEM depends on the PaStiX interface: matrix=assembled for a non-distributed matrix. It is the same parameter as for SuperLU_DIST.

There are four parameters in PaStiX: iparm, dparm, perm and invp. These parameters are respectively the integer parameters (vector of size 64), the real parameters (vector of size 64), the permutation matrix and the inverse permutation matrix. The iparm and dparm vectors are described in the PaStiX RefCard.
The parameters permr and permc in FreeFEM are provided to give the permutation matrix and the inverse permutation matrix of the user respectively.
Solver parameters defined in the .edp file:

To call PaStiX in FreeFEM in this case, we need to specify the parameters lparams and dparams. These parameters are defined by:

\(\forall i\) = 0,…,63, lparams[i] = iparm[i],
\(\forall i\) = 0,…,63, dparams[i] = dparm[i].
Reading solver parameters from a file:

The structure of the data file for the PaStiX parameters in FreeFEM is: the first line gives the structure parameter of the matrix, and the following lines give the values of the vectors iparm and dparm, in this order.
assembled /* matrix input :: assembled, distributed global and distributed */
iparm[0]
iparm[1]
...
...
iparm[63]
dparm[0]
dparm[1]
...
...
dparm[63]
An example of this parameter file is given in ffpastix_iparm_dparm.txt, with a description of these parameters. This file is obtained from the example files iparm.txt and dparm.txt included in the PaStiX package.
If no solver parameters are given, we use the default options of the PaStiX solver.
Tip
A simple example of calling PaStiX in FreeFEM with this two methods is given in the Solver PaStiX example.
In Table 4, we recall the different matrices considered by the different direct solvers.
| direct solver | square matrix: sym | square matrix: sym pattern | square matrix: unsym | rectangular matrix: sym | rectangular matrix: sym pattern | rectangular matrix: unsym |
|---|---|---|---|---|---|---|
| SuperLU_DIST | yes | yes | yes | yes | yes | yes |
| MUMPS | yes | yes | yes | no | no | no |
| Pastix | yes | yes | no | no | no | no |
Parallel sparse iterative solver
Concerning iterative solvers, we have chosen pARMS, HIPS and Hypre.
Each software implements a different type of parallel preconditioner.
pARMS implements algebraic domain decomposition preconditioners such as additive Schwarz [CAI1989] and interface methods, HIPS implements hierarchical incomplete factorization, and HYPRE implements multilevel preconditioners: AMG (Algebraic MultiGrid) and parallel approximate inverse.
To use one of these programs in FreeFEM, you have to install it independently of FreeFEM. It is also necessary to install the MPI communication library, which is essential for communication between the processors, and, in some cases, graph partitioning software like METIS or Scotch.
All these preconditioners are used with Krylov subspace method accelerators.
Krylov subspace methods are iterative methods which consist in finding a solution \(x\) of linear system \(Ax=b\) inside the affine space \(x_0+K_m\) by imposing that \(b-Ax \bot \mathcal{L}_m\), where \(K_m\) is Krylov subspace of dimension \(m\) defined by \(K_m=\{r_0, Ar_0, A^2r_0,...,A^{m-1}r_0\}\) and \(\mathcal{L}_m\) is another subspace of dimension \(m\) which depends on type of Krylov subspace. For example in GMRES, \(\mathcal{L}_m=AK_m\).
We have built an interface which is easy to use, so that these different packages are called in FreeFEM in the same way: you just have to load the solver and then specify the parameters to apply to the specific solver. In the rest of this chapter, when we talk about Krylov subspace methods we mean one among GMRES, CG and BICGSTAB.
pARMS solver
pARMS (parallel Algebraic Recursive Multilevel Solver) is a software package developed by Yousef Saad et al. at the University of Minnesota.
This software is specialized in the solution of large sparse nonsymmetric linear systems of equations. The solvers developed in pARMS are of Krylov subspace type.
They consist of variants of GMRES like FGMRES (Flexible GMRES) and DGMRES (Deflated GMRES) [SAAD2003], and BICGSTAB. pARMS also implements parallel preconditioners like RAS (Restricted Additive Schwarz) [CAI1989] and Schur complement type preconditioners.
All these parallel preconditioners are based on the principle of domain decomposition. Thus, the matrix \(A\) is partitioned into sub matrices \(A_i\)(\(i=1,...,p\)) where p represents the number of partitions one needs. The union of \(A_i\) forms the original matrix. The solution of the overall system is obtained by solving the local systems on \(A_i\) (see [SMITH1996]). Therefore, a distinction is made between iterations on \(A\) and the local iterations on \(A_i\).
To solve the local problem on \(A_i\) there are several preconditioners as ilut (Incomplete LU with threshold), iluk (Incomplete LU with level of fill in) and ARMS (Algebraic Recursive Multilevel Solver).
Tip
Default parameters
1load "parms_FreeFem" //Tell FreeFem that you will use pARMS
2
3// Mesh
4border C(t=0, 2*pi){x=cos(t); y=sin(t); label=1;}
5mesh Th = buildmesh (C(50));
6
7// Fespace
8fespace Vh(Th, P2); Vh u, v;
9
10// Function
11func f= x*y;
12
13// Problem
14problem Poisson (u, v, solver=sparsesolver)
15 = int2d(Th)(
16 dx(u)*dx(v)
17 + dy(u)*dy(v) )
18 + int2d(Th)(
19 - f*v
20 )
21 + on(1, u=0) ;
22
23// Solve
24real cpu = clock();
25Poisson;
26cout << " CPU time = " << clock()-cpu << endl;
27
28// Plot
29plot(u);
In line 1, the pARMS dynamic library is loaded with its FreeFEM interface. After this, in line 14 we specify that the bilinear form will be solved by the last sparse linear solver loaded in memory which, in this case, is pARMS. The parameters used by pARMS in this case are the default ones, since the user did not provide any parameter.
Note
In order to see the plot of a parallel script, run the command FreeFem++-mpi -glut ffglut script.edp
Here are some default parameters:

solver=FGMRES,
Krylov dimension=30,
Maximum of Krylov=1000,
Tolerance for convergence=1e-08 (see the book [SAAD2003] to understand all these parameters),
preconditioner=Restricted Additive Schwarz [CAI1989],
Inner Krylov dimension=5,
Maximum of inner Krylov dimension=5,
Inner preconditioner=ILUK.
To specify the parameters to apply to the solver, the user can either give an integer vector for integer parameters and real vectors for real parameters or provide a file which contains those parameters.
Tip
User specifies parameters inside two vectors
Let us consider the Navier-Stokes example. In this example we solve linear systems coming from the discretization of the Navier-Stokes equations with pARMS. The parameters of the solver are specified by the user.
1load "parms_FreeFem"
2
3// Parameters
4real nu = 1.;
5int[int] iparm(16);
6real[int] dparm(6);
7for (int ii = 0; ii < 16; ii++)
8 iparm[ii] = -1;
9for (int ii = 0; ii < 6; ii++)
10 dparm[ii] = -1.0; iparm[0]=0;
11
12// Mesh
13mesh Th = square(10, 10);
14int[int] wall = [1, 3];
15int inlet = 4;
16
17// Fespace
18fespace Vh(Th, [P2, P2, P1]);
19
20// Function
21func uc = 1.;
22
23// Problem
24varf Stokes ([u, v, p], [ush, vsh, psh], solver=sparsesolver)
25 = int2d(Th)(
26 nu*(
27 dx(u)*dx(ush)
28 + dy(u)*dy(ush)
29 + dx(v)*dx(vsh)
30 + dy(v)*dy(vsh)
31 )
32 - p*psh*(1.e-6)
33 - p*(dx(ush) + dy(vsh))
34 - (dx(u) + dy(v))*psh
35 )
36 + on(wall, wall, u=0., v=0.)
37 + on(inlet, u=uc, v=0) ;
38
39matrix AA = Stokes(Vh, Vh);
40set(AA, solver=sparsesolver, lparams=iparm, dparams=dparm); //set pARMS as linear solver
41real[int] bb = Stokes(0, Vh);
42real[int] sol(AA.n);
43sol = AA^-1 * bb;
We need two vectors to specify the parameters of the linear solver. In lines 5-6 of the example, we have declared these vectors (int[int] iparm(16); real[int] dparm(6);). In lines 7-10 we have initialized these vectors with negative values. We do this because all parameter values in pARMS are positive; if you do not change the negative value of an entry of these vectors, the default value will be used.
In Table 7 and Table 8, we give the meaning of the different entries of these vectors. We ran this example on the Paradent cluster of Grid5000 and report the results in Table 5.
\(n=471281\), \(nnz=13\times10^6\), \(Te=571.29\)

| np | add(iluk): nit | add(iluk): time | shur(iluk): nit | shur(iluk): time |
|---|---|---|---|---|
| 4 | 230 | 637.57 | 21 | 557.8 |
| 8 | 240 | 364.12 | 22 | 302.25 |
| 16 | 247 | 212.07 | 24 | 167.5 |
| 32 | 261 | 111.16 | 25 | 81.5 |

| symbol | meaning |
|---|---|
| n | matrix size |
| nnz | number of non-null entries inside the matrix |
| nit | number of iterations for convergence |
| time | time for convergence |
| Te | time for constructing the finite element matrix |
| np | number of processors |
In this example, we fix the matrix size (in term of finite element, we fix the mesh) and increase the number of processors used to solve the linear system. We saw that, when the number of processors increases, the time for solving the linear equation decreases, even if the number of iteration increases. This proves that, using pARMS as solver of linear systems coming from discretization of partial differential equation in FreeFEM can decrease drastically the total time of simulation.
| Entries of iparm | Signification of each entry |
|---|---|
| iparm[0] | Krylov subspace method. The different values for this parameter are specified in Table 9 |
| iparm[1] | Preconditioner. The different preconditioners for this parameter are specified in Table 10 |
| iparm[2] | Krylov subspace dimension in outer iteration: default value 30 |
| iparm[3] | Maximum of iterations in outer iteration: default value 1000 |
| iparm[4] | Number of levels in arms when used |
| iparm[5] | Krylov subspace dimension in inner iteration: default value 3 |
| iparm[6] | Maximum of iterations in inner iteration: default value 3 |
| iparm[7] | Symmetric (=1 for symmetric) or unsymmetric matrix: default value 0 (unsymmetric matrix) |
| iparm[8] | Overlap size between different subdomains: default value 0 (no overlap) |
| iparm[9] | Scale the input matrix or not: default value 1 (matrix should be scaled) |
| iparm[10] | Block size in arms when used: default value 20 |
| iparm[11] | lfil0 (ilut, iluk, and arms): default value 20 |
| iparm[12] | lfil for Schur complement const: default value 20 |
| iparm[13] | lfil for Schur complement const: default value 20 |
| iparm[14] | Multicoloring or not in ILU when used: default value 1 |
| iparm[15] | Inner iteration: default value 0 |
| iparm[16] | Print message when solving: default 0 (no message printed) |
| Values of iparm[0] | Krylov subspace method |
|---|---|
| 0 | FGMRES (Flexible GMRES) |
| 1 | DGMRES (Deflated GMRES) |
| 2 | BICGSTAB |
Interfacing with HIPS
HIPS (Hierarchical Iterative Parallel Solver) is a scientific library that provides an efficient parallel iterative solver for very large sparse linear systems. HIPS is available as free software under the CeCILL-C licence.
HIPS implements two solver classes: an iterative class (GMRES, PCG) and a direct class. Concerning preconditioners, HIPS implements a type of multilevel ILU. For further information on these preconditioners see the HIPS documentation.
Tip
Laplacian 3D solved with HIPS
Let us consider the 3D Laplacian example inside the FreeFEM package, where after discretization we want to solve the linear system with HIPS.
The following example is a 3D Laplacian using HIPS as linear solver. We first load the HIPS solver at line 2. From line 7 to 20 we specify the parameters for the HIPS solver and in line 80 we set these parameters in the linear solver.
In Table 11, results of runs on the Paradent cluster of Grid5000 are reported. We can see in this running example the efficiency of parallelism.
1load "msh3"
2load "hips_FreeFem" //load Hips library
3
4// Parameters
5int nn = 10;
6real zmin = 0, zmax = 1;
7int[int] iparm(14);
8real[int] dparm(6);
9for (int iii = 0; iii < 14; iii++)
10 iparm[iii] = -1;
11for (int iii = 0; iii < 6; iii++)
12 dparm[iii] = -1;
13iparm[0] = 0; //use iterative solver
14iparm[1] = 1; //PCG as Krylov method
15iparm[4] = 0; //Matrix are symmetric
16iparm[5] = 1; //Pattern are also symmetric
17iparm[9] = 1; //Scale matrix
18dparm[0] = 1e-13; //Tolerance to convergence
19dparm[1] = 5e-4; //Threshold in ILUT
20dparm[2] = 5e-4; //Threshold for Schur preconditionner
21
22// Functions
23func ue = 2*x*x + 3*y*y + 4*z*z + 5*x*y + 6*x*z + 1;
24func uex = 4*x + 5*y + 6*z;
25func uey = 6*y + 5*x;
26func uez = 8*z + 6*x;
27func f = -18.;
28
29// Mesh
30mesh Th2 = square(nn, nn);
31
32int[int] rup = [0,2], rdown=[0, 1];
33int[int] rmid=[1, 1, 2, 1, 3, 1, 4, 1];
34
35mesh3 Th=buildlayers(Th2, nn, zbound=[zmin, zmax], reffacemid=rmid,
36 reffaceup = rup, reffacelow = rdown);
37
38// Fespace
39fespace Vh2(Th2, P2);
40Vh2 ux, uz, p2;
41
42fespace Vh(Th, P2);
43Vh uhe = ue;
44cout << "uhe min =" << uhe[].min << ", max =" << uhe[].max << endl;
45Vh u, v;
46Vh F;
47
48// Macro
49macro Grad3(u) [dx(u), dy(u), dz(u)] //
50
51// Problem
52varf va (u, v)
53 = int3d(Th)(
54 Grad3(v)' * Grad3(u)
55 )
56 + int2d(Th, 2)(
57 u*v
58 )
59 - int3d(Th)(
60 f*v
61 )
62 - int2d(Th, 2)(
63 ue*v + (uex*N.x + uey*N.y + uez*N.z)*v
64 )
65 + on(1, u=ue);
66
67varf l (unused, v) = int3d(Th)(f*v);
68
69real cpu=clock();
70matrix Aa = va(Vh, Vh);
71
72F[] = va(0, Vh);
73
74if (mpirank == 0){
75 cout << "Size of A =" << Aa.n << endl;
76 cout << "Non zero coefficients =" << Aa.nbcoef << endl;
77 cout << "CPU TIME FOR FORMING MATRIX =" << clock()-cpu << endl;
78}
79
80set(Aa, solver=sparsesolver, dparams=dparm, lparams=iparm); //Set hips as linear solver
81
82// Solve
83u[] = Aa^-1*F[];
84
85// Plot
86plot(u);
\(n=4\times 10^6\), \(nnz=118 \times 10^6\), \(Te=221.34\)

| np | nit | time |
|---|---|---|
| 8 | 190 | 120.34 |
| 16 | 189 | 61.08 |
| 32 | 186 | 31.70 |
| 64 | 183 | 23.44 |
Interfacing with HYPRE
Hypre (High Performance Preconditioners) is a suite of parallel preconditioners developed at Lawrence Livermore National Lab.
There are two main classes of preconditioners developed in HYPRE: AMG (Algebraic MultiGrid) and ParaSails (Parallel Sparse Approximate Inverse).
Now, suppose we want to solve \(Ax=b\).
At the heart of AMG there is a series of progressively coarser (smaller) representations of the matrix \(A\). Given an approximation \(\hat{x}\) of the solution \(x\), consider solving the residual equation \(Ae=r\) to find the error \(e\), where \(r=b-A\hat{x}\). A fundamental principle of AMG is that the error left after smoothing is algebraically smooth. To reduce the algebraically smooth errors further, they need to be represented by a smaller defect equation (the coarse grid residual equation) \(A_ce_c=r_c\), which is cheaper to solve. After solving this coarse equation, the solution is interpolated back onto the fine grid represented here by the matrix \(A\). The quality of AMG depends on the choice of coarsening and interpolation operators.
The sparse approximate inverse approximates the inverse of a matrix \(A\) by a sparse matrix \(M\). A technique to construct the matrix \(M\) is to minimize the Frobenius norm of the residual matrix \(I-MA\). For more details on this preconditioning technique see [CHOW1997].
HYPRE implements three Krylov subspace solvers: GMRES, PCG and BiCGStab.
Tip
Laplacian 3D solved with HYPRE
Let us consider again the 3D Laplacian example inside the FreeFEM package, where after discretization we want to solve the linear system with Hypre. The following example is a 3D Laplacian using Hypre as linear solver. This is the same example as the HIPS one, so we just show here the lines where we set some Hypre parameters.

We first load the Hypre solver at line 2. From line 6 to 20 we specify the parameters of the Hypre solver and in line 24 we set these parameters in the linear solver. It should be noted that the meaning of the entries of these vectors is different from those of HIPS. In the case of HYPRE, the meaning of the different entries of the vectors iparm and dparm is given in Table 14 to Table 18.

In Table 19 the results of runs on the Paradent cluster of Grid5000 are reported. We can see in this running example the efficiency of parallelism, in particular when AMG is used as preconditioner.
1load "msh3"
2load "hipre_FreeFem" //Load Hipre librairy
3
4// Parameters
5int nn = 10;
6int[int] iparm(20);
7real[int] dparm(6);
8for (int iii = 0; iii < 20; iii++)
9 iparm[iii] = -1;
10for (int iii = 0; iii < 6; iii++)
11 dparm[iii] = -1;
12iparm[0] = 2; //PCG as krylov method
13iparm[1] = 0; //AMG as preconditionner 2: if ParaSails
14iparm[7] = 7; //Interpolation
15iparm[9] = 6; //AMG Coarsen type
16iparm[10] = 1; //Measure type
17iparm[16] = 2; //Additive schwarz as smoother
18dparm[0] = 1e-13; //Tolerance to convergence
19dparm[1] = 5e-4; //Threshold
20dparm[2] = 5e-4; //Truncation factor
21
22...
23
24set(Aa, solver=sparsesolver, dparams=dparm, lparams=iparm);
Among the HYPRE parameters described in Table 14 to Table 18 are: the row size in Parallel ILUT (default 1000), the maximum number of iterations (default 30) and the drop tolerance in Parallel ILUT (default 1e-5).
\(n=4\times10^6\), \(nnz=13\times10^6\), \(Te = 571.29\)

| np | AMG: nit | AMG: time |
|---|---|---|
| 8 | 6 | 1491.83 |
| 16 | 5 | 708.49 |
| 32 | 4 | 296.22 |
| 64 | 4 | 145.64 |
Conclusion
With the different runs presented here, we wanted to illustrate the gain in time when we increase the number of processors used for the simulations. We saw that in every case the time for the construction of the finite element matrix is constant. This is normal because, until now, this phase is sequential in FreeFEM. In contrast, the phases for solving the linear system are parallel. We saw in several examples presented here that when we increase the number of processors, in general we decrease the time used for solving the linear systems. But this is not true in every case. In several cases, when we increase the number of processors, the time to convergence also increases. There are two main reasons for this. First, the increase in the number of processors can lead to an increase in the volume of data exchanged across processors, consequently increasing the time for solving the linear systems.
Furthermore, in domain decomposition type preconditioners, the number of processors generally corresponds to the number of subdomains. In subdomain methods, when we increase the number of subdomains we generally decrease the convergence quality of the preconditioner. This can increase the time used for solving linear equations.
To end, we should note that good use of the preconditioners interfaced in FreeFEM is empirical, because it is difficult to know what a good preconditioner is for some types of problems. Moreover, the efficiency of a preconditioner sometimes depends on how its parameters are set. For this reason we advise the user to pay attention to the meaning of the parameters in the user guides of the iterative solvers interfaced in FreeFEM.
Domain decomposition
In the previous section, we saw that the phases to construct a matrix are sequential. One strategy to construct the matrix in parallel is to divide geometrically the domain into subdomains. In every subdomain we construct a local submatrix and after that we assemble every submatrix to form the global matrix.
We can use this technique to solve PDEs directly in domain \(\Omega\). In this case, in every subdomain you have to define artificial boundary conditions to form consistent equations in every subdomain. After this, you solve the equation in every subdomain and define a strategy to obtain the global solution.
In terms of parallel programming for FreeFEM, with MPI, this means that the user must be able to divide the processors available for computation into subgroups of processors and also must be able to realize different types of communication in a FreeFEM script. Here is a wrapper of some MPI functions.
Communicators and groups
Groups
mpiGroup grpe(mpiGroup gp, KN_&lt;long&gt;): Creates an MPI_Group from the existing group gp using the given integer vector.
Communicators
A communicator is an abstract MPI object which allows MPI users to communicate across groups of processors. Communicators can be intra-communicators (involving a single group) or inter-communicators (involving two groups). When we do not specify the type of communicator, it will be an intra-communicator.
mpiComm cc(mpiComm comm, mpiGroup gp): Creates a new communicator. comm is a communicator (handle), gp is a group which is a subset of the group of comm (handle). Returns the new communicator.
mpiComm cc(mpiGroup gp): Same as the previous constructor, but the default comm here is MPI_COMM_WORLD.
mpiComm cc(mpiComm comm, int color, int key): Creates new communicators based on color and key. This constructor is based on the MPI_Comm_split routine of MPI.
mpiComm cc(MPIrank p, int key): Same constructor as the previous one. Here color and comm are defined in MPIrank. This constructor is based on the MPI_Comm_split routine of MPI.
Tip
Split communicator
mpiComm comm(mpiCommWorld, 0, 0);
int color = mpiRank(comm)%2;
mpiComm ccc(processor(color, comm), 0);
mpiComm qpp(comm, 0, 0);
mpiComm cp(ccc, color, 0);
mpiComm cc(mpiComm comm, int high): Creates an intracommunicator from an intercommunicator. comm is the intercommunicator; high (logical) is used to order the groups within comm when creating the new communicator. This constructor is based on the MPI_Intercomm_merge routine of MPI.
mpiComm cc(MPIrank p1, MPIrank p2, int tag): This constructor creates an intercommunicator from two intracommunicators. p1 defines the local (intra)communicator and the rank in local_comm of the leader (often 0), while p2 defines the remote communicator and the rank in peer_comm of the remote leader (often 0). tag is the message tag to use in constructing the intercommunicator. This constructor is based on MPI_Intercomm_create.
Tip
Merge
mpiComm comm, cc;
int color = mpiRank(comm)%2;
int rk = mpiRank(comm);
int size = mpiSize(comm);
cout << "Color values: " << color << endl;
mpiComm ccc(processor((rk<size/2), comm), rk);
mpiComm cp(cc, color, 0);
int rleader;
if (rk == 0){ rleader = size/2; }
else if (rk == size/2){ rleader = 0; }
else{ rleader = 3; }
mpiComm qqp(processor(0, ccc), processor(rleader, comm), 12345);
int aaa = mpiSize(ccc);
cout << "Number of processor: " << aaa << endl;
Process
In FreeFEM we wrap MPI processes with the function call processor, which creates an internal FreeFEM object called MPIrank. This means that you do not use MPIrank directly in a FreeFEM script.
processor(int rk): Keeps the process rank inside the object MPIrank. The rank is inside MPI_COMM_WORLD.
processor(int rk, mpiComm cc) and processor(mpiComm cc, int rk): The process of rank rk inside communicator cc.
processorblock(int rk): This function is exactly the same as processor(int rk) but is used in the case of blocking communication.
processorblock(int rk, mpiComm cc): This function is exactly the same as processor(int rk, mpiComm cc) but uses a synchronization point.
Points to Points communicators
In FreeFEM you can call MPI point-to-point communication functions.
Send(processor(int rk, mpiComm cc), Data D): Blocking send of Data D to the processor of rank rk inside communicator cc. Note that Data D can be: int, real, complex, int[int], real[int], complex[int], Mesh, Mesh3, Matrix.
Recv(processor(int rk, mpiComm cc), Data D): Receive Data D from the process of rank rk in communicator cc. Note that Data D can be: int, real, complex, int[int], real[int], complex[int], Mesh, Mesh3, Matrix, and should be of the same type as in the corresponding send.
Isend(processor(int rk, mpiComm cc), Data D): Non-blocking send of Data D to the processor of rank rk inside communicator cc. Note that Data D can be: int, real, complex, int[int], real[int], complex[int], mesh, mesh3, matrix.

Recv(processor(int rk, mpiComm cc), Data D): Receive corresponding to the send.
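For instance, a minimal sketch of a blocking exchange between ranks 0 and 1 (to be run with at least two MPI processes; the value sent is arbitrary):

mpiComm comm(mpiCommWorld, 0, 0);
real a = 0;
if (mpirank == 0){
    a = 3.14;
    Send(processor(1, comm), a); // blocking send to rank 1
}
else if (mpirank == 1){
    Recv(processor(0, comm), a); // blocking receive from rank 0
    cout << "rank 1 received a = " << a << endl;
}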
Global operations
In FreeFEM you can call MPI global communication functions.
broadcast(processor(int rk, mpiComm cc), Data D): Process rk broadcasts Data D to all processes inside communicator cc. Note that Data D can be: int, real, complex, int[int], real[int], complex[int], Mesh, Mesh3, Matrix.
broadcast(processor(int rk), Data D): Process rk broadcasts Data D to all processes inside MPI_COMM_WORLD. Note that Data D can be: int, real, complex, int[int], real[int], complex[int], Mesh, Mesh3, Matrix.
mpiAlltoall(Data a, Data b): Sends data a from all processes to all processes. The receive buffer is Data b. This is done inside communicator MPI_COMM_WORLD.
mpiAlltoall(Data a, Data b, mpiComm cc): Sends data a from all processes to all processes. The receive buffer is Data b. This is done inside communicator cc.
mpiGather(Data a, Data b, processor(mpiComm, int rk)): Gathers together values Data a from a group of processes. The process of rank rk gets the gathered data in Data b on the given communicator. This function is like MPI_Gather.
mpiAllgather(Data a, Data b): Gathers Data a from all processes and distributes it to all in Data b. This is done inside communicator MPI_COMM_WORLD. This function is like MPI_Allgather.
mpiAllgather(Data a, Data b, mpiComm cc): Gathers Data a from all processes and distributes it to all in Data b. This is done inside communicator cc. This function is like MPI_Allgather.
mpiScatter(Data a, Data b, processor(int rk, mpiComm cc)): Sends Data a from the process with rank rk to all other processes in the group represented by communicator mpiComm cc.
mpiReduce(Data a, Data b, processor(int rk, mpiComm cc), MPI_Op op): Reduces values Data a from all processes to a single value Data b on the process of rank rk in communicator cc. The operation used in the reduction is MPI_Op op, which can be: mpiMAX, mpiMIN, mpiSUM, mpiPROD, mpiLAND, mpiLOR, mpiLXOR, mpiBAND, mpiBXOR, mpiMAXLOC, mpiMINLOC.
Note that, for all global operations, only the int[int] and real[int] data types are taken into account in FreeFEM.
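For example, a minimal sketch of a global sum with mpiReduce (to be run on several processes; the array size and values are only illustrative):

mpiComm comm(mpiCommWorld, 0, 0);
real[int] a(1), b(1);
a = mpirank; // local contribution of each process
b = 0;
mpiReduce(a, b, processor(0, comm), mpiSUM); // b[0] = sum of all ranks, gathered on rank 0
if (mpirank == 0)
    cout << "sum of ranks = " << b[0] << endl;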