Automatic (warm) recovery uses the information stored inside the LIXA state server to decide which transaction must be recovered and how it should be completed (committed or rolled back).
The above paragraphs explain what happens when automatic recovery starts and completes (rolls back or commits) a transaction marked as “recovery pending”.
An “equivalent” Application Program starts and activates the LIXA Transaction Manager with tx_open().
The LIXA Transaction Manager autonomously coordinates the transaction completion, and the Application Program is not aware of this “under the covers” operation.
From the LIXA Transaction Manager's point of view, two Application Programs are equivalent when they are associated with the same job.
The job associated with an Application Program can be:
- the content of the environment variable LIXA_JOB, if it is set
- a string computed in this way, if the environment variable LIXA_JOB is not set:
  branch qualifier + “/” + IP address
  where branch qualifier is computed as:
  MD5(lixac_conf.xml + $(LIXA_PROFILE) + gethostid())
An example of branch qualifier is “0fc29445b1d4c3f4ed6be2fea20f918b”, while an example of a job automatically associated with an Application Program is “0fc29445b1d4c3f4ed6be2fea20f918b/127.0.0.1”.
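The composition of the default job string can be sketched in a few lines of Python. This is only an illustration of the formula given in the text: the function name default_job, the way the three inputs are concatenated, and the hexadecimal rendering of the gethostid() value are assumptions, so the output is not expected to match the values LIXA computes.

```python
import hashlib


def default_job(config_content: bytes, profile: str, host_id: int, ip_address: str) -> str:
    """Build a job string shaped like '<md5 hex digest>/<ip address>'.

    Assumption: the three inputs are simply concatenated as bytes; the
    real byte layout used by LIXA may differ.
    """
    digest = hashlib.md5()
    digest.update(config_content)                     # content of lixac_conf.xml
    digest.update(profile.encode("utf-8"))            # value of LIXA_PROFILE
    digest.update(f"{host_id:08x}".encode("ascii"))   # gethostid() value, hex text (assumed)
    branch_qualifier = digest.hexdigest()             # 32 hexadecimal characters
    return f"{branch_qualifier}/{ip_address}"


# Hypothetical inputs, used only to show the shape of the result.
job = default_job(b"<client/>", "PQL_STA_ORA_DYN", 0x007F0101, "127.0.0.1")
print(job)
```

Changing any one of the four inputs produces a different job string, which is why two Application Programs are only equivalent by default when all four match.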
If you don't set the environment variable LIXA_JOB, all the Application Programs that meet these requirements:
- they use a config file (lixac_conf.xml) with the same content
- they use a LIXA_PROFILE environment variable with the same content
- they run on a host that returns the same value from the gethostid() function
- they call the LIXA state server from the same IP address
are associated with the same job.
To find out which job is associated with an Application Program, you can activate the trace using the bit associated with the label “LIXA_TRACE_MOD_CLIENT_CONFIG”. Take a look at the section called “Tracing modules” for more information. This is an excerpt from the trace:
[...]
2011-12-03 17:00:59.746036 [6021/1078050640] client_config_job
2011-12-03 17:00:59.746073 [6021/1078050640] client_config_job: acquiring exclusive mutex
2011-12-03 17:00:59.746120 [6021/1078050640] client_config_job: 'LIXA_JOB' environment variable not found, computing job string...
2011-12-03 17:00:59.746175 [6021/1078050640] lixa_job_set_source_ip
2011-12-03 17:00:59.746275 [6021/1078050640] lixa_job_set_source_ip/excp=1/ret_cod=0/errno=0
2011-12-03 17:00:59.746339 [6021/1078050640] client_config_job: job value for this process is '0fc29445b1d4c3f4ed6be2fea20f918b/127.0.0.1 '
2011-12-03 17:00:59.746379 [6021/1078050640] client_config_job: releasing exclusive mutex
2011-12-03 17:00:59.746514 [6021/1078050640] client_config_job/excp=3/ret_cod=0/errno=0
[...]
Setting the environment variable LIXA_JOB allows you to associate any Application Program with a custom, user-defined job. This may be useful if you are using a workload balanced environment, but it may be dangerous if you associate Application Programs that use different sets of Resource Managers with the same job.
If you don't set the LIXA_JOB environment variable, the default behavior should be robust enough to avoid issues when LIXA is used under “standard” conditions.
The previous section (see the section called “Application Program equivalence”) explains the conditions that must be met to enable automatic recovery. A typical scenario that needs tuning is a workload balanced Application Server environment, as in the picture below:
The same program (“Application Program 1”) is executed by two different Application Servers: this is a typical configuration used to improve service availability and scalability.
If Application Server 1 is running on a different host from Application Server 2 (this is a de facto standard), by default LIXA will associate the two instances with two different jobs.
The LIXA default behavior is not the optimal one when you are using a workload balanced environment.
If the host of Application Server 1 crashed, the Application Program running inside Application Server 2 could not automatically recover the transactions left in the “prepared/in-doubt/recovery pending” state by Application Server 1, because they are associated with a different job.
This is a scenario where setting LIXA_JOB is strongly suggested.
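The effect of a shared LIXA_JOB in the workload balanced scenario can be sketched as follows. The function resolve_job, the two default job strings and the job name "application_program_1" are hypothetical; the sketch only mirrors the rule stated earlier (use LIXA_JOB if it is set, otherwise the computed default) and assumes that "set" means "present in the environment".

```python
def resolve_job(environ: dict, computed_default: str) -> str:
    """Return LIXA_JOB when it is set, otherwise the computed default job."""
    return environ.get("LIXA_JOB", computed_default)


# Hypothetical defaults computed on two different hosts.
default_on_server_1 = "0fc29445b1d4c3f4ed6be2fea20f918b/192.0.2.10"
default_on_server_2 = "5d41402abc4b2a76b9719d911017c592/192.0.2.20"

# Without LIXA_JOB the two instances belong to different jobs.
job_1 = resolve_job({}, default_on_server_1)
job_2 = resolve_job({}, default_on_server_2)
print(job_1 == job_2)  # False: server 2 cannot recover for server 1

# With the same LIXA_JOB exported on both hosts they share one job.
shared_env = {"LIXA_JOB": "application_program_1"}
job_1 = resolve_job(shared_env, default_on_server_1)
job_2 = resolve_job(shared_env, default_on_server_2)
print(job_1 == job_2)  # True: either instance can recover for the other
```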
When you set the LIXA_JOB environment variable to control the LIXA automatic recovery feature, you must not associate the same job with Application Programs that use different sets of Resource Managers, or that use the same set of Resource Managers but with different options for any Resource Manager.
If you break this rule, you will probably face issues that are difficult to troubleshoot: automatic recovery could fail and you would have to understand why.
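A simple pre-deployment check for this rule might look like the sketch below. It is not part of LIXA: the function name, the tuple layout used to describe each Application Program and the option strings are all hypothetical, and the sketch assumes that the order of the Resource Managers is significant.

```python
from collections import defaultdict


def jobs_with_mixed_resource_managers(programs):
    """Return the jobs shared by programs whose Resource Manager setups differ.

    programs: iterable of (program_name, job, resource_managers), where
    resource_managers is an ordered list of (rm_name, options) pairs.
    """
    configs_by_job = defaultdict(set)
    for _name, job, resource_managers in programs:
        configs_by_job[job].add(tuple(resource_managers))
    return sorted(job for job, configs in configs_by_job.items() if len(configs) > 1)


# Hypothetical inventory: option strings are placeholders.
two_rms = [("PostgreSQL_stareg", "pg-options-A"), ("OracleXE_dynreg", "ora-options-A")]
programs = [
    ("ap1_on_server1", "shared_job", two_rms),
    ("ap1_on_server2", "shared_job", two_rms),
    ("ap2", "mixed_job", [("PostgreSQL_stareg", "pg-options-A")]),
    ("ap3", "mixed_job", [("PostgreSQL_stareg", "pg-options-B")]),
]
print(jobs_with_mixed_resource_managers(programs))  # ['mixed_job']
```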
Sometimes you need to force the automatic recovery to happen because the crashed Application Program is a “one shot” program and you cannot execute it a second time due to some functional constraint.
Any Application Program that meets the requirements described above can be used, including the lixat utility command. The following example shows how it works using the PostgreSQL and Oracle Resource Managers.
First of all, you must configure, build and install the LIXA project software enabling PostgreSQL, Oracle and crash simulation features:
tiian@ubuntu:~/lixa$ ./configure --with-oracle=/usr/lib/oracle/xe/app/oracle/product/10.2.0/server \
> --with-postgresql-include=/usr/include/postgresql --with-postgresql-lib=/usr/lib \
> --enable-crash
then you must follow the steps described in the section called “An example with PostgreSQL & Oracle” to prepare the scenario environment. Open three different terminal sessions as explained in the above example, and try to insert/delete a row:
[Shell terminal session]
tiian@ubuntu:~/tmp$ echo $LIXA_PROFILE
PQL_STA_ORA_DYN
tiian@ubuntu:~/tmp$ echo $ORACLE_HOME
/usr/lib/oracle/xe/app/oracle/product/10.2.0/server
tiian@ubuntu:~/tmp$ echo $ORACLE_SID
XE
tiian@ubuntu:~/tmp$ echo $LD_LIBRARY_PATH
/usr/lib/oracle/xe/app/oracle/product/10.2.0/server/lib:
tiian@ubuntu:~/tmp$ ./example6_pql_ora insert
Inserting a row in the tables...
Oracle INSERT statement executed!
tiian@ubuntu:~/tmp$ ./example6_pql_ora delete
Deleting a row from the tables...
Oracle DELETE statement executed!
To simulate a crash after xa_prepare() has completed successfully, you can set the environment variable LIXA_CRASH_POINT to the value of LIXA_CRASH_POINT_PREPARE_2 (see src/common/lixa_crash.h):
[Shell terminal session]
tiian@ubuntu:~/tmp$ export LIXA_CRASH_POINT=15
tiian@ubuntu:~/tmp$ echo $LIXA_CRASH_POINT
15
tiian@ubuntu:~/tmp$ ./example6_pql_ora insert
Inserting a row in the tables...
Oracle INSERT statement executed!
Aborted
You can check there is a prepared (in-doubt) transaction inside Oracle:
[Oracle terminal session]
SQL> select * from dba_pending_transactions;

  FORMATID
----------
GLOBALID
--------------------------------------------------------------------------------
BRANCHID
--------------------------------------------------------------------------------
1279875137
97DD30A150604AFDBFA5FDC94B611FD5
9BAC7BE1C129EA6EE31F2D71B318120C
And the same transaction inside PostgreSQL:
[PostgreSQL terminal session]
testdb=> select * from pg_prepared_xacts;
 transaction |                                     gid                                      |           prepared            | owner | database
-------------+------------------------------------------------------------------------------+-------------------------------+-------+----------
         874 | 1279875137.97dd30a150604afdbfa5fdc94b611fd5.9bac7be1c129ea6ee31f2d71b318120c | 2011-12-14 22:02:50.462682+01 | tiian | testdb
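The two Resource Managers report the same transaction in different notations. The short Python sketch below, which is only an illustration and not a LIXA tool, splits the PostgreSQL gid into the three XA identifier parts (format ID, global transaction ID, branch qualifier) and compares them with the Oracle row. The dot-separated layout is read from the output shown here; the helper name split_gid is hypothetical.

```python
def split_gid(gid: str):
    """Split 'formatID.gtrid.bqual' into an int and two hex strings."""
    format_id, gtrid, bqual = gid.split(".")
    return int(format_id), gtrid, bqual


# Values copied from the two terminal sessions.
pg_gid = "1279875137.97dd30a150604afdbfa5fdc94b611fd5.9bac7be1c129ea6ee31f2d71b318120c"
oracle_row = (
    1279875137,                            # FORMATID
    "97DD30A150604AFDBFA5FDC94B611FD5",    # GLOBALID
    "9BAC7BE1C129EA6EE31F2D71B318120C",    # BRANCHID
)

format_id, gtrid, bqual = split_gid(pg_gid)

# Oracle prints the hexadecimal digits in upper case, PostgreSQL in lower case.
print((format_id, gtrid.upper(), bqual.upper()) == oracle_row)  # True

# Read as four big-endian bytes, the format ID spells an ASCII tag.
print(format_id.to_bytes(4, "big"))  # b'LIXA'
```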
It is advisable to activate the trace related to the “client recovery” module (see the section called “Tracing modules”) before running the lixat program:
[Shell terminal session]
tiian@ubuntu:~/tmp$ export LIXA_TRACE_MASK=0x00040000
tiian@ubuntu:~/tmp$ /opt/lixa/bin/lixat
2011-12-14 22:22:01.740634 [27735/3073944240] client_recovery
2011-12-14 22:22:01.740771 [27735/3073944240] client_recovery: sending 197 bytes ('000191<?xml version="1.0" encoding="UTF-8" ?><msg level="0" verb="8" step="8"><client job="9bac7be1c129ea6ee31f2d71b318120c/127.0.0.1 " config_digest="9bac7be1c129ea6ee31f2d71b318120c"/></msg>') to the server for step 8
2011-12-14 22:22:01.759352 [27735/3073944240] client_recovery: receiving 561 bytes from the server |<?xml version="1.0" encoding="UTF-8" ?><msg level="0" verb="8" step="16"><answer rc="0"/><client job="9bac7be1c129ea6ee31f2d71b318120c/127.0.0.1 " config_digest="9bac7be1c129ea6ee31f2d71b318120c"><last_verb_step verb="5" step="16"/><state finished="0" txstate="3" will_commit="1" will_rollback="0" xid="1279875137.97dd30a150604afdbfa5fdc94b611fd5.9bac7be1c129ea6ee31f2d71b318120c"/></client><rsrmgrs><rsrmgr rmid="0" next_verb="0" r_state="1" s_state="33" td_state="10"/><rsrmgr rmid="1" next_verb="0" r_state="1" s_state="33" td_state="20"/></rsrmgrs></msg>|
2011-12-14 22:22:01.759776 [27735/3073944240] client_recovery_analyze
2011-12-14 22:22:01.759857 [27735/3073944240] client_recovery_analyze: the TX was committing
2011-12-14 22:22:01.759873 [27735/3073944240] client_recovery_analyze: rmid=0, r_state=1, s_state=33, td_state=10
2011-12-14 22:22:01.759884 [27735/3073944240] client_recovery_analyze: rmid=1, r_state=1, s_state=33, td_state=20
2011-12-14 22:22:01.759902 [27735/3073944240] client_recovery_analyze/excp=1/ret_cod=0/errno=0
2011-12-14 22:22:01.759921 [27735/3073944240] client_recovery: transaction '1279875137.97dd30a150604afdbfa5fdc94b611fd5.9bac7be1c129ea6ee31f2d71b318120c' must be committed
2011-12-14 22:22:01.759937 [27735/3073944240] client_recovery_commit
2011-12-14 22:22:01.759971 [27735/3073944240] client_recovery_commit: committing transaction '1279875137.97dd30a150604afdbfa5fdc94b611fd5.9bac7be1c129ea6ee31f2d71b318120c'
2011-12-14 22:22:01.759998 [27735/3073944240] client_recovery_commit: xa_commit for rmid=0, name='PostgreSQL_stareg', xa_name='PostgreSQL[LIXA]'...
2011-12-14 22:22:02.143764 [27735/3073944240] client_recovery_commit: rc=0
2011-12-14 22:22:02.143866 [27735/3073944240] client_recovery_commit: xa_commit for rmid=1, name='OracleXE_dynreg', xa_name='Oracle_XA'...
2011-12-14 22:22:03.188211 [27735/3073944240] client_recovery_commit: rc=0
2011-12-14 22:22:03.188272 [27735/3073944240] client_recovery_commit/excp=1/ret_cod=0/errno=0
2011-12-14 22:22:03.188318 [27735/3073944240] client_recovery: sending 187 bytes ('000181<?xml version="1.0" encoding="UTF-8" ?><msg level="0" verb="8" step="24"><recovery failed="0" commit="1"/><rsrmgrs><rsrmgr rmid="0" rc="0"/><rsrmgr rmid="1" rc="0"/></rsrmgrs></msg>') to the server for step 24
2011-12-14 22:22:03.188496 [27735/3073944240] client_recovery: sending 197 bytes ('000191<?xml version="1.0" encoding="UTF-8" ?><msg level="0" verb="8" step="8"><client job="9bac7be1c129ea6ee31f2d71b318120c/127.0.0.1 " config_digest="9bac7be1c129ea6ee31f2d71b318120c"/></msg>') to the server for step 8
2011-12-14 22:22:03.228361 [27735/3073944240] client_recovery: receiving 95 bytes from the server |<?xml version="1.0" encoding="UTF-8" ?><msg level="0" verb="8" step="16"><answer rc="1"/></msg>|
2011-12-14 22:22:03.228544 [27735/3073944240] client_recovery: the server answered LIXA_RC_OBJ_NOT_FOUND; there are no more transactions to recover
2011-12-14 22:22:03.228589 [27735/3073944240] client_recovery/excp=12/ret_cod=0/errno=0
tx_open(): 0
tx_close(): 0
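To make the trace easier to read, the sketch below parses the two server answers shown above and extracts the fields that the trace appears to act upon. It is a reading aid, not the LIXA client logic: the function recovery_decision is hypothetical, and the mapping from will_commit and will_rollback to an action, as well as the treatment of rc="1" as "nothing left to recover", are inferred from this single trace.

```python
import xml.etree.ElementTree as ET

# First answer from the trace (561 bytes in the original, job padding trimmed).
ANSWER = (
    b'<?xml version="1.0" encoding="UTF-8" ?><msg level="0" verb="8" step="16">'
    b'<answer rc="0"/>'
    b'<client job="9bac7be1c129ea6ee31f2d71b318120c/127.0.0.1" '
    b'config_digest="9bac7be1c129ea6ee31f2d71b318120c">'
    b'<last_verb_step verb="5" step="16"/>'
    b'<state finished="0" txstate="3" will_commit="1" will_rollback="0" '
    b'xid="1279875137.97dd30a150604afdbfa5fdc94b611fd5.9bac7be1c129ea6ee31f2d71b318120c"/>'
    b'</client><rsrmgrs>'
    b'<rsrmgr rmid="0" next_verb="0" r_state="1" s_state="33" td_state="10"/>'
    b'<rsrmgr rmid="1" next_verb="0" r_state="1" s_state="33" td_state="20"/>'
    b'</rsrmgrs></msg>'
)
# Second answer from the trace: no more transactions to recover.
NOTHING_LEFT = (
    b'<?xml version="1.0" encoding="UTF-8" ?>'
    b'<msg level="0" verb="8" step="16"><answer rc="1"/></msg>'
)


def recovery_decision(message: bytes):
    """Return (xid, action, rmids), or None when the answer rc is not 0."""
    root = ET.fromstring(message)
    if root.find("answer").get("rc") != "0":
        return None  # assumption: any non-zero rc means there is nothing to do
    state = root.find("client/state")
    if state.get("will_commit") == "1":
        action = "commit"
    elif state.get("will_rollback") == "1":
        action = "rollback"
    else:
        action = "undecided"
    rmids = [int(rm.get("rmid")) for rm in root.findall("rsrmgrs/rsrmgr")]
    return state.get("xid"), action, rmids


print(recovery_decision(ANSWER))
print(recovery_decision(NOTHING_LEFT))  # None
```

For the first message the sketch reports the action "commit" for Resource Managers 0 and 1, which agrees with the "must be committed" line and the two xa_commit calls in the trace.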
You can now verify there are no more prepared/in-doubt transactions inside the Resource Managers:
[Oracle terminal session]
SQL> select * from dba_pending_transactions;

no rows selected
[PostgreSQL terminal session]
testdb=> select * from pg_prepared_xacts;
 transaction | gid | prepared | owner | database
-------------+-----+----------+-------+----------
(0 rows)
The automatic (warm) recovery process completed successfully because ./example6_pql_ora and /opt/lixa/bin/lixat were associated with the same job and the LIXA state server (lixad) kept the state of the transaction in the meantime.
In the next paragraphs you can explore what happens if the previous conditions are not satisfied.