Unable to get output from MapD

Hello MapD Team,

We are facing a strange issue in our MapD system, whereby we are unable to get a response from MapD. When we run the query below, this is what we see on the MapD command-line terminal.

MapD version:

omnisql> \version
OmniSci Server Version: 5.10.2-20220218-4112053580

Thrift: Thu Jul 21 11:31:22 2022 TSocket::open() connect() <Host: localhost Port: 6274>: Connection refused
User wdbsreport connected to database wdbsreportdb
omnisql> select count(1) from WDBS_ZONE;
Thrift error: No more data to read.
Thrift connection error: No more data to read.
Retrying connection
Thrift error: No more data to read.
Thrift connection error: No more data to read.
Retrying connection
Thrift: Thu Jul 21 11:31:38 2022 TSocket::write_partial() send() <Host: localhost Port: 6274>: Broken pipe
Thrift error: write() send(): Broken pipe
Thrift connection error: write() send(): Broken pipe
Retrying connection

When we check the logs, we see the following:

==> omnisci_server.INFO <==
2022-07-21T11:31:29.562848 I 89496 0 4 DBHandler.cpp:2503 stdlog get_tables_for_database 7 0 wdbsreportdb calcite 128-19xu {"client"} {"tcp:localhost:11048"}
2022-07-21T11:31:29.570516 I 89496 0 5 DBHandler.cpp:2327 stdlog get_internal_table_details_for_database 8 0 wdbsreportdb calcite 128-19xu {"table_name","client"} {"WDBS_ZONE","tcp:localhost:11050"}

==> omnisci_server.INFO.20220721-113003.log <==
2022-07-21T11:31:29.562848 I 89496 0 4 DBHandler.cpp:2503 stdlog get_tables_for_database 7 0 wdbsreportdb calcite 128-19xu {"client"} {"tcp:localhost:11048"}
2022-07-21T11:31:29.570516 I 89496 0 5 DBHandler.cpp:2327 stdlog get_internal_table_details_for_database 8 0 wdbsreportdb calcite 128-19xu {"table_name","client"} {"WDBS_ZONE","tcp:localhost:11050"}

==> omnisci_server.INFO <==
2022-07-21T11:31:29.961273 I 89496 0 2 Calcite.cpp:573 Time in Thrift 19 (ms), Time in Java Calcite server 1271 (ms)
2022-07-21T11:31:29.961596 F 89496 0 2 FileMgr.cpp:1118 UNREACHABLE
2022-07-21T11:31:30.728515 I 89496 0 6 MapDServer.cpp:323 Interrupt signal (6) received.

==> omnisci_server.INFO.20220721-113003.log <==
2022-07-21T11:31:29.961273 I 89496 0 2 Calcite.cpp:573 Time in Thrift 19 (ms), Time in Java Calcite server 1271 (ms)
2022-07-21T11:31:29.961596 F 89496 0 2 FileMgr.cpp:1118 UNREACHABLE
2022-07-21T11:31:30.728515 I 89496 0 6 MapDServer.cpp:323 Interrupt signal (6) received.

==> omnisci_server.WARNING <==
2022-07-21T11:31:29.961596 F 89496 0 2 FileMgr.cpp:1118 UNREACHABLE

Any feedback/help will be much appreciated.

Hi @Raj_Kiran,

To get an idea: when did you start to have this issue? Is this issue confined to this particular table, or to a particular database?

A select count(*) on the table wouldn't even access the data, only the metadata. What happens if you run select count(field_name_nullable) on the table?

Thanks in advance,
Candido

Then you can check the status of the files in the filesystem this way.

run

heavysql> show databases;
Database|Owner
omnisci|admin
adsb|admin
asof|admin

Starting from the omnisci database, which has an id of 1, count down until you reach your database. In my example I connected to the database asof, which has an id of 3.

then run

Run show table details WDBS_ZONE and take the first number: that's the table_id. Then go to your data directory (typically /var/lib/omnisci) and check the status of the table's directory and files with the ls command. So, if the table_id is 10:

ls -la /var/lib/omnisci/data/mapd_data/table_3_10

You should get output like this:

drwxr-xr-x   2 mapd mapd      4096 lug 21 12:02 .
drwxrwxr-x 401 mapd mapd     20480 lug 21 11:53 ..
-rw-r--r--   1 mapd mapd 536870912 giu 30  2019 0.2097152.mapd
-rw-r--r--   1 mapd mapd  16777216 giu 30  2019 1.4096.mapd
-rw-r--r--   1 mapd mapd         4 giu 30  2019 epoch
-rw-rw-r--   1 mapd mapd        16 lug 21 12:02 epoch_metadata
-rw-rw-r--   1 mapd mapd         4 lug 21 12:02 filemgr_version
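The directory-naming convention described above (table_<database_id>_<table_id> under mapd_data) can be sketched like this; the ids and data directory are just the example values from this post:

```shell
# Build the path of a table's directory from its database id and table id.
# DB_ID and TABLE_ID are the example values from this post (asof = 3, table 10).
DB_ID=3
TABLE_ID=10
DATA_DIR=/var/lib/omnisci/data          # typical default data directory
TABLE_DIR="${DATA_DIR}/mapd_data/table_${DB_ID}_${TABLE_ID}"
echo "$TABLE_DIR"                        # the directory to inspect with ls -la
```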

After that, try this command:
xxd /var/lib/omnisci/data/mapd_data/table_3_10/filemgr_version

and share the output of the commands with us.
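If xxd is not installed on the server, od (part of coreutils) prints the same information. A minimal sketch on a scratch file, using the fact from this thread that a healthy filemgr_version is a 4-byte file holding version 1 in little-endian order (bytes 01 00 00 00):

```shell
# Write the expected healthy content to a scratch file, NOT to the real data
# directory, and dump it byte by byte.
SCRATCH=$(mktemp -d)
printf '\001\000\000\000' > "$SCRATCH/filemgr_version"   # octal escapes for 0x01 0x00 0x00 0x00
od -An -tx1 "$SCRATCH/filemgr_version"                   # prints the four bytes 01 00 00 00
```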

Can I ask if you tried an upgrade to 6.0 that failed, and then did a sort of rollback?

Regards,
Candido

Hello @candido.dessanti

This happens only for a few tables. Sorry, we are unable to run even the show table details command, as below:

omnisql> show databases;
Database|Owner
mapd|mapd
wdbsreportdb|wdbsreport
omnisql>
omnisql> show table details WDBS_ZONE
..> ;

When we run the above command, MapD freezes; it doesn't even allow login, and prints the error below when we try to log in from another terminal:

/opt/omnisci/bin/omnisql XXXX -u XXXXXX -p XXXXXXXX

Thrift: Thu Jul 21 16:18:01 2022 TSocket::open() connect() <Host: localhost Port: 6274>: Connection refused
Thrift error: No more data to read.
Thrift connection error: No more data to read.
Retrying connection
Thrift: Thu Jul 21 16:18:18 2022 TSocket::write_partial() send() <Host: localhost Port: 6274>: Broken pipe
Thrift error: write() send(): Broken pipe
Thrift connection error: write() send(): Broken pipe
Retrying connection
Thrift: Thu Jul 21 16:18:22 2022 TSocket::write_partial() send() <Host: localhost Port: 6274>: Broken pipe
Thrift error: write() send(): Broken pipe
Thrift connection error: write() send(): Broken pipe
Retrying connection
Thrift: Thu Jul 21 16:18:30 2022 TSocket::write_partial() send() <Host: localhost Port: 6274>: Broken pipe
Thrift error: write() send(): Broken pipe
Thrift connection error: write() send(): Broken pipe
Retrying connection

We have tried to extract the filesystem details as below:

sqlite> select dbid,name from mapd_databases;
1|mapd
2|wdbsreportdb

sqlite> select name,tableid from mapd_tables where name='WDBS_ZONE';
WDBS_ZONE|19
sqlite>

[wdbs@pcrfreporting mapd_data]$ ll | grep _19
drwxr-xr-x 2 root root 56 Feb 27 2019 DB_1_DICT_19
drwxr-xr-x 2 root root 63 May 26 11:30 table_2_19

[wdbs@pcrfreporting DB_1_DICT_19]$ ls -la /opt/data/data/mapd_data/DB_1_DICT_19
total 8208
drwxr-xr-x 2 root root 56 Feb 27 2019 .
drwxr-xr-x 307 root root 12288 Jul 21 15:45 ..
-rw-r--r-- 1 root root 4194304 Feb 27 2019 DictOffsets
-rw-r--r-- 1 root root 4194304 Feb 27 2019 DictPayload

[wdbs@pcrfreporting DB_1_DICT_19]$ ls -la /opt/data/data/mapd_data/table_2_19
total 24
drwxr-xr-x 2 root root 63 May 26 11:30 .
drwxr-xr-x 307 root root 12288 Jul 21 15:45 ..
-rw-r--r-- 1 root root 16 May 26 11:30 epoch_metadata
-rw-r--r-- 1 root root 5 May 26 11:30 filemgr_version
[wdbs@pcrfreporting DB_1_DICT_19]$

We have uploaded traces in the attachment:
debug.txt (2.4 KB)

@candido.dessanti sorry

Answering your other two queries:

No, we have not tried to upgrade to 6.0.

The xxd command is not available on our production server where MapD is running.

Well,

from what I can see here, the table is empty (it has probably been truncated?) and the filemgr_version file is badly formed. Can you post the output of the command:

xxd /opt/data/data/mapd_data/table_2_19/filemgr_version

(You can also try this:
back up the directory containing the table, like this: cp -r /opt/data/data/mapd_data/table_2_19/ /opt/data/data/mapd_data/table_2_19_backup
and then run
echo -n -e '\x1\x0\x0\x0' >/opt/data/data/mapd_data/table_2_19/filemgr_version
)
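A minimal rehearsal of the backup-then-patch step above, run against a scratch directory instead of the live data dir. One caveat: the single quotes around the \x escapes must be plain ASCII quotes; curly quotes pasted from a browser would make the shell write the literal characters instead of the four bytes.

```shell
# Rehearse the fix in a scratch directory, not the live data directory.
SCRATCH=$(mktemp -d)
mkdir -p "$SCRATCH/table_2_19"
printf 'bogus' > "$SCRATCH/table_2_19/filemgr_version"    # stand-in for the bad file

# 1. back up the whole table directory first
cp -r "$SCRATCH/table_2_19" "$SCRATCH/table_2_19_backup"

# 2. rewrite the version file with the expected four bytes (0x01 0x00 0x00 0x00);
#    printf with octal escapes is a portable equivalent of the echo -e command above
printf '\001\000\000\000' > "$SCRATCH/table_2_19/filemgr_version"

# 3. verify: the file must now be exactly 4 bytes, 01 00 00 00
od -An -tx1 "$SCRATCH/table_2_19/filemgr_version"
```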

Hello,

We have tried the suggested commands, as below:

[root@xxxxxxx ~]# cp -r /opt/data/data/mapd_data/table_2_19/ /opt/data/data/mapd_data/table_2_19_backup

[root@xxxxxxx ~]# echo -n -e ‘\x1\x0\x0\x0’ >/opt/data/data/mapd_data/table_2_19/filemgr_version

[wdbs@xxxxxxx ~]$ xxd /opt/data/data/mapd_data/table_2_19/filemgr_version
0000000: e280 9878 3178 3078 3078 30e2 8099  ...x1x0x0x0...
[wdbs@xxxxxxx ~]$

Sorry, we still have the same error.

Hi,

I have been able to reproduce the error by setting a negative number in filemgr_version, so with it set to 1 it should be impossible to get the error.

Are you perhaps querying another table?

Could you run the command
xxd /opt/data/data/mapd_data/table_2_19/filemgr_version

and also on another table that's working (18, maybe):
xxd /opt/data/data/mapd_data/table_2_18/filemgr_version

Hi @candido.dessanti,

Please see the attached doc with details of the working and non-working tables:
debug1.txt (1.8 KB)

Hi @raj,

looking at your data:
[wdbs@pcrfreporting log]$ xxd /opt/data/data/mapd_data/table_2_19/filemgr_version
0000000: e280 9878 3178 3078 3078 30e2 8099 ...x1x0x0x0...

This file looks corrupted. When you run the command
echo -n -e '\x1\x0\x0\x0' >/opt/data/data/mapd_data/table_2_19/filemgr_version

the resulting file should be 4 bytes, like this one:

0000000: 0100 0000

Have you moved the database to other disks lately?
Could you try this:
unmount and remount the filesystem where your data is located?

hi @candido.dessanti

Thanks for your feedback. We have not moved the data to any other disks, and all the tables within MapD reside on the same disk and mount point. Do you feel that if we try to drop and recreate the corrupted tables, it will help?

Hi,

I am not sure the tables are corrupted, but it looks like the filesystem is, because when you run the echo command you should get a 4-byte file with the content 01000000, not the random bytes you are getting. You can try removing the filemgr_version of table 2_19, restarting the database, and seeing what happens.

It looks like filesystem corruption to me; maybe some SSDs are failing in some parts. It happened to me once.

@candido.dessanti Okay. We shall seek a window from the customer and do the below:

rm -f /opt/data/data/mapd_data/table_2_19/filemgr_version

Then restart the database, assuming that is what you meant by "try removing".

Thank you

debug2.txt (311 Bytes)
Also please see the attached txt file, where we ran xxd on the backup file taken before running the echo command, and then on the latest file on which echo was run.

Did you run the echo on the table_2_19 file?

I’m seeing
Backup

[wdbs@pcrfreporting ~]$ xxd /opt/data/data/mapd_data/table_2_19/filemgr_version_210722
0000000: 0000 00ff ff  .....

After doing the echo:

[wdbs@pcrfreporting ~]$ xxd /opt/data/data/mapd_data/table_2_46/filemgr_version
0000000: 0100 0000

Hi, sorry.

Please refer to this debug3:

debug3.txt (385 Bytes)

Try to run

echo -n -e '\x1\x0\x0\x0' >/opt/data/data/mapd_data/table_2_19/filemgr_version
and then this, on the same file:
xxd /opt/data/data/mapd_data/table_2_19/filemgr_version

The database is crashing because an unexpected value is read, and it aborts the server to limit possible corruption.

So the possible solutions are fixing the filemgr_version files with the echo -n -e '\x1\x0\x0\x0' command, or removing them and letting the system re-create them. But I'm not sure that's going to work, because the values in those files cannot have come from the software, so check your disk and filesystem to be sure that you don't have corruption.
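Before deciding between patching and removing, it may help to see how many tables are affected. A sketch that scans for filemgr_version files whose content is not the expected four bytes; here DATA is a scratch tree populated for illustration, but on the affected system it would point at /opt/data/data/mapd_data:

```shell
# Scan every table directory for a filemgr_version that does not hold the
# expected bytes 01 00 00 00. DATA is a scratch tree for this sketch only.
DATA=$(mktemp -d)
mkdir -p "$DATA/table_2_18" "$DATA/table_2_19"
printf '\001\000\000\000' > "$DATA/table_2_18/filemgr_version"   # healthy file
printf 'x1x0x0x0'         > "$DATA/table_2_19/filemgr_version"   # corrupted stand-in

for f in "$DATA"/table_*/filemgr_version; do
  if [ "$(od -An -tx1 "$f" | tr -d ' \n')" != "01000000" ]; then
    echo "BAD: $f"
  fi
done
```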

Thanks @candido.dessanti. We shall try and update you by tomorrow.

Hi Raj,

I will wait for your feedback, then.