Part 1: Comelec leak – what I discovered
Rappler came out with an article on the Comelec data leak on March 28, 2016. After I commented on the post on Facebook, Rappler invited me to study the leaked data in greater detail.
I'm writing this to make everyone aware of the repercussions of this leak and what we all must do to protect ourselves.
I sought the database on my own without guidance from Rappler, so I could assess how easily searchable and readily accessible the data is.
I started by searching Google and Twitter for links. The Facebook page of LulzsecPinas was prominent in most of the search results, and the page pointed to the download site (as of that time; the location changes often because of successful takedown efforts).
There was a torrent file on the list, so I tried that first. I got around 130MB worth of compressed data. The file contents were in a form familiar to me – MySQL – a database management system commonly used by many websites around the world. As a MySQL developer myself, I already had the system installed on my machine, so I uncompressed and loaded the data.
As I was browsing through the database tables (the technical term for a container of data arranged in rows and columns, similar to a spreadsheet), I saw mostly data that was relevant only to Comelec's internal systems, primarily Human Resources.
I didn’t get what the big deal was, other than that someone hacked the Comelec website and leaked all its data.
Downloading massive data
After reading a few more articles, I noticed references to other data that I hadn’t encountered yet, so I looked for what was missing from my download. It turned out that I had not yet downloaded the largest file – comweb.sql – which was about 70GB compressed and 380GB uncompressed.
From my research, this file supposedly contained data of 55 million voters. It was not in the torrent file, but it was there in the list of individual files available for download.
I tried downloading that large file on my machine first, but then I realized that was a stupid idea since the server was in the US and Internet speeds here are slow.
I estimated that the download would take about 3 to 4 days with my machine running around the clock. This was not doable.
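As a rough sketch of that estimate (not the exact computation I did at the time), the snippet below works backward from the figures above: a compressed file of about 70GB and a download lasting 3 to 4 days. The function names are made up for illustration, and the arithmetic assumes decimal units (1GB = 1,000MB) and a steady connection.

```python
SECONDS_PER_DAY = 86_400


def download_days(size_gb: float, speed_mbps: float) -> float:
    """Days needed to download size_gb gigabytes at speed_mbps megabits per second."""
    size_megabits = size_gb * 1000 * 8  # assumes 1 GB = 1000 MB; 1 byte = 8 bits
    return size_megabits / speed_mbps / SECONDS_PER_DAY


def implied_speed_mbps(size_gb: float, days: float) -> float:
    """Sustained speed (in Mbps) implied by downloading size_gb gigabytes in the given days."""
    size_megabits = size_gb * 1000 * 8
    return size_megabits / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    for days in (3, 4):
        speed = implied_speed_mbps(70, days)
        print(f"70 GB in {days} days implies about {speed:.2f} Mbps sustained")
```

Under those assumptions, a 3-to-4-day download of a 70GB file corresponds to a sustained speed of roughly 1.6 to 2.2 Mbps.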
I then realized that I could download the file to, and do my analysis on, a server in the cloud. I chose a private cloud server in the US so download speeds from the source server would be faster. Only I have access to this cloud server at the moment.
While downloading (which took a little more than 30 hours non-stop), I set up ample storage as prescribed by LulzsecPinas in a file called ‘README.txt’:
Total of 340GB when extracted, please free around 360GB before extraction, especially for comweb.sql.gz
This is the whole database leak of Comission on Elections, don't worry some of the tables are encrypted by Comelec.
But we have the algo to decrypt those data. LOL.
Leaks powered by: LulzSec Pinas
I also installed MySQL on the new server from scratch.
After the download was completed, I unzipped the files (which took about 380 gigabytes of space in total, not 340 gigabytes as stated in the instructions) and then prepared and ran scripts to load the data into MySQL.
As I write this, the non-stop process of loading 380 gigabytes worth of data into MySQL hasn’t yet completed after 3 weeks.
Exploring the data
While loading data onto MySQL, I was peeking into the different tables and files.
I saw a set of tables containing what seemed to be Overseas Absentee Voting data. Some old tables had columns such as names, date of birth, and passport number encrypted (meaning the values are garbled and cannot be understood by a person looking at them), but there were newer versions of these tables where these same columns were readable.
I felt at that moment that this was serious and I had to tell Rappler; however, they beat me to making this discovery, and most of what I found, plus much more, had already been reported.
After the data loading completes, there should be 308 data tables in total, and the largest ones involve several versions of local voters’ data.
Each version contains about 70 million records; someone told me that only about 55 million of these should be active voters, and that the other 15 million are deactivated.
More than two days after I started the initial load, the first version of the large voters table became available for querying. I saw that while there were important columns involving name, address, date of birth, and precinct, most of these were encrypted.
I recalled that Comelec had already assured the public that the sensitive data would be very difficult to decrypt. What stuck with me, though, was what LulzsecPinas said about them knowing how to decrypt the data.
During the initial load, I noticed that some of the files contained application source code. I found this weird because I rarely find instances nowadays where one would mix code and data – not impossible, but rare.
What this meant to me was that if programming logic were available, then most likely this logic would contain instructions to access the database. But the data is encrypted, which meant that the program instructions would provide hints on how to decrypt the data.
To my shock, I saw not only the functions to decrypt but to encrypt as well.
Worse, even the two critical inputs to the encryption and decryption functions were included: the “key” (which is similar to a password) and the “initialization vector” (a value that determines which set of encrypted characters appears)!
The encryption and decryption method used by the Comelec is the Advanced Encryption Standard (AES)/Rijndael algorithm. It is supposed to be almost impossible to crack, even with advanced tools and computing power.
I say "almost" because no matter how sophisticated the lock is, anyone with the key can open it. In this case, someone left the keys under the doormat.
The programming language used by the Comelec’s web developers here is PHP, which is common among web developers. Anybody with knowledge of PHP (and there are lots of such people) will be able to decrypt the data after seeing the lines of code containing the encryption and decryption routines with the two parameters.
I wrote a small program using the encryption/decryption methods found earlier. Then, I got the first record of the voters’ table, took the first name, which was encrypted, and ran it through the decryption function. I turned pale because I saw the first name in clear text!
The next challenge was to search for a specific record. How can this be done if the names and birthdays are encrypted? The solution is to encrypt the search criteria. I revised the script, this time to encrypt my first name and last name, which I then used to search the voters’ table for records matching these two encrypted values.
I saw one record, with my old address and everything else encrypted. I then tried decrypting my supposed middle name and birthday, and they all came out in clear text – all correct.
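To illustrate why encrypting the search criteria works, here is a minimal sketch in Python. It is not the Comelec code and does not use AES: Python's standard library has no AES, so a toy cipher built on SHA-256 stands in for it, and an in-memory SQLite table stands in for MySQL. The key, the initialization vector, the table layout, and the sample names are all made up for illustration. The point it shows is only this: when the same key and the same initialization vector are reused for every value, the same text always encrypts to the same garbled value, so a search can compare garbled values directly.

```python
import hashlib
import sqlite3

# Hypothetical values for illustration only; NOT the leaked key or IV.
KEY = b"example-key"
IV = b"example-iv"


def _keystream(key: bytes, iv: bytes, length: int) -> bytes:
    """Toy keystream derived from SHA-256 (a stand-in for AES, not secure)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + iv + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def encrypt(text: str, key: bytes = KEY, iv: bytes = IV) -> str:
    """Encrypt text to a hex string; same key + IV + text gives the same output."""
    data = text.encode("utf-8")
    stream = _keystream(key, iv, len(data))
    return bytes(a ^ b for a, b in zip(data, stream)).hex()


def decrypt(hex_cipher: str, key: bytes = KEY, iv: bytes = IV) -> str:
    """Reverse encrypt() using the same key and IV."""
    data = bytes.fromhex(hex_cipher)
    stream = _keystream(key, iv, len(data))
    return bytes(a ^ b for a, b in zip(data, stream)).decode("utf-8")


def build_demo_table() -> sqlite3.Connection:
    """Create a small table whose columns hold only encrypted values."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE voters (first_name TEXT, last_name TEXT, birth_date TEXT)"
    )
    rows = [  # made-up sample records
        ("Juan", "Dela Cruz", "1980-01-31"),
        ("Maria", "Santos", "1975-06-12"),
    ]
    conn.executemany(
        "INSERT INTO voters VALUES (?, ?, ?)",
        [tuple(encrypt(value) for value in row) for row in rows],
    )
    return conn


def find_voter(conn: sqlite3.Connection, first_name: str, last_name: str):
    """Encrypt the search criteria, match on ciphertext, then decrypt the hits."""
    cursor = conn.execute(
        "SELECT first_name, last_name, birth_date FROM voters "
        "WHERE first_name = ? AND last_name = ?",
        (encrypt(first_name), encrypt(last_name)),
    )
    return [tuple(decrypt(value) for value in row) for row in cursor.fetchall()]


if __name__ == "__main__":
    demo = build_demo_table()
    print(find_voter(demo, "Juan", "Dela Cruz"))
```

Had each value been encrypted with its own random initialization vector, the same name would produce a different garbled value every time, and this kind of direct lookup would not work.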
Then I searched for people I knew. Most of them came out in a similar fashion. Except for a lot of outdated values, the information was generally accurate.
After what I discovered, I wondered if others had made a similar discovery. It turned out that LulzsecPinas itself had already hinted at this the day before my discovery.
LulzsecPinas dropped more hints a few days later, indicating that it planned to give the correct key and initialization vector to someone who understood what it was talking about.
I am not an Information Security expert, and I do not carry many IT certifications.
My company and I do Enterprise Software Development, and while I am compelled to have a strong grasp of Information Security practices, I am not a security hacker by training or practice.
That I was able to figure out how to display otherwise encrypted values in clear text form should scare you because others trained in this area would have already figured this out by now as well.
I didn’t need to break through the door with special skills. The keys were already there.
Moral and ethical dilemma: To share or not to share
Rappler and I discussed whether to share this with the public and, if so, how much detail we need to share. One danger is that people might start looking for the key that I found in the data dump.
The reality is that I was not the first to discover a way to read the encrypted data.
The hackers had seen this loophole way back, and it was just a matter of time till other people knew about it as well – whether the hackers themselves published the keys and encryption/decryption methods or others figured things out on their own.
Know, not just assume, that everything is already available in clear text form. The creators of We Have Your Data, the website that allows searching for voters’ information in the leaked data, have proven it.
Unfortunately, it will take some time and effort for the Comelec to change the keys. Comelec will have to re-encrypt all the affected fields and revise the applications that use the data. It is not as simple as swapping in a new key the way one would change a password.
To be concluded: Part 2: Comelec leak – what can go wrong? – Rappler.com