Checking Web Content for Sensitive Data?
NetFiber asks: "I work as a security analyst for a large university. We have recently been tasked to scour our network in the hopes of finding and removing sensitive information such as credit card numbers, social security numbers, and such on all publicly available web servers. Our current method of analysis is to archive all the content (which often grows over 100GB) and later parse the data with various utilities and regexes that search for patterns and other pertinent information. So far, this process has proven to be rather cumbersome and time consuming. Does anyone have any experience collecting and sanitizing large amounts of web content? If so, what procedures/utilities do you use to accomplish this?"
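For readers who want a starting point for the pattern-matching step the question describes, here is a minimal Python sketch. It is not the poster's tooling. The names `SSN_RE`, `CARD_RE`, `luhn_ok`, `scan_text`, and `scan_file` are made up for illustration, and the patterns are assumptions that will produce both false positives and misses on real content. The Luhn checksum is used only to cut down on random digit runs being reported as card numbers.

```python
import re

# Assumed SSN shape: AAA-GG-SSSS, skipping area numbers that are never issued
# (000, 666, 900-999), group 00 and serial 0000.
SSN_RE = re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")

# Assumed card shape: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_text(text: str):
    """Return a list of (kind, matched_text) pairs found in text."""
    hits = [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if luhn_ok(digits):
            hits.append(("card", m.group()))
    return hits


def scan_file(path: str):
    """Yield (line_number, kind, matched_text), reading the file line by line."""
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        for lineno, line in enumerate(fh, 1):
            for kind, value in scan_text(line):
                yield lineno, kind, value
```

Reading line by line keeps memory use flat on large files, at the cost of missing a number that is split across two lines.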
Visa PCI CISP is a good set of practices (Score:5, Informative)
Of course, you're probably not interested specifically in protecting "Visa's track data" but in whatever data you consider sensitive. Applying the listed policies and practices would go a long way towards securing your resources, whatever it is you want to secure.
For a large corporation like ours, failure to comply would bring severe (and most likely business-damaging) penalties. If you're not handling card data, you won't have the same consequences, of course. What the penalties meant to us, though, is that top management made a decree: 'fix the problems and pass the audit -- we can't afford not to.' Having top-down pressure means that if we have sensitive data that we're passing to another team, we're both inclined to work together to solve the issues. If one team balks, a phone call up the pyramid gets things back on track. If your university is serious about this, a similar edict will go a long way towards cleanup.
Another boost in the direction of securing our data was hiring an external consultant to perform the audit. Our auditor is very knowledgeable about ways to follow the data: where does it enter the system, where does it go from there, who writes it to disk, why do they save it, and do they have a business need to save it? Can the data be eliminated? Can a token be substituted for the data? Can the data be truncated? If not, can it at least be masked on reports where the details aren't needed?
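To make the auditor's questions about masking and token substitution concrete, here is a small Python sketch. It is an illustration under assumptions, not what our auditor or PCI prescribes: `mask_pan` and `tokenize` are hypothetical helper names, and the token here is just a keyed hash (HMAC-SHA256), which is a simpler stand-in for a real token vault.

```python
import hashlib
import hmac


def mask_pan(pan: str) -> str:
    """Keep only the last four digits for display on reports."""
    digits = "".join(ch for ch in pan if ch.isdigit())
    return "*" * (len(digits) - 4) + digits[-4:]


def tokenize(secret_key: bytes, value: str) -> str:
    """Return a stable surrogate for value that cannot be reversed without a lookup.

    Assumption: a keyed hash is acceptable as a token. The same input always
    yields the same token, so records can still be joined on it.
    """
    digits = "".join(ch for ch in value if ch.isdigit())
    return hmac.new(secret_key, digits.encode("ascii"), hashlib.sha256).hexdigest()
```

For example, `mask_pan("4111 1111 1111 1111")` gives `************1111`, which is usually all a report needs.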
As far as specifics go, each development and maintenance director's pyramid was required to assign a manager to own the PCI process. Each team had to go through their code, identify sensitive data, and take steps to protect it. They also had to go to the data owners, and have them redact their archives.
It's huge. But given the security breaches that are almost a daily occurrence, we can't afford not to.
Re:Visa PCI CISP is a good set of practices (Score:3, Insightful)
PCI/CISP does have software process recommendations for securing credit card data, but it's largely recommendations for people processes and facility processes.
I believe the original requestor is asking about software to help automate/speed-up monitoring and scanning of content that's being put up on web sites by staff and/or students.
Re:Visa PCI CISP is a good set of practices (Score:5, Insightful)
I know what he's asking for, and I answered with what it takes to make it happen for real. The answer is that the various teams storing the data need to be held accountable for storing it securely. Just grepping for and deleting a database holding SSNs isn't enough -- his university has to make sure that all the teams are educated to neither ask for nor store SSNs. They'll also benefit from a cohesive policy that gives specifics, such as "replace SSNs with student ID numbers."
If this is just some security manager saying "go find SSNs and wipe 'em out" then they're up the creek. For every database they clean up, someone else will have created a new one. They'll be ignored and stonewalled by teams who have neither the time nor the budget to comply. This sort of thing has to come down from the board of regents, and they have to put the responsibility on everyone, otherwise they're just pissing in the wind.
Re:Visa PCI CISP is a good set of practices (Score:2)
You know this? Wow, you've got pretty good between-the-lines vision, because he sure didn't say that in TFQ.
What I read from TFQ is "help me scan for bad data" and replied with "you can scan for bad data until the cows come home, but until you have a big stick to smack future violators you will have accomp
Re:Visa PCI CISP is a good set of practices (Score:3, Interesting)
I completely agree. We had to do this when I was contracting for the government a number of years ago. Even in the databases at the time there was a veritable cornucopia of plain ASCII characters stored where nowadays we know that those types of data should be at least encrypted, and probably not stored in a column called (in plain text) SSN or some such thing.
<offtopic_sidebar>Ironicall
Re:Visa PCI CISP is a good set of practices (Score:2)
There are lots of articles online about writing crawlers and search engines using "off-the-shelf" components such as stuff found on Perl CPAN.
Once you have a basic crawler working (should be easy), have it look for regex patterns matching SSN's, CC #'s,
Which is entirely the wrong approach (Score:5, Informative)
And I never cease to be amazed by the sheer number of people sharing that belief that there's some magical amulet (uber-security program/appliance/whatever) that you can just tack onto a site and make it auto-magically secure.
Unfortunately that kind of thinking is outright counter-productive. It's dangerous. It's the kind of thinking that breeds such disasters as "we use SSL, so we're secure." (Shame that someone uploaded confidential documents on the web site anyway, so they can be downloaded by anyone. _Securely_ downloaded, to be sure;) Or "we have a Snake Oil (TM) gateway that can scan SOAP requests, so we're secure." (Shame that no one actually configured the rules for it, though. Or shame that the Web front-end there allowed users to escalate their privileges _before_ it all got packed in a SOAP request: the gateway can't detect whether it's genuinely a site admin or a regular user who escalated their privileges.) Or "we have a hardened Single Sign-On front-end in front of the servers, enforcing login and access rights, so we're secure." (Shame that, literally, one application allowed users to escalate their privileges and see any content, by just editing the URL. E.g., someone could edit the admin's password by just editing the admin's user ID in the URL for the password change page, _then_ properly log in as the admin through that hardened SSO front-end. Literally. I'm not making it up.) Etc.
But to address your actual point: content scanners aren't the answer, or rather are a bad and incomplete answer. E.g., I've seen one company deploy such a thing in front of the back-end, in their case to supposedly protect against SQL injection in the front-end. So it rejected anything that looked like an SQL keyword. Should be secure, right? But what do you do if it's not as secure or well-programmed as you think? E.g., the thing would cause a form submit to fail if you wrote something like "Visa Select" in a field, because it contained "select", but actually failed to protect against actual SQL injection using the quote sign, or XSS injection using the greater-than and less-than signs.
Worse yet, it encouraged everyone to be lax and not bother thinking about security or doing a code review, because, hey, they have the magical amulet on the backend. Even worse, it encouraged managers to not allocate time or resources for an actual security review.
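To illustrate the failure mode of that keyword filter, here is a rough Python reconstruction. It is not that company's actual product: `BLOCKED`, `naive_filter_allows`, `make_db`, and `find_by_brand` are invented names, and the table is a toy. The point is only that a keyword blacklist rejects harmless input while letting a quote-based payload through, whereas a parameterized query treats the payload as plain data.

```python
import sqlite3

# Hypothetical blacklist of the kind described above.
BLOCKED = ("select", "insert", "update", "delete", "drop")


def naive_filter_allows(value: str) -> bool:
    """Reject any input containing an SQL keyword, regardless of context."""
    lowered = value.lower()
    return not any(word in lowered for word in BLOCKED)


def make_db() -> sqlite3.Connection:
    """Build a toy in-memory table for the demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cards (name TEXT, brand TEXT)")
    conn.execute("INSERT INTO cards VALUES (?, ?)", ("Visa Select", "visa"))
    return conn


def find_by_brand(conn: sqlite3.Connection, brand: str):
    """Parameterized query: the driver passes brand as data, never as SQL."""
    return conn.execute(
        "SELECT name FROM cards WHERE brand = ?", (brand,)
    ).fetchall()
```

Here `naive_filter_allows("Visa Select")` is False (a false positive), while `naive_filter_allows("' OR '1'='1")` is True (the injection attempt sails past). With `find_by_brand`, that same payload simply matches no rows.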
Security isn't about magical amulets, it's "holistic", so to speak. The security chain is literally as weak as the weakest link. People need to be educated to actually sit and think about the whole and about every single piece and scenario, not to throw in a couple of +5 Security amulets and call it a day. Throwing in the towel and relying on some magical amulet which somehow makes it all secure just because it's there, is actually the antonym and nemesis of security.
Even if such appliances and programs are used, someone needs to sit and think about how they're used, how they affect their own program, what they prevent, and most importantly what they _don't_ prevent. What data and how does it prevent from being stolen, and what happens when (not if) someone _does_ get through. E.g., what data you shouldn't be collecting in the first place anyway, because you don't actually need it. (If it's not there at all, it can't get stolen.) And most often the right thing to do is _not_ to rely on them: they're there as a last ditch defense, that can't catch everything, but it's one last chance to _maybe_ catch something that got through the other layers of defense. Not as a replacement for the other layers.
And teams and managers need to be educated that they _need_ to do just that: sit and do a proper analysis. And not just the technical implementation parts, but also, yes, the people processes involved. E.g., if a process can w
Re:Which is entirely the wrong approach (Score:2)
Re:Visa PCI CISP is a good set of practices (Score:2)
Re:Visa PCI CISP is a good set of practices (Score:2)
Yeah, I know, and that's the problem.
Unfortunately, the right answer is probably a big stick attached to the end of the policy. "Our policy is one of zero tolerance. If you violate these rules you will be fired, tenure notwithstanding. We have to protect our students first and our reputation second, and nothing else, including your convenience, your research, your history, your prominence in your field, your title, or your budget is justification for violation o
Re:Visa PCI CISP is a good set of practices (Score:2)
Or, try a way to prevent it leaking out as well. (Score:3, Interesting)
Some commercial vendors, e.g. Citrix (Teros), Imperva, etc., offer stuff like this in an appliance, and there has to be some sort of thing you could do with Apache and OSS stuff as well, depending on your needs. It might not catch everything but hey, your code base is always changing and a one-time audit might not find a problem that shows up six months after the audit is done. Some sort of preventative measure working hand-in-hand with regular audits is probably your best bet in the long run.
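As a rough idea of what such an outbound filter does, here is a Python WSGI middleware sketch. It is not how the Citrix or Imperva appliances work, and it is not an Apache module. `LeakFilter` and `SSN_RE` are hypothetical names, the SSN pattern is an assumption, and the whole response is buffered in memory, which would not suit large downloads.

```python
import re

# Assumed pattern for SSN-like strings in a response body (bytes).
SSN_RE = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")


class LeakFilter:
    """WSGI middleware that blocks responses whose body matches a pattern."""

    def __init__(self, app, patterns=(SSN_RE,)):
        self.app = app
        self.patterns = patterns

    def __call__(self, environ, start_response):
        captured = {}
        chunks = []

        def capture(status, headers, exc_info=None):
            # Hold back the status line until the body has been inspected.
            captured["status"] = status
            captured["headers"] = headers
            return chunks.append  # stands in for the legacy write() callable

        result = self.app(environ, capture)
        try:
            chunks.extend(result)
        finally:
            if hasattr(result, "close"):
                result.close()
        body = b"".join(chunks)

        if any(p.search(body) for p in self.patterns):
            msg = b"Response blocked by content filter.\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(msg))),
            ])
            return [msg]

        # Pass the original response through with a recomputed length.
        headers = [(k, v) for k, v in captured["headers"]
                   if k.lower() != "content-length"]
        headers.append(("Content-Length", str(len(body))))
        start_response(captured["status"], headers)
        return [body]
```

You would wrap an existing application with something like `app = LeakFilter(app)`. As the comment says, this kind of thing catches some leaks, not all of them, so it complements audits rather than replacing them.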
Re:Or, try a way to prevent it leaking out as well (Score:3, Interesting)
Of course, storing the information again and then searching it is pretty silly. You don't want to know what used to be out there, you want to know what's currently out there and as a bonus, it's already taking up storage space somewhere, so why duplicate it? In order to "copy" it, you're going to take
Re:Or, try a way to prevent it leaking out as well (Score:2)
mod_security for Apache can do exactly this sort of regex matching and serve up an error page if a match is found. The logs are pretty easy to grep to find occurrences of a match and hence track the data down.
Re:Or, try a way to prevent it leaking out as well (Score:1)
The answer is simple (Score:5, Funny)
Given enough time, some industrious hacker will find all the data for you.
Then, when you read the Slashdot article titled "[Name of Your Company] Leaks Private Data", you'll know exactly where the pertinent files are.
At that point you can take care of them. The payout for the privacy lawsuits will probably end up being less than the cost in man-hours to do the job semi-manually. In the end, you'll still come out on top. (Though there is the off-chance that your company and your replacement will come out on top...)
Re:The answer is simple (Score:2, Insightful)
I think the OP may be hoping for that, since they're posting on Slashdot and have disclosed the identity of the university just as cleverly as any redacted PDF would.
Re:The answer is simple (Score:2)
So, hoping for that might be exactly what he(they) wants. The man will strike hard.
Re:The answer is simple (Score:2)
Re:The answer is simple (Score:2)
johnny i hack stuff (Score:3, Insightful)
Re:johnny i hack stuff (Score:1)
Look at the images too (Score:4, Interesting)
Another NSA troll looking for tech help on /. (Score:4, Funny)
Dear Sir... (Score:5, Funny)
Kindly forward us the backup tapes with your data as well as a representative list of personal data you are striving to secure (such as student SS#, birth dates, Mother Maiden Names, corporate purchase cards, etc.) and we will promptly perform the audit for you.
This is absolutely legal, and you will be allowed to keep 10% of whatever we find.
[no, no it's a joke, dammit!]
SQL Server backups (Score:3, Interesting)
Like this [google.com.au]
Download, restore, maybe find something useful...
McAfee/Foundstone's free SiteDigger (Score:2, Informative)
ok, this is easy (Score:1, Funny)
VSDB (Score:2)
here's a device that does just that: (Score:2)
One tip (Score:1, Insightful)
Is your university contributing to the students' Social Security accounts for some unknown reason? If not, there's no legitimate reason for the school to continue to use students' Social Security information.
Same with birth dates. In grade school, along with permanent records, we were assigned a s
Re:One tip (Score:1)
That will work great, assuming everybody pays their own tuition. Because if anybody wants any Title IV aid (grants, student loans, etc.), pretty much the first entry that has to be submitted on every form required by, and every record transmitted to and from the government is
What world do you live in? (Score:2)
mod_security (Score:1)
Then I would recommend the mod_security module for Apache http://www.modsecurity.org/ [modsecurity.org] It will scan any POST requests for banned patterns. You could leverage the regexes you already wrote to scan the content in the first place.
I think mod_security does what the FS and McAfee appliances do at a much better (free as in beer) pri
Visual Web Mining Toolkit (Score:1)
I don't envy you (Score:1)