
Checking Web Content for Sensitive Data?

NetFiber asks: "I work as a security analyst for a large university. We have recently been tasked to scour our network in the hopes of finding and removing sensitive information such as credit card numbers, social security numbers, and such on all publicly available web servers. Our current method of analysis is to archive all the content (which often grows over 100GB) and later parse the data with various utilities and regexes that search for patterns and other pertinent information. So far, this process has proven to be rather cumbersome and time consuming. Does anyone have any experience collecting and sanitizing large amounts of web content? If so, what procedures/utilities do you use to accomplish this?"
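For the pattern-matching stage the submitter describes, one common refinement is to validate card-number candidates with the Luhn checksum before reporting them, which weeds out most random digit runs. A minimal sketch (the patterns are illustrative placeholders, not the submitter's actual regexes, and would need tuning against real data):

```python
import re

# Illustrative candidate patterns -- real scans will need broader variants.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CC_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_ok(number: str) -> bool:
    """Luhn checksum: filters out digit runs that merely look card-shaped."""
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    digits.reverse()
    total = 0
    for i, d in enumerate(digits):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def scan(text: str) -> dict:
    """Return SSN-shaped matches, and card-shaped matches that pass Luhn."""
    return {
        "ssn": SSN_RE.findall(text),
        "cc": [m for m in CC_RE.findall(text) if luhn_ok(m)],
    }
```

Running this over archived pages instead of raw regex grepping cuts the manual review pile considerably, since almost no random 16-digit strings pass the checksum.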
This discussion has been archived. No new comments can be posted.

  • by plover ( 150551 ) * on Wednesday June 28, 2006 @11:34PM (#15625669) Homepage Journal
    As a large merchant that handles Visa card numbers, we have to undergo an annual Visa PCI CISP audit. The questions are pretty thorough, and if you can fully pass the audit you can tell management that you've reduced your risk of exposure. The link to the pages is here: CISP [visa.com].

    Of course, you're probably not interested specifically in protecting "Visa's track data" but in whatever data you consider sensitive. Applying the listed policies and practices would go a long way towards securing your resources, whatever it is you want to secure.

    As a large corporation, failure to comply would mean the penalties would be severe (and most likely business-damaging.) If you're not handling card data, you won't have the same consequences, of course. What the penalties meant to us, though, is that top management made a decree: 'fix the problems and pass the audit -- we can't afford not to.' Having top-down pressure means that if we have sensitive data that we're passing to another team, we're both inclined to work together to solve the issues. If one team balks, a phone call up the pyramid gets things back on track. If your university is serious about this, a similar edict will go a long way towards cleanup.

    Another boost in the direction of securing our data was hiring an external consultant to perform the audit. Our auditor is very knowledgeable about ways to follow the data: where does it enter the system, where does it go from there, who writes it to disc, why do they save it, and do they have a business need to save it? Can the data be eliminated? Can a token be substituted for the data? Can the data be truncated? If not, can it at least be masked on reports where the details aren't needed?

    As far as specifics go, each development and maintenance director's pyramid was required to assign a manager to own the PCI process. Each team had to go through their code, identify sensitive data, and take steps to protect it. They also had to go to the data owners, and have them redact their archives.

    It's huge. But given the security breaches that are almost a daily occurrence, we can't afford not to.

    • PCI/CISP does have software process recommendations for securing credit card data, but it's largely recommendations for people processes and facility processes.

      I believe the original requestor is asking about software to help automate/speed-up monitoring and scanning of content that's being put up on web sites by staff and/or students.

      • by plover ( 150551 ) * on Thursday June 29, 2006 @12:05AM (#15625737) Homepage Journal
        I believe the original requestor is asking about software to help automate/speed-up monitoring and scanning of content that's being put up on web sites by staff and/or students.

        I know what he's asking for, and I answered with what it takes to make it happen for real. The answer is the various teams that are storing the data need to be held accountable for storing it securely. Just grepping for and deleting a database holding SSNs isn't enough -- his university has to make sure that all the teams are educated to not ask for nor store SSNs. They'll also benefit from a cohesive policy that gives specifics, such as "replace SSNs with student ID numbers."
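The "replace SSNs with student ID numbers" policy amounts to token substitution: hand out a stable surrogate so the SSN itself never needs to be stored downstream. A hypothetical sketch (the in-memory dict stands in for what would really be a locked-down database table):

```python
import itertools

# Hypothetical mapping store; production would persist this in a protected DB.
_ssn_to_id: dict[str, str] = {}
_counter = itertools.count(1000000)

def student_id_for(ssn: str) -> str:
    """Return a stable surrogate student ID for an SSN.

    Systems downstream only ever see the surrogate, so cleaning them up
    later never requires grepping for nine-digit patterns.
    """
    if ssn not in _ssn_to_id:
        _ssn_to_id[ssn] = f"S{next(_counter)}"
    return _ssn_to_id[ssn]
```

The point of the one-way mapping is that only the single protected store can go from surrogate back to SSN, which is exactly the "can a token be substituted for the data?" question the auditors ask.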

        If this is just some security manager saying "go find SSNs and wipe 'em out" then they're up the creek. For every database they clean up, someone else will have created a new one. They'll be ignored and stonewalled by teams who have neither the time nor the budget to comply. This sort of thing has to come down from the board of regents, and they have to put the responsibility on everyone, otherwise they're just pissing in the wind.

        • ...also benefit from a cohesive policy that gives specifics, such as "replace SSNs with student ID numbers."
          I completely agree. We had to do this when I was contracting for the government a number of years ago. Even in the databases at the time there was a veritable cornucopia of plain ASCII data stored where nowadays we know those kinds of data should be at least encrypted, and probably not stored in a column called (in plain text) SSN or some such thing.

      • ISTM that the kind of application needed would be a (customized) crawler / search engine.

        There are lots of articles online about writing crawlers and search engines using "off-the-shelf" components such as stuff found on Perl CPAN.

        Once you have a basic crawler working (should be easy), have it look for regex patterns matching SSN's, CC #'s, ..., and then log or save the offending URL's/pages.
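A minimal version of that crawl-and-grep loop, sketched in Python with stdlib parts rather than the CPAN components the comment mentions (the patterns are placeholders; a real crawler would also honor robots.txt and rate-limit itself):

```python
import re
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Placeholder patterns; swap in whatever the audit actually targets.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "cc": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
}

class LinkParser(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=100):
    """Breadth-first crawl of one host, logging URLs whose content matches."""
    host = urlparse(start_url).netloc
    seen, queue, hits = set(), [start_url], []
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen or urlparse(url).netloc != host:
            continue  # stay on our own host
        seen.add(url)
        try:
            page = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue
        for label, pat in PATTERNS.items():
            if pat.search(page):
                hits.append((url, label))
        parser = LinkParser()
        parser.feed(page)
        queue.extend(urljoin(url, link) for link in parser.links)
    return hits
```

Scanning in-flight like this avoids archiving 100GB of content first: only the offending URLs get logged for follow-up.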

      • by Moraelin ( 679338 ) on Thursday June 29, 2006 @05:03AM (#15626525) Journal

        I believe the original requestor is asking about software to help automate/speed-up monitoring and scanning of content that's being put up on web sites by staff and/or students.

        And I never cease to be amazed by the sheer number of people sharing the belief that there's some magical amulet (uber-security program/appliance/whatever) that you can just tack onto a site and make it auto-magically secure.

        Unfortunately that kind of thinking is outright counter-productive. It's dangerous. It's the kind of thinking that breeds such disasters as "we use SSL, so we're secure." (Shame that someone uploaded confidential documents on the web site anyway, so they can be downloaded by anyone. _Securely_ downloaded, to be sure;) Or "we have a Snake Oil (TM) gateway that can scan SOAP requests, so we're secure." (Shame that no one actually configured the rules for it, though. Or shame that the Web front-end there allowed users to escalate their privileges _before_ it all got packed in a SOAP request: the gateway can't detect whether it's genuinely a site admin or a regular user who escalated their privileges.) Or "we have a hardened Single Sign-On front-end in front of the servers, enforcing login and access rights, so we're secure." (Shame that, literally, one application allowed users to escalate their privileges and see any content, by just editing the URL. E.g., someone could edit the admin's password by just editing the admin's user ID in the URL for the password change page, _then_ properly log in as the admin through that hardened SSO front-end. Literally. I'm not making it up.) Etc.

        But to address your actual point: content scanners aren't the answer, or rather are a bad and incomplete answer. E.g., I've seen one company deploy such a thing in front of the back-end, in their case to supposedly protect against SQL injection in the front-end. So it rejected anything that looked like an SQL keyword. Should be secure, right? But what do you do if it's not as secure or well-programmed as you think? E.g., the thing would cause a form submit to fail if you wrote something like "Visa Select" in a field, because it contained "select", but failed to protect against actual SQL injection using the quote sign, or XSS injection using the greater-than and less-than signs.
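That "Visa Select" failure mode is easy to reproduce. A deliberately naive keyword blacklist (a sketch of the broken approach the comment describes, not a recommendation) flags harmless text while waving through a classic quote-based injection:

```python
import re

# The broken approach: reject any field containing an SQL keyword.
SQL_KEYWORDS = re.compile(r"\b(select|insert|update|delete|drop)\b", re.I)

def naive_filter_blocks(field: str) -> bool:
    """True if the blacklist would reject this input."""
    return bool(SQL_KEYWORDS.search(field))

# False positive: legitimate customer input gets rejected.
assert naive_filter_blocks("Visa Select")
# False negative: a quote-based injection contains no blacklisted keyword.
assert not naive_filter_blocks("' OR '1'='1")
```

Which is the point: the filter punishes users and lulls developers, while the actual attack surface (unescaped quotes, missing parameterized queries) is untouched.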

        Worse yet, it encouraged everyone to be lax and not bother thinking about security or doing a code review, because, hey, they have the magical amulet on the backend. Even worse, it encouraged managers not to allocate time or resources for an actual security review.

        Security isn't about magical amulets, it's "holistic", so to speak. The security chain is literally as weak as the weakest link. People need to be educated to actually sit and think about the whole and about every single piece and scenario, not to throw in a couple of +5 Security amulets and call it a day. Throwing in the towel and relying on some magical amulet which somehow makes it all secure just because it's there, is actually the antonym and nemesis of security.

        Even if such appliances and programs are used, someone needs to sit and think about how they're used, how they affect their own program, what they prevent, and most importantly what they _don't_ prevent. What data does it prevent from being stolen, and how, and what happens when (not if) someone _does_ get through? E.g., what data you shouldn't be collecting in the first place anyway, because you don't actually need it. (If it's not there at all, it can't get stolen.) And most often the right thing to do is _not_ to rely on them: they're there as a last ditch defense, that can't catch everything, but it's one last chance to _maybe_ catch something that got through the other layers of defense. Not as a replacement for the other layers.

        And teams and managers need to be educated that they _need_ to do just that: sit and do a proper analysis. And not just the technical implementation parts, but also, yes, the people processes involved. E.g., if a process can w

        • I get damn near all the industry publications that exist, and the advertisements in them, as well as more than a few articles, encourage the belief in that magical amulet of security +5. As we both know, security is a process, or actually a collection of processes. I like to think of it as consisting of three items:
          • Security by design - security has to be engineered into the design from the very beginning, not tacked on after the fact.
          • Security by policy - policies must be put in place and enforced to ens
      • You're exactly right and completely wrong: his biggest problem is that he's surrounded by people smart enough to do really stupid things. One of these smart people is going to decide that they can secretly get their data from anywhere by putting it on a web server without a link; that way only they will know (well, him and his two undergrad assistants, and of course they'll tell their girlfriends), then mysteriously the link gets posted to LeetHaxor.ru and of course google crawls leethaxor.ru, then the whole world
        • You're exactly right and completely wrong

          Yeah, I know, and that's the problem.

          Unfortunately, the right answer is probably a big stick attached to the end of the policy. "Our policy is one of zero tolerance. If you violate these rules you will be fired, tenure notwithstanding. We have to protect our students first and our reputation second, and nothing else, including your convenience, your research, your history, your prominence in your field, your title, or your budget is justification for violation o

  • by rdunnell ( 313839 ) * on Thursday June 29, 2006 @12:02AM (#15625729)
    If you can do a regex of what you are looking for, you might be able to put some infrastructure in front of your web apps that controls what goes out.

    Some commercial vendors, e.g. Citrix (Teros), Imperva, etc., offer stuff like this in an appliance, and there has to be some sort of thing you could do with Apache and OSS stuff as well, depending on your needs. It might not catch everything, but hey, your code base is always changing and a one-time audit might not find a problem that shows up six months after the audit is done. Some sort of preventative measure working hand-in-hand with regular audits is probably your best bet in the long run.
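One software-level way to sketch that outbound filter is a response-scrubbing middleware (a hypothetical WSGI example; the commercial appliances do the equivalent at the network layer, and a real deployment would also need to correct Content-Length after rewriting):

```python
import re

# SSN-shaped byte pattern; extend with whatever regexes the audit defines.
SENSITIVE = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")

class ScrubMiddleware:
    """Mask anything SSN-shaped in outgoing response bodies."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        chunks = self.app(environ, start_response)
        # Replace each match with a fixed-width mask of the same length.
        return [SENSITIVE.sub(b"XXX-XX-XXXX", chunk) for chunk in chunks]
```

As the parent says, this is a backstop rather than a fix: it stops the number leaving the wire, but the regular audits still have to find and remove the data at rest.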

    • Network filtering would be useful as a proactive preventative, but that's going to cause a serious network slowdown in most large environments while at the same time not catching the root causes of the problem.

      Of course, storing the information again and then searching it is pretty silly. You don't want to know what used to be out there, you want to know what's currently out there and as a bonus, it's already taking up storage space somewhere, so why duplicate it? In order to "copy" it, you're going to take
    • If you can do a regex of what you are looking for, you might be able to put some infrastructure in front of your web apps that controls what goes out.
      Interesting idea... but I'd do it as an adjunct as you suggest.

      mod_security for Apache can do exactly this sort of regex matching and serve up an error page if a match is found. The logs are pretty easy to grep to find occurrences of a match and hence track the data down.
    • Palisade Systems [palisadesys.com] offers just such an appliance. Notably, it's built on top of FOSS, easy to install in many configurations, scales very well, and easy to administer (with a kickass web-based interface). It has a swiss-army set of tools you can get with it, including URL filtering, Credit Card matching, and other sensitive data matching. Full Disclosure: I work for Palisade Systems.

  • by halcyon1234 ( 834388 ) <halcyon1234@hotmail.com> on Thursday June 29, 2006 @12:05AM (#15625738) Journal
    Do nothing.

    Given enough time, some industrious hacker will find all the data for you.

    Then, when you read the Slashdot article titled "[Name of Your Company] Leaks Private Data", you'll know exactly where the pertinent files are.

    At that point you can take care of them. The payout to the privacy lawsuits will probably end up being less than the cost in man-hours to do the job semi-manually. In the end, you'll still come out on top. (Though there is the off-chance that your company and your replacement will come out on top...)

    • Do nothing. Given enough time, some industrious hacker will find all the data for you.

      I think the OP may be hoping for that, since they're posting on Slashdot and have disclosed the identity of the university just as cleverly as any redacted PDF would.

      • Maybe it is actually an elaborate honeypot designed to initiate a sting that hopes to capture most of the world's hackers in one swoop.

        So, hoping for that might be exactly what he(they) wants. The man will strike hard.
    • [Name of your company] == University of Connecticut by any chance? Or is that too much of a coincidence?
    • I don't know why this is rated funny 'cause this is precisely what many (hell, most!) companies use as their policy today. Just ask any serious security professional and they will tell you the same.
  • by wwest4 ( 183559 ) on Thursday June 29, 2006 @12:06AM (#15625743)
    JIHS [ihackstuff.com] comes to mind.
  • by dbIII ( 701233 ) on Thursday June 29, 2006 @01:05AM (#15625926)
    One amusing situation was when the head of Australia's nuclear agency was very vocal in his criticism of Google's satellite images because a low-detail image of his facility was visible there; he actually played the "terrorism" card in his criticism. Meanwhile, the front page of his organisation's website had a much more detailed and more up-to-date aerial photograph of the same facility.
  • by mrgodzilla ( 730416 ) on Thursday June 29, 2006 @01:14AM (#15625954) Homepage
    Dude.. we know who you work for.. really.
  • Dear Sir... (Score:5, Funny)

    by megaditto ( 982598 ) on Thursday June 29, 2006 @01:44AM (#15626019)
    Our Nigerian IT minister has tasked us with providing free support to the US universities.

    Kindly forward us the backup tapes with your data as well as a representative list of personal data you are striving to secure (such as student SS#, birth dates, Mother Maiden Names, corporate purchase cards, etc.) and we will promptly perform the audit for you.
    This is absolutely legal, and you will be allowed to keep 10% of whatever we find.

    [no, no it's a joke, dammit!]
  • SQL Server backups (Score:3, Interesting)

    by Centurix ( 249778 ) <centurix.gmail@com> on Thursday June 29, 2006 @04:38AM (#15626455) Homepage
    If you're familiar with SQL Server and its method of creating backup files, you can actually find quite a number of backup files just using Google. The format is documented in the Microsoft Tape Format guide, which shows the block magic numbers; these can be quite useful.

    Like this [google.com.au]

    Download, restore, maybe find something useful...
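The flip side of this tip is sweeping your own shares for stray backups before Google finds them. Per the Microsoft Tape Format guide the comment cites, a native SQL Server backup begins with the ASCII signature TAPE; a sketch using that (verify the signature against the guide for your version, and note the helper names here are made up):

```python
import os

# Per the Microsoft Tape Format guide; confirm for your SQL Server version.
MTF_SIGNATURE = b"TAPE"

def looks_like_sql_backup(path: str) -> bool:
    """Check the first four bytes for the MTF TAPE-header signature."""
    with open(path, "rb") as f:
        return f.read(4) == MTF_SIGNATURE

def find_backups(root: str):
    """Walk a directory tree and yield files that carry the MTF signature."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                if looks_like_sql_backup(path):
                    yield path
            except OSError:
                continue  # unreadable file; skip
```

Checking magic bytes rather than filenames catches the backups someone "hid" by renaming them to something innocuous.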
  • by Anonymous Coward
    Have all students put their credit card numbers, SSNs and mother's maiden names in a database. Then you can grep -v your web content. Done!
  • My company has a vectorspace engine that can help you classify docs that are related. given a SQL query you should be able to find related information. We'd be happy to help you build something, or help you through the build process. It works under windows, linux, and we just completed eSeries, iSeries and zSeries certification through IBM's chiphopper program (we haven't updated the website yet). Click through on my website link for more info.
  • It's expensive, complex, and will take at least a week to set up, but one of these [f5.com] will scrub all traffic for things like SSNs and other pattern-matchable data inside HTTP packets and other TCP traffic.
  • One tip (Score:1, Insightful)

    by Anonymous Coward
    is to stop using Social Security numbers. Another is to stop using Social Security numbers. Yet another is to stop using Social Security numbers. And yet another is to stop using Social Security numbers.

    Is your university contributing to the students' Social Security accounts for some unknown reason? If not, there's no legitimate reason for the school to continue to use students' Social Security information.

    Same with birth dates. In grade school, along with permanent records, we were assigned a s
    • stop using Social Security Numbers.

      That will work great, assuming everybody pays their own tuition. Because if anybody wants any Title IV aid (grants, student loans, etc.), pretty much the first entry that has to be submitted on every form required by, and every record transmitted to and from the government is ...wait for it... the SSN. Anybody got an Act of Congress handy? :)
    • SSNs are essential for extending credit (credit reporting), which most universities do. They are also needed for accessing financial aid (VA, Federal Student Loans, etc).
  • You could use your in-house search engine (or a Google appliance if you're lucky) to find any existing content, or I suppose your current system of crawling, parsing and regexes would suffice.

    Then I would recommend the mod_security module for Apache http://www.modsecurity.org/ [modsecurity.org]. It will scan any POST requests for banned patterns. You could leverage the regexes you already wrote to scan the content in the first place.

    I think mod_security does what the F5 and McAfee appliances do at a much better (free as in beer) pri
  • If you have access to a MacOS X box, Anthracite Web Mining Desktop toolkit http://www.metafy.com/ [metafy.com] can do this kind of work for you. It's currently being used by customers on four continents to build daily custom reports from large volumes of web based data, like the SEC Edgar filings. It's based on a visual user interface that allows non-programmers to quickly and easily create high value web data processing systems. If you need to automate running a grip of regexen against thousands of webpages daily, you
  • I too work at a large university. I don't know if your experience is similar to mine. If it is, then given you're even posing this question I bet your university cannot formally define what is considered restricted or sensitive data. Some things are easy, like SSN. Some things are not. There are lots of grey areas. There are lots of kinds of data at a university, and there are potentially dozens or more formal audit requirements that might need to be met in some cases, but not others. Sometimes a gi
