
Overcoming big data security challenges in cloud environments

Expert Dave Shackleford distills advice from the CSA on the most pressing big data security challenges for enterprises in cloud environments.

As a concept, big data is hardly new; many organizations have collected and used large quantities of data for decades. The idea of big data has truly taken off in recent years though, in large part because organizations of all sizes and budgets now have access to infrastructure via the cloud that enables big data opportunities. While new opportunities are great for business, it's still not clear whether many organizations are thinking about the security implications of big data projects.

In June, the Cloud Security Alliance (CSA) Big Data Working Group released its expanded "Top Ten Big Data Security and Privacy Challenges" document, which details the types of security and privacy issues facing large, diverse and less structured data sets (collectively dubbed big data) in cloud service environments. With all the hype behind big data today, what can enterprise consumers struggling with big data security issues take away from this report?

In this tip, we'll distill some of the document's findings on the top ten big data security challenges in cloud environments, with pointers provided on what organizations should do to ensure their big data implementations are secure.

Modeling the security risks


Before delving into the individual risks associated with big data in the cloud, it is worth noting one of the most immediately useful aspects of the CSA Big Data Working Group's effort: its breakdown of risks into a simple architectural model. The model outlines where the data is being processed and stored, and includes the big data sources, processing clusters and endpoint consumers of the data (systems, mobile devices, etc.), along with the cloud environments where processing and storage occur. In addition, the model shows a simple directional flow of the data as it moves through this ecosystem, which can be useful for enterprises looking to understand what big data really means to them in the context of cloud computing.

The working group also breaks the risks down into four categories: infrastructure security (secure computations and nonrelational data stores); data privacy (cryptography, access controls and privacy for analytics and data mining); data management (auditing and secure data storage, as well as provenance metadata [data source validation and trustworthiness]); and integrity and reactive security (endpoint validation and real-time security monitoring).

By utilizing these categories, enterprises can determine where the major risks fit into their existing security controls architecture.

Big data security challenges

To develop its documentation, the CSA working group interviewed CSA members and analyzed publications and trade journals; the result was a list of the top ten security and privacy challenges associated with big data. The following list details each challenge, along with the key considerations on which most organizations should focus their efforts:

Secure computations in distributed programming frameworks. The first identified risk digs into the security of computational elements in frameworks such as MapReduce, with two specific security concerns outlined. First, the trustworthiness of the "mappers," which are the code that breaks data into pieces, analyzes it and outputs key-value pairs, needs to be evaluated. Second, data sanitization and de-identification capabilities need to be implemented to prevent the storage or leakage of sensitive data from the platform. Enterprises using complex tools such as MapReduce will need to use controls such as Mandatory Access Controls within SELinux and de-identifier routines to accomplish this; on the same note, enterprises should ask how cloud providers are controlling and remediating this issue in their environments.
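To make the de-identification point concrete, here is a minimal Python sketch of a mapper that only ever emits key-value pairs from a scrubbed copy of each record. It is an illustration under stated assumptions, not a routine taken from the CSA document: the field names (`user_id`, `name`, `region`, `comment`), the `PSEUDONYM_KEY` secret and the email pattern are all hypothetical, and a real job would run inside a framework such as Hadoop rather than as a plain generator.

```python
import hashlib
import hmac
import re

# Hypothetical secret held by the job operator, not by the mapper's author.
PSEUDONYM_KEY = b"replace-with-a-managed-secret"

# Simple, assumed pattern for spotting email addresses in free text.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(value):
    """Replace an identifier with a keyed hash so records remain joinable."""
    digest = hmac.new(PSEUDONYM_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def deidentify(record):
    """Return a copy of the record with direct identifiers removed or masked."""
    clean = dict(record)
    clean["user_id"] = pseudonymize(record["user_id"])
    clean.pop("name", None)  # drop fields the analysis does not need
    clean["comment"] = EMAIL_PATTERN.sub("[email]", record.get("comment", ""))
    return clean


def mapper(record):
    """Emit (key, value) pairs computed from the de-identified record only."""
    clean = deidentify(record)
    yield clean["region"], 1


raw = {
    "user_id": "u-1001",
    "name": "Pat Doe",
    "region": "emea",
    "comment": "contact me at pat@example.com",
}
print(deidentify(raw))
print(list(mapper(raw)))
```

Using a keyed hash rather than a plain hash means someone who sees the output cannot simply hash a list of known identifiers to reverse the pseudonyms, provided the key stays out of the mapper author's hands.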

Security best practices for nonrelational data stores. The use of NoSQL and other large-scale, nonrelational data stores may create new security issues because these stores can lack capabilities in several vital areas: robust authentication, encryption for data at rest or in transit, logging, and data tagging and classification. Organizations need to consider the use of separate application or middleware layers to enforce authentication and data integrity. All stored passwords must be encrypted or hashed, and any connections to the system should ideally use Secure Sockets Layer/Transport Layer Security. Ensure logs are generated from all transactions around sensitive data as well.
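As a rough illustration of the middleware idea, the following Python sketch wraps a data store that has no authentication of its own and keeps only salted, deliberately slow password hashes. The `AuthenticatingStore` class, the in-memory dictionary standing in for the NoSQL backend and the `ITERATIONS` value are assumptions made for this example; transport encryption and logging would still need to be handled separately.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; tune for your own hardware


def hash_password(password, salt=None):
    """Derive a salted PBKDF2-HMAC-SHA256 hash; returns (salt, digest)."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected):
    """Recompute the hash and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


class AuthenticatingStore:
    """Hypothetical middleware in front of a store that lacks authentication."""

    def __init__(self, backend):
        self._backend = backend  # stand-in for a NoSQL client
        self._users = {}         # user name -> (salt, digest), never cleartext

    def add_user(self, name, password):
        self._users[name] = hash_password(password)

    def get(self, name, password, key):
        creds = self._users.get(name)
        if creds is None or not verify_password(password, *creds):
            raise PermissionError("authentication failed")
        return self._backend[key]


store = AuthenticatingStore({"order:42": {"total": 99}})
store.add_user("analyst", "correct horse battery staple")
print(store.get("analyst", "correct horse battery staple", "order:42"))
```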

Secure data storage and transaction logs. Data and transaction logs may be stored in multi-tiered storage media, but organizations need to defend against unauthorized access and ensure continuity and availability. Policy-based private key encryption can be used to ensure that only authenticated users and applications access the platform.

Endpoint input validation/filtering. In a big data implementation, numerous endpoints may submit data for processing and storage. To ensure only trusted endpoints are submitting data and that false or malicious data is not submitted, organizations need to vet each endpoint connecting to the corporate network. The working group does not have a practical set of suggestions for mitigating this concern, unfortunately, aside from the recommendation to incorporate the Trusted Platform Module chips (found in many newer endpoint devices) into the validation process where possible. Host-based and mobile device security controls could potentially alleviate the risk associated with untrusted endpoints, along with strong processes around system inventory tracking and maintenance.
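One way to picture endpoint vetting combined with input filtering is the Python sketch below. It is not the working group's recommendation and does not involve a Trusted Platform Module; it simply assumes each enrolled endpoint shares a secret with the collector, signs what it submits, and that readings outside an expected range are rejected. The `ENDPOINT_KEYS` registry, the `reading` field and the range limits are hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical registry of enrolled endpoints and their shared secrets.
ENDPOINT_KEYS = {"sensor-17": b"per-device-secret"}


def sign(endpoint_id, payload, key):
    """Compute an HMAC-SHA256 tag over the endpoint ID and canonical payload."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    message = endpoint_id.encode("utf-8") + b"." + body
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def accept(endpoint_id, payload, signature):
    """Accept data only from a known endpoint, untampered and within range."""
    key = ENDPOINT_KEYS.get(endpoint_id)
    if key is None:
        return False  # endpoint was never enrolled
    expected = sign(endpoint_id, payload, key)
    if not hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8")):
        return False  # payload altered or signed with the wrong key
    reading = payload.get("reading")
    if isinstance(reading, bool) or not isinstance(reading, (int, float)):
        return False  # wrong type
    return -50 <= reading <= 150  # assumed plausible range for this sensor


payload = {"reading": 21.5}
tag = sign("sensor-17", payload, ENDPOINT_KEYS["sensor-17"])
print(accept("sensor-17", payload, tag))
```

A shared secret only shows that the sender holds the key; it says nothing about whether the device itself has been compromised, which is the gap that hardware-backed attestation is meant to address.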

Real-time security monitoring. Monitoring big data platforms, as well as performing security analytics, should be done in near real time. Many traditional security information and event management platforms cannot keep pace with the large quantity (and formats) of data in use within true big data implementations. Currently, little true monitoring of Hadoop and other big data platforms exists, unless database and other front-end monitoring tools are in use.

Scalable and composable privacy-preserving data mining and analytics. Big data implementations can lead to privacy concerns around data leakage and exposure. There are a number of security controls that can be put in place to help organizations deal with this problem, including the use of strong encryption for data at rest, access controls to data, and separation-of-duties processes and controls to reduce the likelihood of successful insider attacks.

Cryptographically enforced data-centric security. Historically, the popular approach to data control has been to secure the systems that manage the data, as opposed to the data itself. However, those applications and platforms have proven vulnerable time and again. The use of strong cryptography to encapsulate sensitive data in cloud provider environments, together with new and innovative algorithms that more capably allow for key management and secure key exchange, is a more reliable method for managing access to data, especially as it exists in the cloud independent of any one platform.

From the editors: More guidance from the CSA

The Cloud Security Alliance provides guidance on numerous cloud security issues beyond big data, including a recent report on cloud service provider incident management and forensics. Based on that report, Dave Shackleford, who helps lead the Atlanta chapter of the CSA, provided enterprises with 10 questions that they could ask cloud providers concerning their forensics capabilities.

Granular access control. Enacting fine-grained access to big data stores such as NoSQL databases and the Hadoop Distributed File System requires the implementation of Mandatory Access Control and sound authentication. New NoSQL implementations such as Apache Accumulo can facilitate very granular access control to key-value pairs; cloud service providers should also be able to articulate the types of access controls that are in place in their environments.
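The idea of access control at the level of individual key-value pairs can be sketched in a few lines of Python. This is a deliberately simplified model and not the Apache Accumulo API: each cell here carries a set of labels that a reader must hold in full, and the `Cell` class, the `scan` function and the label names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """A key-value pair tagged with the labels required to read it."""
    key: str
    value: str
    required_labels: frozenset  # the reader must hold every label in this set


def scan(cells, authorizations):
    """Return only the cells whose required labels the reader fully holds."""
    held = frozenset(authorizations)
    return [cell for cell in cells if cell.required_labels <= held]


cells = [
    Cell("patient:1:name", "Pat Doe", frozenset({"pii", "clinical"})),
    Cell("patient:1:region", "emea", frozenset()),
    Cell("patient:1:diagnosis", "J45", frozenset({"clinical"})),
]

for cell in scan(cells, {"clinical"}):
    print(cell.key, cell.value)
```

Because the labels travel with each cell rather than with the table, two readers running the same query can legitimately see different results, which is what makes this style of control granular.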

Granular audits. In conjunction with continuous monitoring, regular audits and analysis of log and event data can help to detect intrusions or attack attempts within the big data environment. The key control to focus on here is logging at all layers within and surrounding the big data environment.
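As a small example of the kind of log analysis this implies, the Python sketch below flags any principal with a burst of denied requests inside a short window. The event format (timestamp in seconds, principal, outcome), the `DENIED` marker and the default threshold and window are assumptions for illustration; real audit data would come from the logging layers around the big data environment.

```python
from collections import defaultdict, deque


def find_denial_bursts(events, threshold=3, window_seconds=60):
    """Return the sorted principals with `threshold` or more denials in a window.

    `events` is an iterable of (timestamp_seconds, principal, outcome) tuples,
    assumed to be ordered by timestamp.
    """
    recent = defaultdict(deque)  # principal -> timestamps of recent denials
    flagged = set()
    for timestamp, principal, outcome in events:
        if outcome != "DENIED":
            continue
        window = recent[principal]
        window.append(timestamp)
        # Discard denials that have aged out of the sliding window.
        while timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) >= threshold:
            flagged.add(principal)
    return sorted(flagged)


events = [
    (0, "alice", "DENIED"),
    (0, "bob", "DENIED"),
    (5, "carol", "OK"),
    (10, "alice", "DENIED"),
    (20, "alice", "DENIED"),
    (100, "bob", "DENIED"),
    (200, "bob", "DENIED"),
]
print(find_denial_bursts(events))
```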

Data provenance. Provenance in this case is focused on data validation and trustworthiness. Authentication, end-to-end data protection and fine-grained access controls can help to verify and validate provenance in big data environments; cloud service providers should have these controls in place already to address other issues.


Big data collection and processing are performed in some fashion within many cloud service provider environments. While most consumer organizations may not have big data platforms and controls in place internally, it is critical to understand the major threats and risks posed to enterprise data within the cloud environment. By leveraging the work of the CSA working group for big data and focusing explicitly on the key controls that should be in place, enterprise consumers can properly evaluate the state of big data infrastructure and applications in their service providers' environments.

About the author:
Dave Shackleford is the owner and principal consultant of Voodoo Security LLC; lead faculty at IANS; and a SANS Institute analyst, senior instructor and course author. He has consulted with hundreds of organizations in the areas of security, regulatory compliance, and network architecture and engineering, and is a VMware vExpert with extensive experience designing and configuring secure virtualized infrastructures. He has previously worked as chief security officer for Configuresoft; chief technology officer for the Center for Internet Security; and as a security architect, analyst and manager for several Fortune 500 companies. Dave is the author of the Sybex book Virtualization Security: Protecting Virtualized Environments, as well as the co-author of Hands-On Information Security from Course Technology. Recently, Dave co-authored the first published course on virtualization security for the SANS Institute. He currently serves on the board of directors at SANS Institute and helps lead the Atlanta chapter of the Cloud Security Alliance.


Join the conversation



Good advice. Big data is definitely creating value for companies in numerous ways like cost savings and increased efficiency, but these benefits should not come at a loss of security and privacy of data. I work for McGladrey and there is a whitepaper on the website that offers useful information on security challenges of moving to the cloud that readers will find it interesting @ “Cloud risks striking a balance between savings and security” https://bit.ly/16uLsgi
There is a solution for endpoint input validation/filtering, which is software-defined networking. Only the devices authorised as part of the software-defined network can exchange data.