Data Engineer

Main Purpose of the Job
A Data Engineer uses information modeling and programming skills to clean, prepare, and optimize data so that Data Scientists or UX/Visual experts can derive insights from it. The role combines cognitive computing and advanced analytics technologies (for example, open-source tools modified to incorporate cognition) with traditional data engineering, and applies them to data sourced for specific engagements.

The Data Engineer is a software engineer who designs, builds, and integrates data from various sources, and who writes and manages complex queries so that data is easily accessible and the company’s data ecosystem performs optimally.

Candidates should have a strong data management background, experience handling both structured and unstructured data, and the ability to transform and analyze data using various tools or scripting.

When an initial proof of value/proof of concept (POV/POC) engagement proves the business case, the Data Engineer also collaborates with IT Solution Architects to embed the successful, value-generating pathfinder models into operations and helps design them as key components of industrialized solutions.

Key Outputs

Contribution to IT Strategy by facilitating exploration through POC/POV and key initiatives.

Articulates a vision and roadmap for the ingestion, cleansing, staging, harmonization, and exploitation of data as a valued corporate asset, in alignment with existing functional priorities, to help Product Managers explore new ways to solve complex business problems
Works on data requirements (provided by UX and Data Scientists) that will be used to train and develop models and algorithms to solve business challenges
As part of a POC/POV, creates data ingestion strategies, prepares data, assists in variable creation, develops information models or data staging strategies, and performs necessary data cleansing activities
Manages the data lifecycle during the POC/POV and begins defining strategies to embed the results into an industrialized model or service operations
Ensures data is managed in a secure and compliant way, even during the POC/POV, to avoid potential risks
Works with lead markets, functions, and GMB/RMB to conduct the POC/POV and bring it to closure

Operational Effectiveness and Efficiency by helping industrialize proven models

Supports product teams and Solution Architects in industrializing information models proven during the POC/POV by devising data collection procedures that include relevant information for building analytic systems
Assists Solution Architects in developing processes and tools to continuously monitor information model performance and data accuracy
Helps technical specialists design better descriptive and prescriptive analytics solutions by providing the foundation for semantic models that can be used to visualize information and develop reports on data analysis results to facilitate new KPI/PPI discussions
Assists technical specialists in API/interfacing technologies to better understand how to acquire data and build ingestion layers for industrialized information models
Promotes the use of services rather than full automation where a cost-benefit analysis shows that manual intervention is more appropriate

Stakeholder Engagement

Influences information architects on what should be part of the company’s core data assets and what has repeat value
Shares best practices with analytics and product teams and facilitates market enablement for similar initiatives
Collaborates with stakeholders across the organization to identify opportunities to leverage company data to drive business solutions and provide data sourcing advisory
Influences product teams, including Solution Architects, through presentations of data-based recommendations for evolving operational solutions with new and enhanced models, including effective semantic models and API connectors
Champions best practices for data management across delivery and recipient organizations

Key Experiences

Bachelor’s or Master’s degree or PhD in Computer Science, Engineering, or Management Information Systems
5+ years of experience in information modeling and data engineering
Ability to architect highly scalable distributed systems using open-source tools and big data technologies (such as Hadoop, HBase, Spark, Impala, Storm, etc.) integrated with other open-source or proprietary tools available through the Azure Marketplace, especially Cortana Intelligence components
Experience in cloud-based agile and DevOps environments with PaaS and IaaS
Experience using big data batch and streaming tools
Experience with SQL, NoSQL, relational database design (SAP HANA is a plus), efficient data retrieval methods, and data preparation/wrangling both on demand and in industrialized environments
Ability to gather and process raw data at scale (including writing scripts, web scraping, calling APIs, writing SQL queries, etc.)
Programming experience in Python, Scala, R, Java, and SQL (PowerShell and C# are an advantage)
Experience with basic and advanced data visualization: simple displays (e.g., Hue), use of notebooks (e.g., Jupyter, Zeppelin), and building reports and dashboards (e.g., Power BI, SAP BO suite)
Demonstrated ability to work with minimal supervision
Strong problem-solving skills with an emphasis on product development
Effective communication skills across different organizational levels and proficiency in English
Experience working in a global environment and with virtual teams