Data Engineer PySpark AWS

Fusemachines

Kathmandu

Experience: More than 2 years

Source: Other Source

Key Skills: Data Modeling, Data Architect, Data Pipelines, Apache Kafka, Apache Spark

This job expired 1 year, 1 month ago

Basic Job Information

Job Category	:	IT & Telecommunication
Job Level	:	Mid Level
No. of Vacancies	:	1
Employment Type	:	Full Time
Job Location	:	Kathmandu
Apply Before (Deadline)	:	May 17, 2024 15:45 (1 year, 1 month ago)

Job Specification

Experience Required	:	More than 2 years
Professional Skill Required	:	Data Modeling, Data Architect, Data Pipelines, Apache Kafka, Apache Spark

About the job

About The Role

This is a full-time position responsible for designing, building, testing, optimizing, and maintaining the infrastructure and code required for data integration, storage, processing, pipelines, and analytics (BI, visualization, and advanced analytics), from ingestion to consumption. The role also covers implementing data flow controls and ensuring high data quality and accessibility for analytics and business intelligence purposes. It requires a strong foundation in programming and a keen understanding of how to integrate and manage data effectively across various storage systems and technologies.

We are looking for a skilled Data Engineer with a strong background in Python, SQL, PySpark, and large-scale, cloud-based data solutions on AWS, and with a passion for data quality, performance, and cost optimization. The ideal candidate will develop in an Agile environment.

This role is perfect for an individual passionate about leveraging data to drive insights, improve decision-making, and support the strategic goals of the organization through innovative data engineering solutions.

Qualification & Experience

Must have a full-time Bachelor's degree in Computer Science, Information Systems, Engineering, or a related field

At least 2 years of experience as a data engineer with strong expertise in Python, SQL, PySpark, and AWS in an Agile environment, with a proven track record of building and optimizing data pipelines, architectures, and datasets, and proven experience in data storage, modeling, management, data lakes, warehousing, processing/transformation, integration, cleansing, validation, and analytics

2+ years of experience with DevOps tools and technologies: GitHub or AWS DevOps

Proven experience delivering large-scale Data and Analytics projects and products as a data engineer within AWS

Previous experience working with retail or other similar data models is preferred

The following certifications:

AWS Certified Cloud Practitioner

AWS Certified Data Engineer - Associate

Nice to have:

Databricks Certified Associate Developer for Apache Spark

Databricks Certified Data Engineer Associate

Required Skills/Competencies

Strong programming skills in one or more object-oriented languages such as Python (must have), Scala, or Java, and proficiency in writing high-quality, scalable, maintainable, efficient, and optimized code for data integration, storage, processing, manipulation, and analytics solutions.

Strong SQL skills and experience working with complex data sets and Enterprise Data Warehouses, and writing advanced SQL queries. Proficient with relational databases (RDS, MySQL, Postgres, or similar) and NoSQL databases (Cassandra, MongoDB, Neo4j, etc.)

Strong analytic skills related to working with structured and unstructured datasets

Thorough understanding of big data principles, techniques, and best practices

Experience with scalable and distributed data processing technologies such as Spark/PySpark (must have, including Spark SQL) and Kafka, to be able to handle large volumes of data

Experience with stream-processing systems (Storm, Spark Streaming, etc.) is a plus

Experience in implementing data pipelines and efficient ELT/ETL processes, both batch and real-time, in AWS and using open source solutions. Able to develop custom integration solutions as needed, including data integration from different sources such as APIs (POS integrations are a plus), ERP systems (Oracle and Allegra are a plus), databases, flat files, Apache Parquet, and event streaming, including cleansing, transformation, and validation of the data

Experience in data cleansing, transformation, and validation

Understanding of data modeling and database design principles, including a good understanding of dimensional data modeling and the ability to implement efficient database schemas that meet the requirements of the data solutions they support

Knowledge of cloud computing, specifically AWS services related to data and analytics, such as S3, EMR, Glue, SageMaker, RDS, Redshift, Lambda, Kinesis, Lake Formation, EC2, ECS/ECR, EKS, IAM, CloudWatch, etc., and of implementing data warehouse, data lake, and data lakehouse solutions in AWS

Experience in Orchestration using technologies like Azkaban, Luigi, Airflow, etc.

Good understanding of BI solutions including Looker and LookML (Looker Modeling Language)

Familiarity with advanced analytics and AI/ML services and tools, and the ability to integrate advanced analytics, machine learning, and AI capabilities into data solutions (nice to have)

Strong understanding of the software development lifecycle (SDLC), especially Agile methodologies

Knowledge of SDLC tools and technologies, including project management software (Jira or similar), source code management (GitHub, AWS CodeCommit, or similar), CI/CD systems (GitHub Actions, Jenkins, AWS CodePipeline, or similar), and binary repository managers (Sonatype Nexus, AWS CodeArtifact, or similar).

Knowledge and hands-on experience of DevOps principles, tools and technologies (GitHub and AWS DevOps) including continuous integration, continuous delivery (CI/CD), infrastructure as code (IaC – Terraform), configuration management, automated testing, performance tuning and cost management and optimization

Knowledge of data structures and algorithms and good software engineering practices

Strong analytical skills to identify and address technical issues, performance bottlenecks, and system failures

Proficiency in debugging and troubleshooting issues in complex data and analytics environments and pipelines

Understanding of Data Quality and Governance, including implementation of data quality and integrity checks and monitoring processes to ensure that data is accurate, complete, and consistent.

Good problem-solving skills: being able to troubleshoot data processing pipelines and identify performance bottlenecks and other issues.

Strong interpersonal skills and ability to work with a wide range of stakeholders

Excellent communication skills to collaborate with cross-functional teams, including business users, data architects, DevOps/DataOps/MLOps engineers, data analysts, data scientists, developers, and operations teams. The ability to convey complex technical concepts and insights to non-technical stakeholders effectively is essential

Ability to document processes, procedures, and deployment configurations

Understanding of security practices, including network security groups, encryption, and compliance standards, and the ability to implement security controls and best practices within data and analytics solutions, including proficient knowledge of and working experience with various cloud security vulnerabilities and ways to mitigate them.

Self-motivated with the ability to work well in a team

Strong project management and organizational skills

A willingness to stay updated with the latest services, Data Engineering trends, and best practices in the field

Comfortable with picking up new technologies independently and working in a rapidly changing environment with ambiguous requirements

Care about architecture, observability, testing, and building reliable infrastructure and data pipelines

Responsibilities:

Design, implement, deploy, test and maintain highly scalable and efficient data architectures, defining and maintaining standards and best practices for data management independently with minimal guidance

Ensure systems meet business requirements and industry practices for data integrity, performance, and reliability

Integrate new data management technologies and software engineering tools into existing structures

Create custom software components and analytics applications

Employ a variety of languages and tools to integrate systems and identify opportunities to improve current processes

Evaluate and advise on technical aspects of open work requests in the data pipeline with the project team

Handle ELT/ETL processes, including data extraction, loading, and transformation from different sources, ensuring consistency and quality

Transform and clean data for further analysis and storage

Design and optimize data models and schemas to support business requirements and analysis

Implement monitoring tools and systems to ensure the availability and performance of data systems.

Manage data security and access, ensuring confidentiality and integrity

Automate repetitive tasks and processes to improve operational efficiency

Collaborate with data science teams to establish pipelines and workflows for training, validation, deployment, and monitoring of machine learning models. Automate deployment and management of machine learning models in production environments

Contribute to data quality assurance efforts, such as implementing data validation checks and tests to ensure reliability, efficiency, accuracy, completeness and consistency of data

Test software solutions and meet product quality standards prior to release to QA

Ensure the reliability, scalability, and efficiency of data systems are maintained at all times. Identify and resolve performance bottlenecks in pipelines caused by data, queries, and processing workflows to ensure efficient and timely data delivery

Work with DevOps teams to optimize resources

Assist in the configuration and management of data warehousing and data lake solutions

Collaborate closely with cross-functional teams including Product, Engineering, Data Scientists, and Analysts to thoroughly understand data requirements and provide data engineering support. Extend the company’s data with third-party sources of information when needed

Take ownership of the storage layer and database management tasks, including schema design, indexing, and performance tuning

Evaluate and implement cutting-edge technologies and methodologies and continue learning and expanding skills in data engineering and cloud platforms, to improve and modernize existing data systems

Develop, design, and execute data governance strategies encompassing cataloging, lineage tracking, quality control, and data governance frameworks that align with current analytics demands and industry best practices, working closely with the Data Architect

Ensure technology solutions support the needs of the customer and/or organization

Define and document data engineering architectures, processes and data flows

Fusemachines is an Equal Opportunities Employer, committed to diversity and inclusion. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or any other characteristic protected by applicable federal, state, or local laws.
