Job description:
Responsibilities:
Export data from the Hadoop ecosystem to ORC or Parquet files
Build scripts to move data from on-prem to GCP
Build Python/PySpark pipelines (see the sketch after this list)
Transform the data as per the outlined data model
Proactively improve pipeline performance and efficiency
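
For illustration, here is a minimal PySpark sketch of the kind of pipeline these responsibilities describe: read a Hive table, transform it toward a target data model, and land the result as Parquet in GCS. The table name sales.orders, its columns, and the bucket gs://example-lake are hypothetical, and writing to a gs:// path assumes the GCS Hadoop connector is available on the cluster.

    # Minimal sketch only; table, columns, and bucket names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("hive-to-gcs-export")
        .enableHiveSupport()   # read source tables from the Hadoop/Hive ecosystem
        .getOrCreate()
    )

    # Read from Hive, aggregate per the outlined data model,
    # and write the result as partitioned Parquet to GCS.
    orders = spark.table("sales.orders")
    daily = (
        orders
        .withColumn("order_date", F.to_date("order_ts"))
        .groupBy("order_date", "region")
        .agg(F.sum("amount").alias("total_amount"))
    )
    daily.write.mode("overwrite").partitionBy("order_date").parquet(
        "gs://example-lake/curated/daily_orders/"
    )

The same job could target an ORC sink instead by swapping .parquet(...) for .orc(...).
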
'Must Have' Experience:
4+ years of Data Engineering work experience (ETL, SSIS, SSRS)
2+ years of building Python/PySpark pipelines
2+ years working with Hadoop/Hive
4+ years of experience with SQL
Any cloud experience with AWS, Azure, or GCP (GCP desired)
Experience with Data Warehousing & Data Lakes
Understanding of Data Modeling
Understanding of data file formats like ORC, Parquet, and Avro
'Nice to Have' Experience:
Google Cloud experience: Cloud Storage, Cloud Composer, Dataproc & BigQuery
Experience using cloud data warehouses like BigQuery (preferred), Amazon Redshift, Snowflake, etc.
Working knowledge of distributed file and object storage systems like GCS, S3, HDFS, etc.
Understanding of Airflow / Cloud Composer (a minimal DAG sketch follows this list)
CI/CD and DevOps experience
ETL tools, e.g., Informatica (IICS), Ab Initio, Infoworks, SSIS
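
For the Airflow / Cloud Composer item, a minimal DAG sketch is shown below. The DAG id, schedule, and the script it calls are hypothetical placeholders, not part of the role.

    # Minimal Airflow 2.x DAG sketch; id, schedule, and script path are hypothetical.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_orders_export",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Submit the (hypothetical) PySpark export job from the earlier sketch.
        export = BashOperator(
            task_id="export_to_gcs",
            bash_command="spark-submit /opt/jobs/hive_to_gcs_export.py",
        )

On Cloud Composer, the bash call would more typically be replaced with DataprocSubmitJobOperator from the Google provider package.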