Data Engineer PySpark Databricks Session, Day 6

Create a Spark Session: Set up the PySpark environment.
Define a List: Define the list with three elements.
Parallelize the List: Distribute the list across the cluster nodes.
Convert to DataFrame: Convert the distributed RDD to a DataFrame.
Perform Operations: Show the contents and perform any desired operations.

This video explains how to write your first program in PySpark.

???? Video Link: https://youtu.be/CFMvb0caNLk

Author's LinkedIn profile:
https://www.linkedin.com/in/sachin-saxena-graphic-designer/

Code Source Link:
https://lnkd.in/g67a4kY3

Explanation of the Code:

1. Spark Session: The SparkSession is created to provide an entry point for Spark functionality.
2. List Definition: A list of three elements is defined.
3. Parallelization: The list is parallelized with numSlices=3, which assigns each element to a different partition of the RDD. This is how the data is distributed across the three nodes.
4. Convert to DataFrame: The RDD is mapped to a tuple format so it can be converted into a DataFrame. The column is named "element".
5. Show Contents: The contents of the DataFrame are printed using df.show(), which displays each element as a separate row.
6. Count: The total number of elements is counted and printed.
7. Optional Operations: An optional step filters the DataFrame for elements containing "1" and displays the result.
8. Stop Spark Session: Finally, the Spark session is stopped to release resources. A full sketch of the program follows below.
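
A minimal sketch of that program, following the steps above. The element values and app name are illustrative assumptions, not taken from the video:

from pyspark.sql import SparkSession

# Entry point for Spark functionality
spark = SparkSession.builder.appName("FirstPySparkProgram").getOrCreate()

# A list of three elements (illustrative values)
data = ["item1", "item2", "item3"]

# numSlices=3 assigns each element to its own partition
rdd = spark.sparkContext.parallelize(data, numSlices=3)

# Map each element to a one-field tuple and convert to a DataFrame
df = rdd.map(lambda x: (x,)).toDF(["element"])

# Display each element as a separate row
df.show()

# Count and print the total number of elements
print(f"Total elements: {df.count()}")

# Optional: filter for elements containing "1"
df.filter(df.element.contains("1")).show()

# Release resources
spark.stop()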


1:22 # Databricks notebook source
2:56 Upload a CSV file to the Workspace
3:54 Databricks source
6:00 Show the number of students in the file
6:55 withColumn in PySpark
7:38 Schema in the Databricks notebook source
11:00 Create custom data
20:11 lit function in PySpark
26:00 Rename multiple columns in a single line using withColumnRenamed
27:55 Alias name for any column
31:00 Filter rows as in a SQL query in PySpark
32:00 select * from student where course in ('DB', 'Cloud', 'OOP') using the isin method
35:00 select * from student where course in ('DB', 'Cloud', 'OOP')
36:00 course_value = ['DB', 'Cloud', 'OOP']
39:00 The SQL LIKE operator
41:11 Search for a course with a particular string pattern
44:11 startswith in PySpark
44:33 endswith in PySpark
46:10 contains in PySpark
47:11 df.filter(df.name.like('%s%e%')).show() in PySpark
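
Short sketches of the operations indexed above follow. The student data, column names, and app names are illustrative assumptions, not taken from the video. First, withColumn, lit, withColumnRenamed, and alias:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName("StudentColumns").getOrCreate()

# Hypothetical student data standing in for the CSV used in the session
students = spark.createDataFrame(
    [("sachin", "DB", 85), ("steve", "Cloud", 72), ("asha", "OOP", 91)],
    ["name", "course", "marks"],
)

# withColumn: derive a new column from an existing one
students = students.withColumn("marks_pct", col("marks") / 100)

# lit: add a column holding the same constant value in every row
students = students.withColumn("session", lit("Day 6"))

# Chain withColumnRenamed to rename multiple columns in a single line
renamed = students.withColumnRenamed("name", "student_name").withColumnRenamed("course", "course_name")

# alias: give a column a different display name in a select
renamed.select(col("student_name").alias("student"), col("marks")).show()

Each withColumn call returns a new DataFrame, which is why the calls can be chained or reassigned.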
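Next, filtering rows the way a SQL IN clause would. The DataFrame is recreated here so the sketch stands alone:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StudentFilters").getOrCreate()

students = spark.createDataFrame(
    [("sachin", "DB"), ("steve", "Cloud"), ("asha", "OOP"), ("maria", "ML")],
    ["name", "course"],
)

# SQL: select * from student where course in ('DB', 'Cloud', 'OOP')
course_value = ['DB', 'Cloud', 'OOP']

# isin is the DataFrame counterpart of SQL's IN clause
students.filter(students.course.isin(course_value)).show()

# The same predicate written as a SQL string inside filter
students.filter("course in ('DB', 'Cloud', 'OOP')").show()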
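Finally, the string-pattern filters from the last few timestamps, on the same hypothetical data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StringPatterns").getOrCreate()

students = spark.createDataFrame(
    [("sachin", "DB"), ("steve", "Cloud"), ("asha", "OOP")],
    ["name", "course"],
)

# startswith / endswith / contains: plain substring predicates on a column
students.filter(students.name.startswith("s")).show()     # names beginning with "s"
students.filter(students.name.endswith("e")).show()       # names ending with "e"
students.filter(students.course.contains("loud")).show()  # courses containing "loud"

# like accepts SQL LIKE patterns; '%s%e%' matches an "s" followed later by an "e"
students.filter(students.name.like('%s%e%')).show()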