- Initialize Spark Session: Set up the PySpark environment.
- Define a List: Define the list with three elements.
- Parallelize the List: Distribute the list across the cluster nodes.
- Convert to DataFrame: Convert the distributed RDD to a DataFrame.
- Perform Operations: Show the contents and perform any desired operations.
This video explains how to write your first program in PySpark.
Video Link: https://youtu.be/CFMvb0caNLk
Author's LinkedIn profile:
https://www.linkedin.com/in/sachin-saxena-graphic-designer/
Source code link:
https://lnkd.in/g67a4kY3
Explanation of the Code (a sketch of all the steps follows this list):
1. Create SparkSession: The SparkSession is created to provide an entry point for Spark functionality.
2. List Creation: A list of three elements is defined.
3. Parallelization: The list is parallelized with numSlices=3, which assigns each element to a different partition in the RDD. This is how it is distributed across the three nodes.
4. Convert to DataFrame: The RDD is mapped to a tuple format to convert it into a DataFrame. The column is named "element".
5. Display DataFrame: The contents of the DataFrame are printed using df.show(), which displays each element as a separate row.
6. Count: The total number of elements is counted and printed.
7. Filter Operation: An optional step filters the DataFrame for elements containing "1" and displays the result.
8. Stop Spark Session: Finally, the Spark session is stopped to release resources.
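For reference, here is a minimal runnable sketch of the program described above. The column name "element", numSlices=3, and the filter on "1" come from the explanation; the app name and the three sample values are placeholders.

from pyspark.sql import SparkSession

# 1. Create SparkSession: the entry point for Spark functionality
spark = SparkSession.builder.appName("FirstPySparkProgram").getOrCreate()

# 2. List Creation: a list of three elements (sample values assumed)
elements = ["item1", "item2", "item3"]

# 3. Parallelization: numSlices=3 puts one element in each partition
rdd = spark.sparkContext.parallelize(elements, numSlices=3)

# 4. Convert to DataFrame: map each element to a 1-tuple, name the column "element"
df = rdd.map(lambda x: (x,)).toDF(["element"])

# 5. Display DataFrame: one row per element
df.show()

# 6. Count: total number of elements
print("Total elements:", df.count())

# 7. Filter Operation (optional): rows whose element contains "1"
df.filter(df.element.contains("1")).show()

# 8. Stop Spark Session: release resources
spark.stop()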
1:22 # Databricks notebook source
2:56 Upload a CSV file to the Workspace
3:54 Databricks source
6:00 Show the number of students in the file
6:55 withColumn in PySpark
7:38 Schema in the Databricks notebook
11:00 create custom data
20:11 lit command in PySpark
26:00 Rename multiple columns in a single line using withColumnRenamed
27:55 Alias name for a column
31:00 Filter rows as in a SQL query in PySpark
32:00 select * from student where course in ['DB', 'Cloud','OOP'] using the isin method
35:00 select * from student where course in ['DB', 'Cloud','OOP']
36:00 course_value= ['DB', 'Cloud','OOP']
39:00 like operator as in SQL
41:11 Search course with a particular string pattern
44:11 startswith in PySpark
44:33 endswith in PySpark
46:10 contains in PySpark
47:11 df.filter(df.name.like('%s%e%')).show() in PySpark
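Putting the chaptered operations together, here is a sketch assuming a small student DataFrame; the sample rows, the column names, and the "college" value are invented for illustration, while the isin list and df.filter(df.name.like('%s%e%')) come from the chapters above.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName("FilterExamples").getOrCreate()

# Sample student data (invented for illustration)
df = spark.createDataFrame(
    [("seema", "DB"), ("amit", "Cloud"), ("john", "ML")],
    ["name", "course"],
)

# lit: add a constant-valued column
df = df.withColumn("college", lit("ABC College"))

# Rename multiple columns in a single line with chained withColumnRenamed
renamed = df.withColumnRenamed("name", "student_name") \
            .withColumnRenamed("course", "course_name")
renamed.show()

# alias: give a column a display name in a select
df.select(col("name").alias("student")).show()

# isin: equivalent of SQL `where course in ('DB', 'Cloud', 'OOP')`
course_value = ["DB", "Cloud", "OOP"]
df.filter(df.course.isin(course_value)).show()

# like: SQL-style wildcard pattern
df.filter(df.name.like("%s%e%")).show()

# startswith / endswith / contains on string columns
df.filter(df.course.startswith("D")).show()
df.filter(df.course.endswith("B")).show()
df.filter(df.course.contains("lo")).show()

spark.stop()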