English | 简体中文
Mother and baby products are a popular shopping category on Taobao. With the implementation of the national two-child and three-child policies, sales of mother and baby products are expected to increase further. At the same time, the parenting concepts of the younger generation of parents have changed significantly, driving the diversified development of China's mother and baby e-commerce market. As a result, major mother and baby brands compete more intensely, and more and more brands see potential business opportunities in the industry and are entering the mother and baby e-commerce field, making competition increasingly fierce. Based on this, this project conducts a visual analysis of the "Taobao Mother and Baby Shopping" dataset to better mine data insights.
The data for this analysis comes from the Taobao Mother and Baby Shopping Behavior Dataset, with field adjustments made to the original data. It includes the following two tables:
Table `tianchi_mum_baby` (user information):

| Field | Description | Note |
|---|---|---|
| user_id | User ID | Sampled and anonymized |
| birthday | Baby's birth date | Self-reported, may be inaccurate; format: YYYYMMDD |
| gender | Baby's gender | 0 = boy, 1 = girl, 2 = unknown; self-reported, may be inaccurate |
Table `tianchi_mum_baby_trade_history` (purchase records):

| Field | Description | Note |
|---|---|---|
| user_id | User ID | |
| auction_id | Transaction (order) ID | |
| category_1 | Product primary category ID | |
| category_2 | Product secondary category ID | |
| buy_amount | Purchase quantity | |
We use the open-source components Airbyte, ClickHouse, and Superset of the KDP platform to complete the following simple business analysis tasks, fully mining the value of the data through data analysis and visual display so that data better serves the business.
Install the following components on the KDP page and complete each component's QuickStart: Airbyte, ClickHouse, Superset, and Airflow.
In real projects, we usually need to schedule jobs. Here we use Airflow to achieve job scheduling.
We use the Airbyte component of the platform to synchronize data from the original data source (CSV file) to ClickHouse.
Add a file type source in Airbyte to read `tianchi_mum_baby`:

- Source name: tianchi_mum_baby
- Dataset Name: tianchi_mum_baby
- URL (do not modify): https://gitee.com/linktime-cloud/example-datasets/raw/main/airbyte/tianchi_mum_baby.csv
Add a file type source in Airbyte to read `tianchi_mum_baby_trade_history`:

- Source name: tianchi_mum_baby_trade_history
- Dataset Name: tianchi_mum_baby_trade_history
- URL (do not modify): https://gitee.com/linktime-cloud/example-datasets/raw/main/airbyte/tianchi_mum_baby_trade_history.csv
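As an optional sanity check before configuring the sources, the two CSVs can be previewed locally. A minimal sketch using pandas (this is not part of the pipeline itself):

```python
# Quick look at the two source CSVs; column names in the comments reflect
# the field tables above.
import pandas as pd

base = "https://gitee.com/linktime-cloud/example-datasets/raw/main/airbyte"
users = pd.read_csv(f"{base}/tianchi_mum_baby.csv")
trades = pd.read_csv(f"{base}/tianchi_mum_baby_trade_history.csv")

print(users.head())    # user_id, birthday, gender
print(trades.head())   # user_id, auction_id, categories, buy_amount
print(len(users), len(trades))
```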
Add a ClickHouse type destination in Airbyte with the following parameters:

- Host: clickhouse
- Port: 8123
- DB Name: default
- User: default
- Password: ckdba.123
Add a connection in Airbyte. Select `tianchi_mum_baby` as the source and ClickHouse as the destination, use the default configuration for the next step, set Schedule type to `Manual`, and then save. Manually click the sync button to synchronize the data.
Add a connection in Airbyte. Select `tianchi_mum_baby_trade_history` as the source and ClickHouse as the destination, use the default configuration for the next step, set Schedule type to `Manual`, and then save. Manually click the sync button to synchronize the data.
Check the job status in Airbyte. If successful, it means the data has been successfully imported into ClickHouse.
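Beyond the Airbyte UI, the import can also be verified by querying ClickHouse directly. A minimal sketch using the clickhouse-connect driver (note that Airbyte may prefix or rename destination tables depending on its version and sync mode, so adjust the table names to match what you see in ClickHouse):

```python
# Optional check: count rows in the tables that Airbyte wrote to ClickHouse.
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="clickhouse", port=8123, username="default", password="ckdba.123"
)
for table in ("tianchi_mum_baby", "tianchi_mum_baby_trade_history"):
    count = client.query(f"SELECT count() FROM default.{table}").result_rows[0][0]
    print(table, count)
```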
After completing the above operations, the data has been successfully imported into ClickHouse. Next, we will use ClickHouse for data ETL processing.
After importing the data from the CSV files into ClickHouse, perform ETL processing on the data for visual analysis in Superset. The relevant Airflow code is sketched below; refer to Github or Gitee for the full code.
```python
@task
def clickhouse_upsert_userinfo():
    ...

@task
def clickhouse_upsert_user_purchases():
    ...

@task
def clickhouse_upsert_dws_user_purchases():
    ...
```
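The task bodies are elided above; the real implementations live in the repository. Purely as an illustration, one of these ClickHouse upsert tasks might look roughly like the following sketch. The table name, column types, and engine choice here are assumptions for demonstration, not the project's actual schema:

```python
# Illustrative sketch only -- see the Github/Gitee repo for the real task bodies.
from airflow.decorators import task
import clickhouse_connect


@task
def clickhouse_upsert_userinfo_sketch():
    client = clickhouse_connect.get_client(
        host="clickhouse", port=8123, username="default", password="ckdba.123"
    )
    # ReplacingMergeTree deduplicates rows that share the same ORDER BY key
    # on merge, giving upsert-like semantics across repeated syncs.
    client.command("""
        CREATE TABLE IF NOT EXISTS dwd_user_info (
            user_id  UInt64,
            birthday Nullable(Date),
            gender   UInt8
        ) ENGINE = ReplacingMergeTree ORDER BY user_id
    """)
    # Source table name assumes Airbyte's default naming; adjust if prefixed.
    client.command("""
        INSERT INTO dwd_user_info
        SELECT user_id,
               toDate(parseDateTimeBestEffortOrNull(toString(birthday))),
               gender
        FROM tianchi_mum_baby
    """)
```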
Note: data warehouse modeling has established naming conventions and data layering; they are handled in a simplified way here.
In real projects, data changes over time, so job scheduling is needed to update the data. For basic use of Airflow, refer to the application's QuickStart.
In the data development chapter, an Airflow DAG is written that mainly schedules two types of jobs: Airbyte data synchronization jobs and ClickHouse ETL jobs.
Airbyte itself can be configured for scheduling, but to manage jobs uniformly and handle dependencies between jobs, Airflow is used for scheduling here. The following mainly introduces how to schedule Airbyte jobs through Airflow.
Add an HTTP type connection in Airflow (this connection will be used in subsequent DAGs):

- Connection Id: airbyte
- Connection Type: HTTP
- Host: airbyte-airbyte-server-svc
- Port: 8001
Airbyte Job Task

The code is as follows:
```python
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

user_info_to_clickhouse = AirbyteTriggerSyncOperator(
    task_id='user_info_to_clickhouse',
    airbyte_conn_id='airbyte',
    connection_id='dd5ebac2-2827-460a-bc5b-d96b68eb5e8a',  # Please use your Airbyte connection id
    asynchronous=False,
    timeout=3600,
    wait_seconds=3
)

user_purchases_to_clickhouse = AirbyteTriggerSyncOperator(
    task_id='user_purchases_to_clickhouse',
    airbyte_conn_id='airbyte',
    connection_id='1f37ca26-b381-4fb9-8b68-fa1e633cc505',  # Please use your Airbyte connection id
    asynchronous=False,
    timeout=3600,
    wait_seconds=3
)
```
- `airbyte_conn_id`: the ID of the HTTP type connection added above.
- `connection_id`: the ID of the connection in Airbyte, which can be obtained from its URL. For example, if the URL of a connection is http://airbyte-kdp-data.kdp-e2e.io/workspaces/xyz/connections/abc, the connection_id is abc.
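Since one motivation for using Airflow here is to handle dependencies between jobs, the Airbyte sync tasks above can be wired upstream of the ClickHouse ETL tasks. A plausible sketch (the exact ordering in the repository's DAG may differ):

```python
# Run both Airbyte syncs before the ClickHouse ETL tasks.
syncs = [user_info_to_clickhouse, user_purchases_to_clickhouse]

user_info = clickhouse_upsert_userinfo()
user_purchases = clickhouse_upsert_user_purchases()
dws = clickhouse_upsert_dws_user_purchases()

syncs >> user_info
syncs >> user_purchases
[user_info, user_purchases] >> dws  # the dws layer aggregates both upstream tables
```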
Please fork the code and update the `gitSync` configuration of the `dags` in Airflow on KDP to use your own git repository.
Refer to Using the Airbyte Operator to orchestrate Airbyte OSS for scheduling Airbyte jobs.
Enable the DAG named `taobao-user-purchases-example` in Airflow. The job is scheduled daily; when first enabled, it backfills the schedule for the previous day. It can also be triggered manually. Click the DAG name to view its run status.
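For reference, a minimal sketch of how such a daily schedule could be declared (names and dates are assumptions; the actual DAG definition is in the repository). With `catchup=False`, enabling the DAG triggers only the most recently missed daily interval, which matches the one-day backfill behavior described above:

```python
from datetime import datetime
from airflow.decorators import dag


@dag(
    dag_id="taobao-user-purchases-example",
    schedule="@daily",              # run once per day
    start_date=datetime(2024, 1, 1),
    catchup=False,                  # only the latest missed interval runs on first enable
)
def taobao_user_purchases_example():
    ...  # sync and ETL tasks are defined and wired here


taobao_user_purchases_example()
```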
DAG List
DAG Running Status
Superset supports ClickHouse data sources, and we can use Superset for visual analysis of the data. It is recommended that users complete the KDP Superset App QuickStart first.
Log in to Superset at http://superset-kdp-data.kdp-e2e.io (note: add local host resolution first) with the account `admin` and password `admin`. The password of the ClickHouse user `default` is `ckdba.123`; it will be needed when adding the data connection below.
Add Data Connection

Click Settings - Database Connections - + Database on the right side of the page to add a ClickHouse data source.
Fill in the following parameters:

- HOST: clickhouse
- PORT: 8123
- DATABASE NAME: default
- USERNAME: default
- PASSWORD: ckdba.123
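Alternatively, the same connection can be entered as a single SQLAlchemy URI, assuming the clickhouse-connect driver that Superset's ClickHouse support typically uses (the scheme depends on the installed driver):

```
clickhousedb://default:ckdba.123@clickhouse:8123/default
```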
Add DATASET

Click DATASETS in the top navigation, then + Dataset at the top right to add a dataset.
Then refer to the Superset official documentation to create charts.
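The aggregations behind the findings below can also be sanity-checked directly against ClickHouse before charting. A hypothetical query sketch (the `dws_user_purchases` table and its column names are assumptions inferred from the ETL task names above, not the project's confirmed schema):

```python
# Hypothetical check of purchase volume by primary category and gender.
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="clickhouse", port=8123, username="default", password="ckdba.123"
)
rows = client.query("""
    SELECT category_1,
           gender,
           sum(buy_amount) AS total_amount
    FROM dws_user_purchases
    GROUP BY category_1, gender
    ORDER BY total_amount DESC
    LIMIT 10
""").result_rows
for row in rows:
    print(row)
```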
From the charts we can see that:

- The primary category with the largest purchase volume is 50022520.
- Boys' purchase volume in category 50014815 far exceeds that of girls, being more than double.
- The best-selling secondary category is 50018831 within the primary category 50014815, accounting for 64% of the sales of that primary category.

Machine Learning

Additionally, we can use machine learning to predict user purchasing behavior and make recommendations. For users who did not purchase as expected, we can push coupons to attract them to buy. To Be Continued...