Commit c0bb6b6

Bulk TTL modification
Added ability to bulk modify TTL
1 parent c07abc9 commit c0bb6b6

File tree

4 files changed

+370
-0
lines changed

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
## Using Glue Modify TTL Example

This example provides a Scala script for modifying the TTL across many or all rows in a table.

## Prerequisites
* Set up the Spark Cassandra connector using the provided [setup script](../)

### Setup Modify TTL Job
The following script sets up an AWS Glue job to modify the TTL on a table. The script takes the following parameters:
* PARENT_STACK_NAME is the stack name used to create the Spark Cassandra connector with Glue. [setup script](../)
* TTL_STACK_NAME is the stack name used to create the Glue job.
* KEYSPACE_NAME and TABLE_NAME form the fully qualified name of the table you wish to modify.
* TTL_FIELD is the column used to read the existing TTL value.
* TTL_TIME_TO_ADD is the amount of time, in seconds, to add to the existing TTL value.

```shell
./setup-modify-ttl.sh PARENT_STACK_NAME TTL_STACK_NAME KEYSPACE_NAME TABLE_NAME TTL_FIELD TTL_TIME_TO_ADD
```
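TTL_TIME_TO_ADD is a number of seconds (the stack default of 2592000 corresponds to 30 days). Values for other durations can be computed with shell arithmetic, as in this small sketch:

```shell
# TTL_TIME_TO_ADD is in seconds; compute common durations with shell arithmetic
DAYS=30
TTL_TIME_TO_ADD=$(( DAYS * 24 * 60 * 60 ))
echo "$TTL_TIME_TO_ADD"    # 2592000 (30 days)

# a negative value subtracts time from the existing TTL
SUBTRACT_ONE_WEEK=$(( -7 * 24 * 60 * 60 ))
echo "$SUBTRACT_ONE_WEEK"  # -604800
```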

### Running the script from the CLI

Running the job can be done through the AWS CLI. The following example runs the job created in the previous step, but overrides the number of Glue workers, the worker type, and script arguments such as the table name. You can override any of the Glue job parameters and default arguments at run time.

```shell
aws glue start-job-run --job-name AmazonKeyspacesModifyTTL-aksglue-aksglue-export --number-of-workers 8 --worker-type G.2X --arguments '{"--TABLE_NAME":"keyvalue"}'
```

Full list of AWS CLI arguments: [start-job-run arguments](https://docs.aws.amazon.com/cli/latest/reference/glue/start-job-run.html)

### List of arguments

| argument          | definition                                    | default |
| :---------------- | :-------------------------------------------- | ----: |
| --KEYSPACE_NAME   | Name of the keyspace of the table to modify   | none |
| --TABLE_NAME      | Name of the table to modify                   | none |
| --TTL_FIELD       | Name of the field used to read the TTL value  | none |
| --TTL_TIME_TO_ADD | Amount of time to add to the existing TTL     | none |
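The job's `addTimeToExistingTTL` helper adds TTL_TIME_TO_ADD to each row's remaining TTL and clamps the result to a minimum of one second, so subtracting more time than remains makes the row expire almost immediately. The same arithmetic, sketched in shell for illustration:

```shell
# Mirrors the Scala helper addTimeToExistingTTL: add a (possibly negative)
# delta to the remaining TTL and clamp to a minimum of 1 second.
add_time_to_existing_ttl() {
  local current_ttl=$1
  local time_to_add=$2
  local new_ttl=$(( current_ttl + time_to_add ))
  if [ "$new_ttl" -lt 1 ]; then
    new_ttl=1
  fi
  echo "$new_ttl"
}

add_time_to_existing_ttl $(( 60 * 86400 )) $((  30 * 86400 ))  # 7776000 (90 days)
add_time_to_existing_ttl $(( 10 * 86400 )) $(( -30 * 86400 ))  # 1 (clamped)
```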
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
AWSTemplateFormatVersion: 2010-09-09
Description: 'Modify TTL Glue Job for Amazon Keyspaces'
Parameters:
  KeyspaceName:
    NoEcho: false
    Description: Cassandra Keyspace name
    Type: String
    Default: mykeyspace
    MinLength: 3
    MaxLength: 48
  TableName:
    NoEcho: false
    Description: Cassandra Table name
    Type: String
    Default: mytable
    MinLength: 3
    MaxLength: 48
  ParentStack:
    NoEcho: false
    Description: Stack used to set up the Spark Cassandra connector
    Type: String
    Default: aksglue1
    MinLength: 3
    MaxLength: 48
  TTLField:
    NoEcho: false
    Description: The field used to read the existing TTL value
    Type: String
    Default: ttl
    MinLength: 3
    MaxLength: 48
  TTLTimeToAdd:
    NoEcho: false
    Description: The time, in seconds, to add to the existing TTL
    Type: String
    Default: 2592000 # 30 days
Resources:
  GlueJob:
    Type: AWS::Glue::Job
    Properties:
      Command:
        Name: glueetl
        ScriptLocation: !Sub
          - "s3://${IMPORTBUCKETNAME}/scripts/${ParentStack}-${AWS::StackName}-modify-ttl.scala"
          - IMPORTBUCKETNAME:
              Fn::ImportValue:
                !Sub 'KeyspacesBucketNameExport-${ParentStack}'
      DefaultArguments:
        "--job-language": "scala"
        "--user-jars-first": "true"
        "--extra-jars": !Sub
          - 's3://${IMPORTBUCKETNAME}/jars/spark-cassandra-connector-assembly_2.12-3.1.0.jar,s3://${IMPORTBUCKETNAME}/jars/aws-sigv4-auth-cassandra-java-driver-plugin-4.0.9-shaded.jar,s3://${IMPORTBUCKETNAME}/jars/spark-extension_2.12-2.8.0-3.4.jar,s3://${IMPORTBUCKETNAME}/jars/amazon-keyspaces-helpers-1.0-SNAPSHOT.jar'
          - IMPORTBUCKETNAME:
              Fn::ImportValue:
                !Sub 'KeyspacesBucketNameExport-${ParentStack}'
        "--extra-files": !Sub
          - 's3://${IMPORTBUCKETNAME}/conf/keyspaces-application.conf'
          - IMPORTBUCKETNAME:
              Fn::ImportValue:
                !Sub 'KeyspacesBucketNameExport-${ParentStack}'
        "--enable-metrics": "true"
        "--enable-continuous-cloudwatch-log": "true"
        "--enable-spark-ui": "true"
        "--spark-event-logs-path": !Sub
          - "s3://${IMPORTBUCKETNAME}/spark-logs/"
          - IMPORTBUCKETNAME:
              Fn::ImportValue:
                !Sub 'KeyspacesBucketNameExport-${ParentStack}'
        "--write-shuffle-files-to-s3": "true"
        "--write-shuffle-spills-to-s3": "true"
        "--TempDir": !Sub
          - 's3://${IMPORTBUCKETNAME}/shuffle-space/modify-ttl-sample/'
          - IMPORTBUCKETNAME:
              Fn::ImportValue:
                !Sub 'KeyspacesBucketNameExport-${ParentStack}'
        "--KEYSPACE_NAME": !Sub '${KeyspaceName}'
        "--TABLE_NAME": !Sub '${TableName}'
        "--DRIVER_CONF": "keyspaces-application.conf"
        "--TTL_FIELD": !Sub '${TTLField}'
        "--TTL_TIME_TO_ADD": !Sub '${TTLTimeToAdd}'
        #"--DISTINCT_KEYS": "id,create_date"
        "--class": "GlueApp"
      #Connections:
      #  ConnectionsList
      Description: 'modify ttl'
      #ExecutionClass: String
      #ExecutionProperty:
      #  ExecutionProperty
      GlueVersion: "3.0"
      #LogUri: String
      #MaxCapacity: Double
      #MaxRetries: Double
      Name: !Sub ['AmazonKeyspacesModifyTTL-${STACKNAME}', STACKNAME: !Join [ "-", [!Ref ParentStack, !Ref AWS::StackName]]]
      #NonOverridableArguments: Json
      #NotificationProperty:
      #  NotificationProperty
      NumberOfWorkers: 2
      Role:
        Fn::ImportValue:
          !Sub 'KeyspacesGlueJobServiceRoleExport-${ParentStack}'
      #SecurityConfiguration: String
      #Tags: Json
      #Timeout: Integer
      WorkerType: G.2X
Outputs:
  KeyspacesGlueJobName:
    Description: Glue job name
    Value: !Sub ['AmazonKeyspacesModifyTTL-${STACKNAME}', STACKNAME: !Join [ "-", [!Ref ParentStack, !Ref AWS::StackName]]]
    Export:
      Name: !Sub ['KeyspacesModifyTTLJobName-${STACKNAME}', STACKNAME: !Ref AWS::StackName]
Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.log.GlueLogger
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.cassandra._
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._
import com.datastax.oss.driver.api.core._
import scala.collection.JavaConverters._

object GlueApp {

    //currentTTL is the time (in seconds) left on the record
    //timeToAdd is the delta to add or subtract. Use a negative number for subtraction.
    def addTimeToExistingTTL(currentTTL: Int, timeToAdd: Int): Int = {

        val finalTTLValue = currentTTL + timeToAdd

        // Guard against the case where the new TTL would be zero or negative,
        // e.g. when subtracting more time than remains; clamp to 1 second so the row expires almost immediately.
        // TODO: It may be more efficient to delete the row than to modify it and let it expire.
        Math.max(1, finalTTLValue)
    }

    //update the row with the new ttl using a lightweight transaction (LWT)
    //to update the ttl we must overwrite the row using the same values
    //LWT checks that the value has not changed since the row was read for its current ttl.
    //This example assumes a table with 'key' and 'value' columns; adjust the statement for your schema.
    def updateRowWithLWT(row: Row, connector: CassandraConnector, keyspaceName: String, tableName: String): Unit = {
        //openSession creates a session or increments a reference counter on a shared session.
        val session = connector.openSession()

        val query =
          s"""UPDATE $keyspaceName.$tableName
             |USING TTL ?
             |SET value = ?
             |WHERE key = ?
             |IF value = ?""".stripMargin

        //prepared statements are cached by the driver, so preparing the same query multiple times is not an issue.
        val prepared = session.prepare(query)

        val key = row.getAs[String]("key")
        val value = row.getAs[String]("value")
        val expectedValue = row.getAs[String]("value")
        val ttl = row.getAs[Int]("ttlCol")

        //bind the values to the prepared statement.
        val bound = prepared.bind(
          java.lang.Integer.valueOf(ttl),
          value, key, expectedValue)

        val result = session.execute(bound)

        // Optional: check whether the LWT succeeded
        if (!result.wasApplied()) {
          println(s"Conditional update failed for id=$key")
          // Here you may want to:
          //1. read the latest row and ttl
          //2. apply the correct ttl
          //3. use LWT to avoid conflicts
        }
        session.close()
    }

    def main(sysArgs: Array[String]) {

        val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME", "KEYSPACE_NAME", "TABLE_NAME", "DRIVER_CONF", "TTL_FIELD", "TTL_TIME_TO_ADD").toArray)

        val driverConfFileName = args("DRIVER_CONF")

        val conf = new SparkConf()
            .setAll(
             Seq(
                ("spark.task.maxFailures", "100"),

                ("spark.cassandra.connection.config.profile.path", driverConfFileName),
                ("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions"),
                ("directJoinSetting", "on"),

                ("spark.cassandra.output.consistency.level", "LOCAL_QUORUM"), //WRITES
                ("spark.cassandra.input.consistency.level", "LOCAL_ONE"), //READS

                ("spark.cassandra.sql.inClauseToJoinConversionThreshold", "0"),
                ("spark.cassandra.sql.inClauseToFullScanConversionThreshold", "0"),
                ("spark.cassandra.concurrent.reads", "50"),

                ("spark.cassandra.output.concurrent.writes", "5"),
                ("spark.cassandra.output.batch.grouping.key", "none"),
                ("spark.cassandra.output.batch.size.rows", "1"),
                ("spark.cassandra.output.ignoreNulls", "true")
            ))

        val spark: SparkContext = new SparkContext(conf)
        val glueContext: GlueContext = new GlueContext(spark)
        val sparkSession: SparkSession = glueContext.getSparkSession

        import sparkSession.implicits._

        Job.init(args("JOB_NAME"), glueContext, args.asJava)

        val logger = new GlueLogger

        //validation steps for peers and partitioner
        val connector = CassandraConnector.apply(conf)
        val session = connector.openSession()
        val peersCount = session.execute("SELECT * FROM system.peers").all().size()

        val partitioner = session.execute("SELECT partitioner from system.local").one().getString("partitioner")

        logger.info("Total number of peers:" + peersCount)
        logger.info("Configured partitioner:" + partitioner)

        if(peersCount == 0){
           throw new Exception("No system peers found. Check required permissions to read from the system.peers table. If using VPCE check permissions for describing VPCE endpoints. https://docs.aws.amazon.com/keyspaces/latest/devguide/vpc-endpoints.html")
        }

        if(partitioner.equals("com.amazonaws.cassandra.DefaultPartitioner")){
           throw new Exception("Spark requires the use of RandomPartitioner or Murmur3Partitioner. See Working with partitioners in the Amazon Keyspaces documentation. https://docs.aws.amazon.com/keyspaces/latest/devguide/working-with-partitioners.html")
        }

        val tableName = args("TABLE_NAME")
        val keyspaceName = args("KEYSPACE_NAME")

        val tableDf = sparkSession.read
            .format("org.apache.spark.sql.cassandra")
            .options(Map( "table" -> tableName,
                          "keyspace" -> keyspaceName,
                          "pushdown" -> "false")) //set to true when executing against Apache Cassandra, false when working with Keyspaces
            .load()
            //.filter("my_column=='somevalue' AND my_othercolumn=='someothervalue'")

        // Register the UDF for calculating the new TTL
        val calculateTTLUDF = udf((currentTTL: Int, timeToAdd: Int) => addTimeToExistingTTL(currentTTL, timeToAdd))

        val timeToAdd = args("TTL_TIME_TO_ADD").toInt
        val ttlField = args("TTL_FIELD")
        // val timeToAdd = 5 * 365 * 24 * 60 * 60 //add 5 years
        // val timeToAdd = -1 * 365 * 24 * 60 * 60 //subtract 1 year
        // Calculate TTL values
        val tableDfWithTTL = tableDf
            .withColumn("ttlCol", calculateTTLUDF(ttl(col(ttlField)), lit(timeToAdd)))

        tableDfWithTTL.foreachPartition { partition: Iterator[Row] =>
            partition.foreach { row => updateRowWithLWT(row, connector, keyspaceName, tableName) }
        }

        Job.commit()
    }
}
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
#!/bin/bash

echo "Positional Arguments: PARENT_STACK_NAME, STACK_NAME, KEYSPACE_NAME, TABLE_NAME, TTL_FIELD, TTL_TIME_TO_ADD"
echo ""
echo "PARENT_STACK_NAME: Stack name used for setting up the connector"
echo "STACK_NAME: Stack name used for setting up the glue job"
echo "KEYSPACE_NAME: Keyspace containing the table to modify"
echo "TABLE_NAME: Table to modify the TTL for"
echo "TTL_FIELD: Field used to read the existing TTL value"
echo "TTL_TIME_TO_ADD: Time in seconds to add to the TTL"

PARENT_STACK_NAME=${1:-aksglue}
STACK_NAME="${2:-$PARENT_STACK_NAME-export}"
KEYSPACE_NAME=${3:-mykeyspace}
TABLE_NAME=${4:-mytable}
TTL_FIELD=${5:-ttl}
TTL_TIME_TO_ADD=${6:-2592000} # 30 days

echo "Parent stack used: ${PARENT_STACK_NAME}"
echo "Stack name used: ${STACK_NAME}"
echo "Keyspace used: ${KEYSPACE_NAME}"
echo "Table used: ${TABLE_NAME}"
echo "TTL field used: ${TTL_FIELD}"
echo "TTL time to add: ${TTL_TIME_TO_ADD}"

if ! command -v aws &> /dev/null; then
    echo "AWS CLI \"aws\" is not installed. aws is required for deploying artifacts to s3. See https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html"
    exit 1
fi

export KEYSPACES_GLUE_BUCKET=$(aws cloudformation describe-stacks --query "Stacks[?StackName==\`$PARENT_STACK_NAME\`][].Outputs[?ExportName==\`KeyspacesBucketNameExport-$PARENT_STACK_NAME\`]".OutputValue --output text)

if [ -z "${KEYSPACES_GLUE_BUCKET}" ]; then
    echo "Parent stack not found. CloudFormation export not found: KeyspacesBucketNameExport-$PARENT_STACK_NAME"
    exit 1
fi

echo "Moving script to bucket ${KEYSPACES_GLUE_BUCKET}"

# The object key must match the ScriptLocation in glue-job-modify-ttl.yaml;
# the local Scala script filename is assumed to be modify-ttl-sample.scala.
aws s3api put-object --bucket $KEYSPACES_GLUE_BUCKET --key scripts/$PARENT_STACK_NAME-$STACK_NAME-modify-ttl.scala --body modify-ttl-sample.scala || exit 1

aws cloudformation create-stack --stack-name ${STACK_NAME} --parameters ParameterKey=ParentStack,ParameterValue=$PARENT_STACK_NAME ParameterKey=KeyspaceName,ParameterValue=$KEYSPACE_NAME ParameterKey=TableName,ParameterValue=$TABLE_NAME ParameterKey=TTLField,ParameterValue=$TTL_FIELD ParameterKey=TTLTimeToAdd,ParameterValue=$TTL_TIME_TO_ADD --template-body 'file://glue-job-modify-ttl.yaml' || exit 1 #--debug

echo Waiting for CloudFormation stack to complete ...
aws cloudformation wait stack-create-complete --stack-name ${STACK_NAME} || exit 1

aws cloudformation describe-stacks --stack-name $STACK_NAME --query "Stacks[0].Outputs" || exit 1
